A comparison of imbalanced data handling methods for pre-trained model in multi-label classification of stack overflow

Arisa Umparat

Please use this identifier to cite or link to this item: https://cuir.car.chula.ac.th/handle/123456789/82736

Title:	A comparison of imbalanced data handling methods for pre-trained model in multi-label classification of stack overflow
Other Titles:	การเปรียบเทียบวิธีการจัดการข้อมูลที่ไม่สมดุลสำหรับแบบจำลองที่ได้รับการฝึกฝนแล้วสำหรับวิธีการจำแนกประเภทแบบหลายลาเบลในสแต็กโอเวอร์โฟลว์
Authors:	Arisa Umparat
Advisors:	Suronapee Phoomvuthisarn
Other author:	Chulalongkorn University. Faculty of Commerce and Accountancy
Issue Date:	2022
Publisher:	Chulalongkorn University
Abstract:	Tag classification is essential in Stack Overflow. Instead of combing through pages or replies of irrelevant information, users can quickly pinpoint relevant posts and answers using tags. Because user-submitted posts can have multiple tags, classifying tags in Stack Overflow is challenging, and the label distribution becomes imbalanced across the whole labelset. Pre-trained deep learning models with small datasets can improve tag classification accuracy. Common multi-label resampling techniques combined with machine learning classifiers can also mitigate this imbalance. Still, few studies have explored which resampling technique can improve the performance of pre-trained deep models for predicting tags. To address this gap, we ran experiments to evaluate the effectiveness of ELECTRA, a powerful pre-trained deep learning model, combined with various multi-label resampling techniques in reducing the imbalance that induces mislabeling of Stack Overflow posts. We compared six resampling techniques, namely ML-ROS, MLSMOTE, MLeNN, MLTL, ML-SOL, and REMEDIAL, to find the best method to mitigate the imbalance and improve tag prediction accuracy. Our results show that MLTL is the most effective choice for tackling imbalance in multi-label classification of our Stack Overflow data in a deep learning setting. MLTL achieved 0.517, 0.804, 0.467, and 0.98 on Precision@1, Recall@5, F1-score@1, and AUC, respectively. In contrast, MLeNN achieved only 0.323, 0.648, 0.277, and 0.95 on the same metrics.
Other Abstract:	Tag classification is important in Stack Overflow. Besides helping users search for information, it also helps surface relevant solutions more efficiently. Because a question in a post can have multiple tags, classifying tags in Stack Overflow is challenging, which leads to an imbalance problem between individual tags and the full set of tags. We therefore experimented with a pre-trained deep learning model together with a small dataset to improve the accuracy of tag classification or prediction, using resampling techniques designed specifically for multi-label classification. In general, machine learning techniques alone can also address this problem, but only a few studies have tested which resampling technique can improve the performance of deep models that use pre-trained models for tag prediction. To address this limitation, we ran experiments to evaluate the performance of ELECTRA, a powerful pre-trained deep learning model, augmented with multi-label resampling techniques to reduce the data imbalance that causes mislabeling in Stack Overflow posts. We compared six resampling techniques, namely ML-ROS, MLSMOTE, MLeNN, MLTL, ML-SOL, and REMEDIAL, to find the best way to reduce data imbalance while improving tag prediction accuracy. Our results show that MLTL is the most effective option for handling imbalance in multi-label classification of Stack Overflow data with deep learning. MLTL achieved 0.517, 0.804, 0.467, and 0.98 on Precision@1, Recall@5, F1-score@1, and AUC, respectively, whereas MLeNN achieved only 0.323, 0.648, 0.277, and 0.95 on the same metrics.
Description:	Thesis (M.Sc.)--Chulalongkorn University, 2022
Degree Name:	Master of Science
Degree Level:	Master's Degree
Degree Discipline:	Statistics
URI:	https://cuir.car.chula.ac.th/handle/123456789/82736
URI:	http://doi.org/10.58837/CHULA.THE.2022.338
DOI:	10.58837/CHULA.THE.2022.338
Type:	Thesis
Appears in Collections:	Acctn - Theses
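
The ranking metrics reported in the abstract (Precision@1, Recall@5) can be illustrated with a minimal Python sketch. It assumes the common definitions, where Precision@k is the fraction of the top-k predicted tags that are correct and Recall@k is the fraction of a post's true tags found among the top k; the thesis may use a variant (for example, dividing by min(k, number of true tags)). The function names, example tags, and scores below are hypothetical and are not taken from the thesis.

```python
from typing import Dict, List, Set


def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k tags with the highest predicted score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def precision_at_k(scores: Dict[str, float], true_tags: Set[str], k: int) -> float:
    """Fraction of the top-k predicted tags that are true tags (assumed definition)."""
    hits = sum(1 for tag in top_k(scores, k) if tag in true_tags)
    return hits / k


def recall_at_k(scores: Dict[str, float], true_tags: Set[str], k: int) -> float:
    """Fraction of the true tags that appear in the top-k predictions (assumed definition)."""
    if not true_tags:
        return 0.0
    hits = sum(1 for tag in top_k(scores, k) if tag in true_tags)
    return hits / len(true_tags)


# Hypothetical model scores for one post, plus its hypothetical true tags.
scores = {"python": 0.9, "pandas": 0.7, "numpy": 0.6, "java": 0.2, "regex": 0.1}
true_tags = {"python", "numpy"}

p_at_1 = precision_at_k(scores, true_tags, 1)
r_at_5 = recall_at_k(scores, true_tags, 5)
print(f"Precision@1 = {p_at_1:.2f}, Recall@5 = {r_at_5:.2f}")
```

In practice these per-post values would be averaged over all test posts, and an F1-score@k would be derived from the precision and recall at the same k.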

Files in This Item:

File	Description	Size	Format
6480507026.pdf		2.76 MB	Adobe PDF