Classification of abusive Thai messages in social networks using deep learning

Ruangsung Wanasukapunt

Please use this identifier to cite or link to this item: https://cuir.car.chula.ac.th/handle/123456789/79896

Title:	Classification of abusive Thai messages in social networks using deep learning
Other Titles:	การจำแนกข้อความไทยที่ใช้ไม่เหมาะสมในเครือข่ายสังคมโดยใช้การเรียนรู้เชิงลึก
Authors:	Ruangsung Wanasukapunt
Advisors:	Suphakant Phimoltares
Other author:	Chulalongkorn University. Faculty of Science
Issue Date:	2021
Publisher:	Chulalongkorn University
Abstract:	Social media has improved on traditional news sources by widening access to information. However, the anonymity it provides lets individuals with malicious intentions post abusive and hateful speech without detection or repercussion. This research develops a binomial and a multinomial classification model for Thai social media text, covering five content categories for abusive content detection: Rude, Figurative, Dirty, Offensive, and Non-Abusive. In the experiments, DistilBERT achieved the highest F1 scores: 0.8510 for the binomial model and 0.9067 for the multinomial model. BiLSTM performed second best, with F1 scores of 0.8403 and 0.8969 for the binomial and multinomial models, respectively. Both deep learning models outperformed the traditional machine learning classifiers, whose highest F1 scores were 0.7452 and 0.8090 for the binomial and multinomial models, respectively. The deep learning architectures allow for better contextual representations of words, with DistilBERT in particular enabling better modeling of long-range dependencies between words.
Other Abstract:	สื่อสังคมมีการปรับปรุงแหล่งข่าวแบบดั้งเดิมโดยอนุญาตให้มีการเข้าถึงข่าวสารเพิ่มขึ้น อย่างไรก็ตามการยอมไม่ให้เปิดเผยชื่อในสื่อสังคมก่อให้เกิดข้อความที่ใช้ไม่เหมาะสมและมีเจตนาร้ายโดยปราศจากการตรวจหาหรือผลที่ตามมาจากบุคคลด้วยความตั้งใจมุ่งร้าย งานวิจัยนี้พัฒนาตัวแบบการจำแนกแบบทวินามและอเนกนามสำหรับจำแนกข้อความบนสื่อสังคมไทยออกเป็นห้าประเภทสำหรับการตรวจหาเนื้อหาที่ไม่เหมาะสมในสื่อสังคม อันได้แก่ข้อความหยาบคาย ข้อความอุปมาอุปไมย ข้อความลามก ข้อความก้าวร้าว และข้อความที่ใช้ได้เหมาะสม การทดลองได้แสดงให้เห็นว่าดิสทิลเบิร์ทได้ให้คะแนนเอฟวันสูงสุดที่ 0.8510 สำหรับตัวแบบทวินามและ 0.9067 สำหรับตัวแบบอเนกนาม แอลเอสทีเอ็มแบบสองทิศทางได้ให้ผลดีที่สุดเป็นอันดับสองด้วยคะแนนเอฟวัน 0.8403 และ 0.8969 สำหรับตัวแบบทวินามและอเนกนามตามลำดับ ตัวแบบการเรียนรู้เชิงลึกทั้งสองได้ผลที่ดีกว่าตัวแบบการเรียนรู้ของเครื่องแบบดั้งเดิมที่มีคะแนนเอฟวันสูงสุดอยู่ที่ 0.7452 และ 0.8090 สำหรับตัวแบบทวินามและอเนกนามตามลำดับ สถาปัตยกรรมการเรียนรู้เชิงลึกได้ยอมให้การแทนเชิงบริบทของกลุ่มคำดีขึ้น โดยดิสทิลเบิร์ทได้ทำให้การสร้างตัวแบบของความเกี่ยวข้องกันระหว่างกลุ่มคำในช่วงที่ยาวดีขึ้น
Description:	Thesis (M.Sc.)--Chulalongkorn University, 2021
Degree Name:	Master of Science
Degree Level:	Master's Degree
Degree Discipline:	Computer Science and Information Technology
URI:	http://cuir.car.chula.ac.th/handle/123456789/79896
URI:	http://doi.org/10.58837/CHULA.THE.2021.116
DOI:	10.58837/CHULA.THE.2021.116
Type:	Thesis
Appears in Collections:	Sci - Theses

Files in This Item:

File	Description	Size	Format
6172627123.pdf		2.11 MB	Adobe PDF
