Y-X-Y encoding for identifying types of sentence similarity

Thanaporn Jinnovart

Please use this identifier to cite or link to this item: https://cuir.car.chula.ac.th/handle/123456789/82930

Title:	Y-X-Y encoding for identifying types of sentence similarity
Other Titles:	การเข้ารหัสวายเอ็กซ์วายสำหรับการระบุชนิดความคล้ายของประโยค
Authors:	Thanaporn Jinnovart
Advisors:	Chidchanok Lursinsap
Other author:	Chulalongkorn University. Faculty of Sciences
Issue Date:	2022
Publisher:	Chulalongkorn University
Abstract:	The task of measuring the semantic similarity of two arbitrary sentences consists of two main steps: encoding the sentences into feature vectors of equal length, and then measuring the similarity between those vectors. The quality of the encoding technique largely determines how well a model can measure similarity, because a good representation depends on how finely the spectrum of similarities is defined. The clearer the definition of similarity, the better the representations that can be constructed, which in turn helps distinguish between types of sentences. Existing methods for measuring similarity are generally designed for vectorized data in a feature space of fixed dimension. Transforming a set of variable-length sentences into feature vectors of the same dimension is therefore essential. The dataset used in this thesis provides both a relatedness score and a textual entailment label. Textual entailment distinguishes sentence-pair relations among three classes: neutral, entailment, and contradiction. The task indicates the type of entailment, which is interpreted as relatedness in this thesis. Additionally, powerful pretrained encoding models usually have millions or even billions of parameters, which makes training one's own embedding model difficult without heavy computing resources. In this thesis, we propose a self-encoding scheme to classify sentence pairs among the three classes of textual entailment. The relevancy of all words in a sentence is captured simultaneously by this self-encoding structure. Unlike encoding methods based on sequential learning, this approach does not suffer from memory loss caused by sentence length.
The framework filters out contradiction pairs at an early stage and then employs a set of y-x-y encoders, where y is the length after the two sentences are concatenated and x is the optimal encoding size for samples of length y, together with classifiers that output neutral and entailment probabilities. With over 90% accuracy for all classes, our method shows that this task can be carried out effectively without large-scale datasets or heavy computational resources.
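The y-x-y shape described in the abstract can be illustrated with a small autoencoder: y input units, a hidden code of x units, and y output units that reconstruct the input. The following is only a minimal sketch under stated assumptions, not the thesis implementation. It assumes each concatenated sentence pair has already been turned into a numeric vector of length y, uses a tanh hidden layer trained with plain gradient descent on mean squared error, and runs on random toy data. The names `init_yxy`, `encode`, `decode`, and `train_step`, and the sizes Y = 12 and X = 4, are hypothetical.

```python
import numpy as np

def init_yxy(y, x, seed=0):
    """Create weights for one y-x-y autoencoder: y inputs, x hidden units, y outputs."""
    rng = np.random.default_rng(seed)
    return {
        "W_enc": rng.normal(0.0, 1.0 / np.sqrt(y), size=(y, x)),
        "b_enc": np.zeros(x),
        "W_dec": rng.normal(0.0, 1.0 / np.sqrt(x), size=(x, y)),
        "b_dec": np.zeros(y),
    }

def encode(params, batch):
    """Map vectors of length y to codes of length x (assumed tanh activation)."""
    return np.tanh(batch @ params["W_enc"] + params["b_enc"])

def decode(params, codes):
    """Map codes of length x back to reconstructions of length y."""
    return codes @ params["W_dec"] + params["b_dec"]

def train_step(params, batch, lr=0.1):
    """One gradient-descent step on mean squared reconstruction error.

    Returns the loss measured before the weights are updated.
    """
    codes = encode(params, batch)
    err = decode(params, codes) - batch
    loss = float(np.mean(err ** 2))

    # Backpropagate through the decoder, then through the tanh encoder.
    g_out = 2.0 * err / err.size
    g_codes = g_out @ params["W_dec"].T
    g_pre = g_codes * (1.0 - codes ** 2)

    params["W_dec"] -= lr * (codes.T @ g_out)
    params["b_dec"] -= lr * g_out.sum(axis=0)
    params["W_enc"] -= lr * (batch.T @ g_pre)
    params["b_enc"] -= lr * g_pre.sum(axis=0)
    return loss

# Toy stand-in for concatenated sentence pairs of length Y (hypothetical data).
Y, X = 12, 4
rng = np.random.default_rng(1)
pairs = rng.normal(size=(64, Y))

params = init_yxy(Y, X)
losses = [train_step(params, pairs) for _ in range(500)]
codes = encode(params, pairs)  # fixed-size codes that a classifier could consume
```

Because the abstract mentions a set of encoders, one plausible arrangement is to keep a separate parameter set for each concatenated length y and pass the resulting codes to the neutral/entailment classifiers. How x is chosen for each y is not stated in this record, so it is left as a free parameter here.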
Other Abstract:	Finding the similarity between any two sentences consists of two main steps: encoding both sentences to produce feature vectors of equal length, and then measuring the similarity between the two sentences. The quality of the encoding method can determine how successful a model is at measuring the similarity between two sentences, because building a good representation depends on how finely the distinctions of similarity are defined. The more clearly similarity is defined, the better the representations that can be built, which helps in distinguishing types of similarity. In general, all existing methods for measuring similarity are designed for vectorized data in a feature space of fixed dimension. Transforming a set of sentences of different lengths into a set of feature vectors of the same dimension is therefore very important. The dataset used in this thesis provides both a numerical relatedness score and a relatedness type. There are three relatedness types: neutral, entailment, and contradiction. Classifying the relatedness type indicates the kind of similarity. In addition, high-performing encoding models are usually pretrained with millions or even billions of parameters. This is one obstacle to training an encoding model oneself, because it requires resources with massive computing capability. In this thesis, we present a self-encoding method to classify relatedness into three types. The relevance of each word in a sentence is captured simultaneously by this self-encoding structure. Furthermore, unlike other encoding models that depend on sequential learning, the proposed model is not disturbed by memory loss caused by sentence length. The framework filters contradiction sentence pairs out of the dataset at an early stage and uses a number of encoder models at a later stage. Each encoder takes the form y-x-y, where y is the length obtained by concatenating the two sentences and x is the optimal length for data of length y. A number of classifier models are also used to separate the neutral group from the entailment group, producing probabilities as output.
With accuracy above 90% for each of the three classes, the proposed method shows that relatedness classification can be performed effectively without large datasets or massive computing resources.
Description:	Thesis (M.Sc.)--Chulalongkorn University, 2022
Degree Name:	Master of Science
Degree Level:	Master's Degree
Degree Discipline:	Computer Science and Information Technology
URI:	https://cuir.car.chula.ac.th/handle/123456789/82930
URI:	http://doi.org/10.58837/CHULA.THE.2022.111
metadata.dc.identifier.DOI:	10.58837/CHULA.THE.2022.111
Type:	Thesis
Appears in Collections:	Sci - Theses

Files in This Item:

File	Description	Size	Format
6278010123.pdf		1.95 MB	Adobe PDF
