A performance comparison of parameter estimations by lasso method and the best subset selection method in linear regression analysis for high dimensional data

วรัญญา บุตรบุรี

Please use this identifier to cite or link to this item: https://cuir.car.chula.ac.th/handle/123456789/79226

Title:	การเปรียบเทียบประสิทธิภาพของการประมาณค่าของพารามิเตอร์ด้วยวิธีลาสโซและวิธีการคัดเลือกชุดข้อมูลย่อยที่ดีที่สุดในการวิเคราะห์การถดถอยเชิงเส้นสำหรับข้อมูลที่มีมิติสูง
Other Titles:	A performance comparison of parameter estimations by lasso method and the best subset selection method in linear regression analysis for high dimensional data
Authors:	วรัญญา บุตรบุรี
Advisors:	เสกสรร เกียรติสุไพบูลย์
Other author:	Chulalongkorn University. Faculty of Commerce and Accountancy
Subjects:	Parameter estimation; Regression analysis
Issue Date:	2021 (B.E. 2564)
Publisher:	Chulalongkorn University
Abstract:	This research aims to compare the performance of five parameter estimation methods for high-dimensional data: L0Learn, L0L2Learn, L1, A-L1, and A-L1L2. The methods are compared on two aspects: 1) predictive performance, measured by the prediction error (MSE), and 2) accuracy in selecting predictors into the model, assessed by precision, recall, and AUC. The high-dimensional data used in this study are simulated. Each data set consists of 100 observations (n = 100) and 1,000 predictors (p = 1000). The predictors follow a multivariate normal distribution with exponential correlation at three levels: 0, 0.5, and 0.9. The random errors depend on the signal-to-noise ratio (SNR), which has six levels: 0.1, 0.5, 1, 5, 10, and 20. For each scenario, 100 data sets are simulated, and performance is measured as the average over all 100 data sets. For predictive performance, when the SNR is low and the correlation among predictors is low to moderate, L1 performs best, followed by L0L2Learn, L0Learn, A-L1L2, and A-L1, respectively. When the SNR is higher and the predictors are more strongly correlated, A-L1 and A-L1L2 perform best, followed by L1, L0L2Learn, and L0Learn, respectively. For variable selection, in terms of average precision, L0Learn and L0L2Learn outperform the other methods. In terms of average recall, at low SNR, A-L1 and A-L1L2 perform best, followed by L0L2Learn, L1, and L0Learn, respectively; when the SNR is higher and the predictors are more strongly correlated, L1 performs best, on par with A-L1 and A-L1L2. In terms of average AUC, at low SNR and low correlation, L0L2Learn, L1, A-L1, and A-L1L2 perform similarly and better than L0Learn; when the SNR is higher and the predictors are more strongly correlated, L0L2Learn and A-L1L2 perform better than the other methods. In addition, L0Learn and L0L2Learn yield small models, which leads to high average precision and has the advantage that the models are easy to interpret. In contrast, L1, A-L1, and A-L1L2 yield larger models, which leads to high average recall but has the limitation that the models are harder to interpret.
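The simulation design described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis's code: the function name `simulate_dataset`, the number and placement of nonzero coefficients (`n_signal`), their values, and the SNR definition Var(x'β)/σ² are all assumptions, since the abstract does not state them. Exponential correlation is taken here to mean that predictors i and j have correlation rho^|i-j|.

```python
import numpy as np

def simulate_dataset(n=100, p=1000, rho=0.5, snr=1.0, n_signal=10, seed=0):
    """Simulate one high-dimensional regression data set (hypothetical sketch)."""
    rng = np.random.default_rng(seed)

    # Exponential correlation: Corr(X_i, X_j) = rho ** |i - j|.
    idx = np.arange(p)
    sigma = float(rho) ** np.abs(idx[:, None] - idx[None, :])

    # Draw multivariate normal predictors via the Cholesky factor of sigma.
    chol = np.linalg.cholesky(sigma)
    X = rng.standard_normal((n, p)) @ chol.T

    # ASSUMPTION: n_signal coefficients equal to 1 at evenly spaced positions;
    # the abstract does not specify the true coefficient vector.
    beta = np.zeros(p)
    beta[np.linspace(0, p - 1, n_signal).astype(int)] = 1.0

    # ASSUMPTION: SNR = Var(x' beta) / noise variance, so the noise variance
    # is chosen as beta' Sigma beta / snr.
    noise_sd = np.sqrt(beta @ sigma @ beta / snr)
    y = X @ beta + rng.normal(0.0, noise_sd, size=n)
    return X, y, beta

# One data set for a single (correlation, SNR) scenario.
X, y, beta = simulate_dataset(rho=0.5, snr=1.0)
print(X.shape, y.shape)  # (100, 1000) (100,)
```

Repeating this with 100 different seeds for each of the 3 × 6 combinations of correlation and SNR would mirror the replication scheme the abstract describes.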
Other Abstract:	The objective of this research is to compare the performance of parameter estimation from five sparse regression methods for high-dimensional data, namely L0Learn, L0L2Learn, L1, A-L1 and A-L1L2. The methods are compared on two criteria: 1) predictive performance, measured by the prediction mean squared error, and 2) variable selection accuracy, measured by the variable selection precision, recall, and AUC. The high-dimensional data sets in this study are simulated. Each data set contains one hundred observations (n = 100) and one thousand predictors (p = 1000). The predictors are generated from a multivariate normal distribution whose exponential correlation parameter is set at three levels: 0.0, 0.5 and 0.9. The random error terms are simulated to give six levels of signal-to-noise ratio (SNR): 0.1, 0.5, 1, 5, 10 and 20. For each pair of correlation parameter and SNR, one hundred data sets are generated, and the performance of each method is averaged over these one hundred data sets. The results show that, in terms of predictive performance, at low SNR and low correlation, L1 performs best, followed by L0L2Learn, L0Learn, A-L1L2 and A-L1, respectively. At high SNR and high correlation, A-L1 and A-L1L2 perform best, followed by L1, L0L2Learn and L0Learn, respectively. In terms of variable selection accuracy, when measured by precision, L0Learn and L0L2Learn overall perform significantly better than L1, A-L1, and A-L1L2. When measured by recall, at low SNR, A-L1 and A-L1L2 perform best, followed by L0L2Learn, L1 and L0Learn, respectively. At high SNR and high correlation, the performances of L1, A-L1 and A-L1L2 are close to one another, but significantly better than those of L0Learn and L0L2Learn. When measured by AUC, at low SNR and low correlation, the performances of L0L2Learn, L1, A-L1 and A-L1L2 are close to one another, but significantly better than that of L0Learn. At high SNR and high correlation, L0L2Learn and A-L1L2 perform significantly better than the other methods. Overall, L0Learn and L0L2Learn select a small number of variables into the model, leading to high variable selection precision. In contrast, L1, A-L1 and A-L1L2 select a larger number of variables into the model, resulting in high variable selection recall.
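The trade-off noted at the end of the abstract, where smaller models favor precision and larger models favor recall, can be illustrated with a short sketch of how variable selection precision and recall might be computed from a true and an estimated coefficient vector. The function name `selection_precision_recall`, the threshold `tol`, the toy coefficient vectors, and the convention of returning 0 when nothing is selected are assumptions for illustration, not details taken from the thesis.

```python
import numpy as np

def selection_precision_recall(beta_true, beta_hat, tol=0.0):
    """Precision and recall of the selected variables (hypothetical sketch)."""
    true_support = np.abs(np.asarray(beta_true)) > 0
    selected = np.abs(np.asarray(beta_hat)) > tol

    true_positives = np.sum(true_support & selected)
    # ASSUMPTION: define the metric as 0 when its denominator is empty.
    precision = true_positives / selected.sum() if selected.sum() > 0 else 0.0
    recall = true_positives / true_support.sum() if true_support.sum() > 0 else 0.0
    return float(precision), float(recall)

# Toy example: two true signals among five predictors.
beta_true = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
sparse_fit = np.array([0.9, 0.0, 0.0, 0.0, 0.0])  # small model
dense_fit = np.array([0.8, 0.7, 0.1, 0.2, 0.1])   # larger model

print(selection_precision_recall(beta_true, sparse_fit))  # (1.0, 0.5)
print(selection_precision_recall(beta_true, dense_fit))   # (0.4, 1.0)
```

In this toy case the small model selects only true signals but misses one of them, while the larger model recovers both signals at the cost of three false selections.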
Description:	Thesis (M.Sc.)--Chulalongkorn University, 2021 (B.E. 2564)
Degree Name:	Master of Science
Degree Level:	Master's Degree
Degree Discipline:	Statistics
URI:	http://cuir.car.chula.ac.th/handle/123456789/79226
URI:	http://doi.org/10.58837/CHULA.THE.2021.1062
metadata.dc.identifier.DOI:	10.58837/CHULA.THE.2021.1062
Type:	Thesis
Appears in Collections:	Acctn - Theses

Files in This Item:

File	Description	Size	Format
6280284826.pdf		2.65 MB	Adobe PDF