Thai Speech Recognition, Phase One: Speaker-Independent Isolated Thai Word Recognition

สมชาย จิตะพันธ์กุล

Please use this identifier to cite or link to this item: https://cuir.car.chula.ac.th/handle/123456789/2262

Title:	Thai Speech Recognition, Phase One: Speaker-Independent Isolated Thai Word Recognition
Other Titles:	Speaker-Independent Isolated Thai Word Recognition
Authors:	สมชาย จิตะพันธ์กุล
Email:	Somchai.J@chula.ac.th
Other author:	Chulalongkorn University. Department of Electrical Engineering
Subjects:	Automatic speech recognition; Thai language -- Words and phrases
Issue Date:	2540 (Buddhist Era; 1997 CE)
Publisher:	Chulalongkorn University
Abstract:	This research aims to study and select the most efficient method for speaker-independent recognition of spoken Thai numerals from among Dynamic Time Warping (DTW), the Hidden Markov Model (HMM), and the Neural Network (NN). All three methods consist of four steps: preprocessing, feature measurement, pattern classification, and decision making. In preprocessing, the main concern is the endpoint detection sub-method; its details differ among the three main methods, but all of them are based on the signal energy level. In feature measurement, DTW computes its parameters from the Hartley transform, HMM uses order-10 linear predictive coefficients (LPC) together with vector quantization using a codebook of size 64, and NN uses order-10 LPC alone. In pattern classification, DTW uses dynamic time warping to form its patterns, HMM uses a 3-state hidden Markov model, and NN uses the backpropagation algorithm to find suitable patterns. In decision making, DTW applies the Nearest Neighbor rule to the distortion obtained by comparing a test pattern with the reference patterns; HMM uses the Viterbi algorithm, the most complex of the three decision procedures; and NN uses the simplest one, the minimum-error criterion. To test and compare the recognizers, three data sets were prepared: a training set and test sets 1 and 2. Each set contains speech recorded from male and female speakers aged 18 to 25. The training set and test set 1 were recorded from the same group of speakers but as separate recordings, whereas test set 2 was recorded from a different group of speakers. The number of samples in the three sets was 20, 20, and 20 for DTW; 45, 45, and 10 for HMM; and 30, 30, and 12 for NN. The average recognition rates on the three sets were 90.50%, 86.50%, and 79.25% for DTW with 20 reference samples; 95.30%, 89.70%, and 84.00% for HMM with 45 reference samples; and 98.20%, 84.30%, and 89.40% for NN with 30 reference samples.
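The abstract states only that every endpoint detector is energy based, that DTW derives its parameters from the Hartley transform, and that HMM and NN use order-10 LPC. The following is a minimal Python sketch of those front-end pieces under stated assumptions: the frame length (256 samples), hop (128 samples), and the rule "keep frames above 10% of the peak frame energy" are illustrative choices, not parameters taken from the report; the Hartley transform is the textbook O(N^2) definition; and the LPC uses the standard autocorrelation method with the Levinson-Durbin recursion.

```python
from math import cos, pi, sin
from typing import List, Sequence, Tuple


def frame_energies(signal: Sequence[float], frame_len: int, hop: int) -> List[float]:
    """Short-time energy (sum of squared samples) of each analysis frame."""
    return [
        sum(x * x for x in signal[start:start + frame_len])
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]


def detect_endpoints(signal: Sequence[float], frame_len: int = 256,
                     hop: int = 128, ratio: float = 0.1) -> Tuple[int, int]:
    """Return (start, end) sample indices spanning every frame whose energy
    exceeds `ratio` times the peak frame energy (an assumed threshold rule)."""
    energies = frame_energies(signal, frame_len, hop)
    if not energies:
        return 0, 0
    threshold = ratio * max(energies)
    active = [i for i, e in enumerate(energies) if e > threshold]
    if not active:
        return 0, 0  # all-silent input
    return active[0] * hop, active[-1] * hop + frame_len


def dht(x: Sequence[float]) -> List[float]:
    """Direct discrete Hartley transform: H[k] = sum_t x[t] * cas(2*pi*k*t/N)."""
    n = len(x)
    return [
        sum(x[t] * (cos(2 * pi * k * t / n) + sin(2 * pi * k * t / n)) for t in range(n))
        for k in range(n)
    ]


def autocorrelation(frame: Sequence[float], max_lag: int) -> List[float]:
    """Biased autocorrelation r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def lpc(frame: Sequence[float], order: int = 10) -> List[float]:
    """Autocorrelation-method LPC via the Levinson-Durbin recursion.
    Returns a_1..a_p for the predictor x[n] ~ sum_k a_k * x[n - k]."""
    r = autocorrelation(frame, order)
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # silent frame or exact prediction: nothing left to model
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:]
```

In a pipeline of this kind, each frame between the detected endpoints would be passed to lpc() (HMM, NN) or dht() (DTW). How the report turns Hartley coefficients into DTW parameters is not stated in the abstract, so the sketch stops at the transform itself.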
Other Abstract:	This research aims to study and select an efficient algorithm for speaker-independent Thai numeral word recognition from among Dynamic Time Warping (DTW), the Hidden Markov Model (HMM), and the Neural Network (NN). All three methods are composed of 4 steps: Preprocessing, Feature Measurement, Pattern Classification, and Decision Making. The first main consideration is the endpoint detection technique in the preprocessing step, whose details differ among the three methods, although all of them are based on energy level measurement. For the feature measurement step, DTW used the discrete Hartley transform to extract the required parameters, HMM used LPC of order 10 together with vector quantization (VQ) using a codebook of size 64 to compute its essential features, and NN also used LPC of order 10 to measure its necessary parameters. In the pattern classification step, DTW used its time warping algorithm to create patterns, a 3-state hidden Markov model was used to construct patterns in HMM, and the backpropagation algorithm was executed to form the patterns in NN. In the decision making step, the Nearest Neighbor condition was set for DTW. For HMM, this step is more complicated than in the other methods because it uses the Viterbi algorithm. The simplest decision criterion is certainly that of NN, which uses the minimum error distance. To test and compare the three methods, separate speech sets (a training set and testing sets 1 and 2) were composed of both male and female speakers aged 18 to 25. The training set and testing set 1 came from the same group of speakers but contained different data. Testing set 2 came from another group of speakers. The number of samples in each set varied between the methods: 20, 20, and 20 for DTW; 45, 45, and 10 for HMM; and 30, 30, and 12 for NN, respectively. The average recognition rates of each set were 90.50%, 86.50%, and 79.25% for DTW with 20 reference samples; 90.30%, 89.70%, and 84.00% for HMM with 45 reference samples; and 98.20%, 84.30%, and 89.40% for NN with 30 reference samples, respectively.
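Both abstracts name the decision rules: Nearest Neighbor over DTW distortion, and Viterbi scoring of 3-state hidden Markov models. The following is a minimal sketch of those two rules, assuming DTW features are tuples of floats and HMM observations are VQ codebook indices (0 to 63 for a codebook of size 64). The function names, the symmetric DTW step set with Euclidean local distance, and the log-domain scoring are assumptions for illustration; the report's own warping constraints and HMM training procedure are not described here.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dtw_distance(test: Sequence[Vector], ref: Sequence[Vector]) -> float:
    """Accumulated DTW distance with the basic step set
    (i-1, j), (i, j-1), (i-1, j-1) and Euclidean local distance."""
    n, m = len(test), len(ref)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(test[i - 1], ref[j - 1])
            acc[i][j] = local + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


def dtw_nearest_neighbor(test: Sequence[Vector],
                         references: List[Tuple[str, Sequence[Vector]]]) -> str:
    """Nearest Neighbor decision: label of the reference with the smallest DTW distance."""
    label, _ = min(references, key=lambda item: dtw_distance(test, item[1]))
    return label


def _log(p: float) -> float:
    """Log that maps zero probability to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi_score(obs: Sequence[int], pi, A, B) -> float:
    """Log-probability of the single best state path through a discrete HMM.
    pi[s]: initial prob., A[r][s]: transition prob., B[s][k]: P(symbol k | state s)."""
    n_states = len(pi)
    delta = [_log(pi[s]) + _log(B[s][obs[0]]) for s in range(n_states)]
    for symbol in obs[1:]:
        delta = [
            max(delta[r] + _log(A[r][s]) for r in range(n_states)) + _log(B[s][symbol])
            for s in range(n_states)
        ]
    return max(delta)


def hmm_decide(obs: Sequence[int], models: Dict[str, tuple]) -> str:
    """Pick the word whose (pi, A, B) model gives the highest best-path score."""
    return max(models, key=lambda word: viterbi_score(obs, *models[word]))
```

For a 3-state recognizer, each numeral would have its own trained (pi, A, B); a left-to-right transition matrix such as [[0.5, 0.5, 0], [0, 0.5, 0.5], [0, 0, 1]] is a common hypothetical choice, not one taken from the report. Working in the log domain keeps long observation sequences from underflowing.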
URI:	http://cuir.car.chula.ac.th/handle/123456789/2262
Type:	Technical Report
Appears in Collections:	Eng - Research Reports

Files in This Item:

File	Description	Size	Format
Somchai(obj).pdf		13.86 MB	Adobe PDF
