Real-time instance segmentation and point cloud extraction for Japanese food using RGB-D camera

Suthiwat Yarnchalothorn

Please use this identifier to cite or link to this item: https://cuir.car.chula.ac.th/handle/123456789/77290

Title:	Real-time instance segmentation and point cloud extraction for Japanese food using RGB-D camera
Other Titles:	Real-time pixel-level object detection and 3D coordinate extraction for Japanese food using an RGB-D camera
Authors:	Suthiwat Yarnchalothorn
Advisors:	Nattapol Damrongplasit; Hayashi Eiji
Other author:	Chulalongkorn University. Faculty of Engineering
Issue Date:	2020
Publisher:	Chulalongkorn University
Abstract:	Technological innovation plays an important role in the development of the food industry, as indicated by the growing number of food review and food delivery applications. Similarly, food production and packaging are expected to become increasingly automated through robotic systems. The shift towards food automation would help ensure quality control of food products and improve production line efficiency. One key enabler for such automated systems is the ability to detect and classify food objects with high accuracy and speed. In this study, we explore real-time food object segmentation using an RGB-D camera. Instance segmentation based on 2D RGB data is used to classify Japanese food objects at the pixel level with the Cascade Mask R-CNN and Hybrid Task Cascade (HTC) deep learning models. The models are trained on both a local GPU and a cloud service. The precision and recall values for classifying food objects under different scenarios are investigated. Furthermore, we construct 3D point clouds of food objects using depth information from the camera, which will help facilitate food automation operations such as precision grasping of food objects of various shapes and sizes. The results show that the trained HTC model has better precision than the Cascade Mask R-CNN model, albeit at a lower detection speed. The inference speed of both models decreases monotonically as the number of food objects and the resolution of the processed image increase. In addition, the accuracy of HTC detection is found to be quite sensitive to environmental factors such as background color, low brightness, and incomplete objects. The 2D segmentation result is combined with 3D point cloud extraction to realize real-time 3D segmentation of Japanese food objects at an average frame rate of 6.71 fps.
Other Abstract:	Innovation is currently driving the development of the food industry, as seen in the growing popularity of online food reviews and fast food delivery businesses. Similarly, food production and packaging processes are expected to shift widely from human labor to automation, with robots taking over. This change will allow producers to control food quality and increase the efficiency of the production process. However, one key factor in making this possible is the ability to detect and classify food in images accurately and at high speed. In this research, we study real-time food object detection using images from a depth camera. The chosen method is pixel-level object detection, using machine learning with multi-layer neural networks to detect pieces of Japanese food at the pixel level. Two models are used: Cascade Mask R-CNN and Hybrid Task Cascade. All models are trained on two platforms: a local computer and a cloud service. The resulting models are then studied under various conditions. In addition, the depth data from the camera is combined with the object detection results from the first step to extract the 3D coordinates of food objects, which can be used in automated food production, for example to accurately pick and place food pieces of varied shapes and sizes. The experimental results show that the HTC model is more accurate than the Cascade Mask R-CNN model on both training platforms, but that the HTC model has a slower detection speed. The detection speed of both models also tends to decrease as the number of objects in the image increases and as the image resolution increases. Moreover, the results show that changes in the environment, namely changing the background color, reducing the brightness, overlapping food objects, and using incomplete food items, reduce the accuracy of the HTC model. Finally, the 3D coordinates of the food objects were extracted at an average speed of 6.71 frames per second.
Description:	Thesis (M.Eng.)--Chulalongkorn University, 2020
Degree Name:	Master of Engineering
Degree Level:	Master's Degree
Degree Discipline:	Cyber-Physical System
URI:	http://cuir.car.chula.ac.th/handle/123456789/77290
URI:	http://doi.org/10.58837/CHULA.THE.2020.148
metadata.dc.identifier.DOI:	10.58837/CHULA.THE.2020.148
Type:	Thesis
Appears in Collections:	Eng - Theses
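
Note on the point cloud extraction step: the abstract states that the 2D segmentation result is combined with depth information to obtain a 3D point cloud per food object, but it does not say how the points are computed. The following is a minimal sketch of one common approach, assuming a pinhole camera model and a depth image already aligned to the RGB frame. The function name `mask_to_point_cloud`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the `depth_scale` value are hypothetical and are not taken from the thesis.

```python
import numpy as np


def mask_to_point_cloud(depth, mask, fx, fy, cx, cy, depth_scale=0.001):
    """Back-project the masked depth pixels into camera-frame 3D points.

    Assumptions (not from the thesis): pinhole camera model, depth image
    aligned to the RGB image, raw depth in integer units where
    depth_scale converts one unit to metres, and 0 meaning "no reading".
    """
    depth = np.asarray(depth, dtype=np.float64)
    mask = np.asarray(mask, dtype=bool)
    if depth.shape != mask.shape:
        raise ValueError("depth and mask must have the same height and width")

    # Keep only pixels inside the instance mask that have a valid depth.
    valid = mask & (depth > 0)
    v, u = np.nonzero(valid)  # v = row index, u = column index

    # Pinhole back-projection: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    z = depth[v, u] * depth_scale
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)  # shape (N, 3)


# Toy example: a 4x4 depth image at 1000 raw units with one missing reading.
depth = np.full((4, 4), 1000, dtype=np.uint16)
depth[0, 0] = 0  # invalid depth, dropped even though it is inside the mask

mask = np.zeros((4, 4), dtype=bool)
mask[0, 0] = True
mask[1, 1] = True
mask[2, 3] = True

points = mask_to_point_cloud(depth, mask, fx=2.0, fy=2.0, cx=1.0, cy=1.0)
print(points)
```

In a pipeline like the one the abstract describes, `mask` would be one instance mask produced by the segmentation model, so calling the function once per detected instance would yield one point cloud per food object.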

Files in This Item:

File	Description	Size	Format
6270375221.pdf		4 MB	Adobe PDF
