Recognizing Indonesian words based on visual cues of lip movement using deep learning
Griffani Megiyanto Rahmatullah, Shanq-Jang Ruan, Lieber Po-Hung Li
Measurement, Vol. 250, Article 116968 (journal article, published 2025-03-01)
DOI: 10.1016/j.measurement.2025.116968
URL: https://www.sciencedirect.com/science/article/pii/S0263224125003276
Abstract
Lipreading is one of the techniques that can enhance speech perception. However, few lipreading studies have focused on low-resource languages such as Indonesian. In this study, we introduce the Lipreading Information Resource Assembler-Generator (LIRA-Gen), an instrument that generates lipreading datasets from CC BY-licensed videos available on YouTube. Using this instrument, we present the first Indonesian-language lipreading dataset (IDLRW), which contains over 48,000 videos covering 100 word categories spoken by many different speakers under natural conditions. We also developed a deep learning architecture consisting of an Advanced Residual Network (ARN) that combines ResNet-34 with a Channel Spatial Attention (CSA) module, improved sequence modeling that fuses a Bi-GRU with Mamba (BGM), an integrated word decision module, and fine-tuned hyperparameters. Our measurements show that the model reaches an accuracy of 60.51% on the IDLRW dataset and outperforms state-of-the-art lipreading models from other datasets, even without an additional learning strategy.
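The abstract names a Channel Spatial Attention (CSA) module inside the ResNet-34 backbone but does not specify its internals. As an illustration only, the sketch below assumes a CBAM-style design: channel attention computed from average- and max-pooled channel descriptors passed through a shared bottleneck MLP, followed by spatial attention computed by a small convolution over the channel-wise mean and max maps. The function names, the weight shapes (`w1`, `w2`, `kernel`), and the channel-then-spatial ordering are assumptions for this sketch, not the authors' implementation; it uses plain Python lists so it runs without any deep learning framework.

```python
import math
import random


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def channel_attention(fmap, w1, w2):
    """Reweight each channel of a C x H x W feature map (nested lists).

    w1 (R x C) and w2 (C x R) form a shared bottleneck MLP applied to both the
    average-pooled and max-pooled channel descriptors (CBAM-style assumption).
    """
    num_c = len(fmap)
    avg = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]
    mx = [max(max(row) for row in ch) for ch in fmap]

    def mlp(v):
        # Hidden layer with ReLU, then projection back to C values.
        hidden = [max(0.0, sum(w1[r][c] * v[c] for c in range(num_c))) for r in range(len(w1))]
        return [sum(w2[c][r] * hidden[r] for r in range(len(hidden))) for c in range(num_c)]

    a, m = mlp(avg), mlp(mx)
    weights = [sigmoid(a[c] + m[c]) for c in range(num_c)]
    scaled = [[[x * weights[c] for x in row] for row in fmap[c]] for c in range(num_c)]
    return scaled, weights


def spatial_attention(fmap, kernel):
    """Reweight each spatial position with a k x k conv over [mean, max] maps.

    kernel has shape 2 x k x k (k odd); positions outside the map count as zero.
    """
    num_c, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    mean_map = [[sum(fmap[c][i][j] for c in range(num_c)) / num_c for j in range(w)] for i in range(h)]
    max_map = [[max(fmap[c][i][j] for c in range(num_c)) for j in range(w)] for i in range(h)]
    pooled = (mean_map, max_map)
    k = len(kernel[0])
    pad = k // 2
    mask = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            s = 0.0
            for p in range(2):
                for di in range(k):
                    for dj in range(k):
                        y, x = i + di - pad, j + dj - pad
                        if 0 <= y < h and 0 <= x < w:
                            s += kernel[p][di][dj] * pooled[p][y][x]
            mask[i][j] = sigmoid(s)
    scaled = [[[fmap[c][i][j] * mask[i][j] for j in range(w)] for i in range(h)] for c in range(num_c)]
    return scaled, mask


def channel_spatial_attention(fmap, w1, w2, kernel):
    """Channel attention followed by spatial attention (ordering assumed)."""
    out, channel_weights = channel_attention(fmap, w1, w2)
    out, spatial_mask = spatial_attention(out, kernel)
    return out, channel_weights, spatial_mask


if __name__ == "__main__":
    random.seed(0)
    C, H, W, R, K = 4, 5, 5, 2, 3  # toy sizes, not the paper's configuration
    fmap = [[[random.gauss(0, 1) for _ in range(W)] for _ in range(H)] for _ in range(C)]
    w1 = [[random.gauss(0, 0.5) for _ in range(C)] for _ in range(R)]
    w2 = [[random.gauss(0, 0.5) for _ in range(R)] for _ in range(C)]
    kernel = [[[random.gauss(0, 0.2) for _ in range(K)] for _ in range(K)] for _ in range(2)]
    out, cw, mask = channel_spatial_attention(fmap, w1, w2, kernel)
    print("channel weights:", [round(v, 3) for v in cw])
```

Random weights stand in for learned parameters here; in a trained network, `w1`, `w2`, and `kernel` would presumably be learned jointly with the backbone. Because every attention weight lies in (0, 1), the module can only attenuate features, letting the network emphasize informative channels and lip-region positions relative to the rest.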
About the journal:
Contributions are invited on novel achievements in all fields of measurement and instrumentation science and technology. Authors are encouraged to submit novel material, whose ultimate goal is an advancement in the state of the art of: measurement and metrology fundamentals, sensors, measurement instruments, measurement and estimation techniques, measurement data processing and fusion algorithms, evaluation procedures and methodologies for plants and industrial processes, performance analysis of systems, processes and algorithms, mathematical models for measurement-oriented purposes, distributed measurement systems in a connected world.