Survey on Silentinterpreter: Analysis of Lip Movement and Extracting Speech using Deep Learning
Ameen Hafeez, Rohith M K, Sakshi Prashant, Sinchana Hegde, Prof. Shwetha K S
{"title":"关于 Silentinterpreter 的调查:利用深度学习分析嘴唇运动并提取语音","authors":"Ameen Hafeez, Rohith M K, Sakshi Prashant, Sinchana Hegde, Prof. Shwetha K S","doi":"10.32628/ijsrset2411219","DOIUrl":null,"url":null,"abstract":"Lip reading is a complex but interesting path for the growth of speech recognition algorithms. It is the ability of deciphering spoken words by evaluating visual cues from lip movements. In this study, we suggest a unique method for lip reading that converts lip motions into textual representations by using deep neural networks. Convolutional neural networks are used in the methodology to extract visual features, recurrent neural networks are used to simulate temporal context, and the Connectionist Temporal Classification loss function is used to align lip features with corresponding phonemes.\nThe study starts with a thorough investigation of data loading methods, which include alignment extraction and video preparation. A well selected dataset with video clips and matching phonetic alignments is presented. We select relevant face regions, convert frames to grayscale, then standardize the resulting data so that it can be fed into a neural network.\nThe neural network architecture is presented in depth, displaying a series of bidirectional LSTM layers for temporal context understanding after 3D convolutional layers for spatial feature extraction. Careful consideration of input shapes, layer combinations, and parameter selections forms the foundation of the model's design. To train the model, we align predicted phoneme sequences with ground truth alignments using the CTC loss.\nDynamic learning rate scheduling and a unique callback mechanism for training visualization of predictions are integrated into the training process. After training on a sizable dataset, the model exhibits remarkable convergence and proves its capacity to understand intricate temporal correlations.\nThrough the use of both quantitative and qualitative evaluations, the results are thoroughly assessed. We visually check the model's lip reading abilities and assess its performance using common speech recognition criteria. It is explored how different model topologies and hyperparameters affect performance, offering guidance for future research.\nThe trained model is tested on external video samples to show off its practical application. Its accuracy and resilience in lip-reading spoken phrases are demonstrated.\nBy providing a deep learning framework for precise and effective speech recognition, this research adds to the rapidly changing field of lip reading devices. The results offer opportunities for additional development and implementation in various fields, such as assistive technologies, audio-visual communication systems, and human-computer interaction.","PeriodicalId":14228,"journal":{"name":"International Journal of Scientific Research in Science, Engineering and Technology","volume":"14 6","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Survey on Silentinterpreter : Analysis of Lip Movement and Extracting Speech using Deep Learning\",\"authors\":\"Ameen Hafeez, Rohith M K, Sakshi Prashant, Sinchana Hegde, Prof. Shwetha K S\",\"doi\":\"10.32628/ijsrset2411219\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Lip reading is a complex but interesting path for the growth of speech recognition algorithms. 
It is the ability of deciphering spoken words by evaluating visual cues from lip movements. In this study, we suggest a unique method for lip reading that converts lip motions into textual representations by using deep neural networks. Convolutional neural networks are used in the methodology to extract visual features, recurrent neural networks are used to simulate temporal context, and the Connectionist Temporal Classification loss function is used to align lip features with corresponding phonemes.\\nThe study starts with a thorough investigation of data loading methods, which include alignment extraction and video preparation. A well selected dataset with video clips and matching phonetic alignments is presented. We select relevant face regions, convert frames to grayscale, then standardize the resulting data so that it can be fed into a neural network.\\nThe neural network architecture is presented in depth, displaying a series of bidirectional LSTM layers for temporal context understanding after 3D convolutional layers for spatial feature extraction. Careful consideration of input shapes, layer combinations, and parameter selections forms the foundation of the model's design. To train the model, we align predicted phoneme sequences with ground truth alignments using the CTC loss.\\nDynamic learning rate scheduling and a unique callback mechanism for training visualization of predictions are integrated into the training process. After training on a sizable dataset, the model exhibits remarkable convergence and proves its capacity to understand intricate temporal correlations.\\nThrough the use of both quantitative and qualitative evaluations, the results are thoroughly assessed. We visually check the model's lip reading abilities and assess its performance using common speech recognition criteria. It is explored how different model topologies and hyperparameters affect performance, offering guidance for future research.\\nThe trained model is tested on external video samples to show off its practical application. Its accuracy and resilience in lip-reading spoken phrases are demonstrated.\\nBy providing a deep learning framework for precise and effective speech recognition, this research adds to the rapidly changing field of lip reading devices. 
The results offer opportunities for additional development and implementation in various fields, such as assistive technologies, audio-visual communication systems, and human-computer interaction.\",\"PeriodicalId\":14228,\"journal\":{\"name\":\"International Journal of Scientific Research in Science, Engineering and Technology\",\"volume\":\"14 6\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-04-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Journal of Scientific Research in Science, Engineering and Technology\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.32628/ijsrset2411219\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Scientific Research in Science, Engineering and Technology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.32628/ijsrset2411219","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Lip reading is a complex but promising avenue for advancing speech recognition: the ability to decipher spoken words by evaluating visual cues from lip movements. In this study, we propose a method for lip reading that converts lip motions into textual representations using deep neural networks. The methodology uses convolutional neural networks to extract visual features, recurrent neural networks to model temporal context, and the Connectionist Temporal Classification (CTC) loss function to align lip features with the corresponding phonemes.
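The abstract does not include code, so as a minimal sketch of the key alignment idea, the following computes a CTC loss with TensorFlow's tf.nn.ctc_loss; the frame count, vocabulary size, blank index, and target length are illustrative assumptions, not values from the paper.

```python
import tensorflow as tf

# Illustrative sizes only (not from the paper): 75 video frames, a
# 40-symbol vocabulary with index 39 as the CTC blank, 20 target labels.
batch, frames, vocab, target_len = 2, 75, 40, 20

logits = tf.random.normal([batch, frames, vocab])  # stand-in for network outputs
labels = tf.random.uniform([batch, target_len], 0, vocab - 1, dtype=tf.int32)

# CTC marginalizes over every monotonic frame-to-label alignment, so the
# target sequence can be far shorter than the frame sequence.
loss = tf.nn.ctc_loss(
    labels=labels,
    logits=logits,
    label_length=tf.fill([batch], target_len),
    logit_length=tf.fill([batch], frames),
    logits_time_major=False,
    blank_index=vocab - 1,
)
print(loss.shape)  # (2,): one loss value per clip in the batch
```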
The study begins with a thorough treatment of data loading, covering alignment extraction and video preparation. A carefully selected dataset of video clips with matching phonetic alignments is presented. We crop the relevant face region, convert frames to grayscale, and standardize the resulting data so that it can be fed into a neural network.
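As a concrete illustration of this preprocessing pipeline, here is a short OpenCV/NumPy sketch; the crop coordinates and output resolution are placeholders, since the paper selects face regions but does not publish exact values.

```python
import cv2
import numpy as np

def preprocess_video(path, roi=(190, 236, 80, 220), size=(140, 46)):
    """Crop a fixed mouth region from each frame, convert to grayscale,
    and standardize. The ROI and output size here are assumptions."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        y0, y1, x0, x1 = roi
        frames.append(cv2.resize(gray[y0:y1, x0:x1], size))
    cap.release()
    clip = np.stack(frames).astype(np.float32)
    return (clip - clip.mean()) / (clip.std() + 1e-8)  # zero mean, unit variance
```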
The neural network architecture is presented in depth: 3D convolutional layers for spatial feature extraction are followed by a stack of bidirectional LSTM layers for temporal context modeling. Careful consideration of input shapes, layer combinations, and parameter selections forms the foundation of the model's design. To train the model, we align predicted phoneme sequences with ground-truth alignments using the CTC loss.
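A minimal Keras sketch of an architecture in this spirit follows: 3D convolutions feeding bidirectional LSTMs, with a per-frame softmax suitable for CTC training as in the earlier loss sketch. All layer sizes are assumptions, not the authors' exact configuration.

```python
from tensorflow.keras import layers, models

def build_model(frames=75, h=46, w=140, vocab=40):
    # Input: a clip of grayscale mouth crops, shape (time, height, width, 1).
    inp = layers.Input(shape=(frames, h, w, 1))
    x = layers.Conv3D(32, 3, padding="same", activation="relu")(inp)
    x = layers.MaxPool3D(pool_size=(1, 2, 2))(x)  # pool space, preserve time
    x = layers.Conv3D(64, 3, padding="same", activation="relu")(x)
    x = layers.MaxPool3D(pool_size=(1, 2, 2))(x)
    # Flatten the spatial dimensions so each timestep becomes a feature vector.
    x = layers.TimeDistributed(layers.Flatten())(x)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    # Per-frame distribution over phoneme/character classes, ready for CTC.
    out = layers.Dense(vocab, activation="softmax")(x)
    return models.Model(inp, out)

model = build_model()
model.summary()
```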
Dynamic learning rate scheduling and a custom callback that visualizes predictions during training are integrated into the training process. After training on a sizable dataset, the model converges reliably and demonstrates its capacity to capture intricate temporal correlations.
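The abstract does not specify the schedule or the callback's internals; the sketch below pairs Keras's LearningRateScheduler with a hypothetical callback that greedily decodes one validation batch per epoch so progress can be inspected qualitatively.

```python
import math
import tensorflow as tf

# Hypothetical schedule: hold the learning rate for 30 epochs, then decay
# exponentially; the decay rule is our assumption.
def schedule(epoch, lr):
    return lr if epoch < 30 else lr * math.exp(-0.1)

class SamplePrediction(tf.keras.callbacks.Callback):
    """Greedy-decode one validation batch after each epoch, a stand-in for
    the paper's prediction-visualization callback."""
    def __init__(self, dataset):
        super().__init__()
        self.dataset = dataset

    def on_epoch_end(self, epoch, logs=None):
        clips, _ = next(iter(self.dataset))
        probs = self.model.predict(clips, verbose=0)
        decoded, _ = tf.keras.backend.ctc_decode(
            probs, input_length=[probs.shape[1]] * probs.shape[0])
        print(f"epoch {epoch}: sample prediction {decoded[0][0].numpy()}")

# Usage: model.fit(train_ds, callbacks=[
#     tf.keras.callbacks.LearningRateScheduler(schedule), SamplePrediction(val_ds)])
```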
The results are assessed thoroughly through both quantitative and qualitative evaluations. We inspect the model's lip-reading output visually and measure its performance with standard speech recognition metrics. We also explore how different model topologies and hyperparameters affect performance, offering guidance for future research.
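The abstract names standard speech recognition metrics without listing them; word error rate (WER) and character error rate (CER) are the usual choices, so here is a self-contained sketch of both built on Levenshtein edit distance (the example strings are ours).

```python
def edit_distance(ref, hyp):
    # Levenshtein distance over token sequences, by dynamic programming.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def wer(ref, hyp):  # word error rate: word-level edits / reference word count
    return edit_distance(ref.split(), hyp.split()) / max(len(ref.split()), 1)

def cer(ref, hyp):  # character error rate: character-level edits / reference length
    return edit_distance(list(ref), list(hyp)) / max(len(ref), 1)

print(wer("set the lamp on now", "set the lamp on now"))  # 0.0
print(cer("place blue", "placed blue"))                   # 0.1 (one insertion)
```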
The trained model is tested on external video samples to demonstrate its practical applicability, showing its accuracy and robustness when lip-reading spoken phrases.
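A hypothetical end-to-end inference call on an external clip, reusing preprocess_video and build_model from the sketches above (the function names and file path are ours, not the paper's):

```python
import numpy as np
import tensorflow as tf

# Assumes the external clip matches the frame count the model was built for.
clip = preprocess_video("external_sample.mp4")  # (frames, 46, 140)
batch = clip[np.newaxis, ..., np.newaxis]       # -> (1, frames, 46, 140, 1)
probs = model.predict(batch, verbose=0)         # (1, frames, vocab)
decoded, _ = tf.keras.backend.ctc_decode(probs, input_length=[probs.shape[1]])
print(decoded[0][0].numpy())  # label indices; map through the vocabulary for text
```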
By providing a deep learning framework for precise and efficient speech recognition, this research contributes to the rapidly evolving field of lip-reading systems. The results open opportunities for further development and deployment in fields such as assistive technologies, audio-visual communication systems, and human-computer interaction.