Survey on Silentinterpreter: Analysis of Lip Movement and Extracting Speech using Deep Learning
Ameen Hafeez, Rohith M K, Sakshi Prashant, Sinchana Hegde, Prof. Shwetha K S
{"title":"关于 Silentinterpreter 的调查:利用深度学习分析嘴唇运动并提取语音","authors":"Ameen Hafeez, Rohith M K, Sakshi Prashant, Sinchana Hegde, Prof. Shwetha K S","doi":"10.32628/ijsrset2411219","DOIUrl":null,"url":null,"abstract":"Lip reading is a complex but interesting path for the growth of speech recognition algorithms. It is the ability of deciphering spoken words by evaluating visual cues from lip movements. In this study, we suggest a unique method for lip reading that converts lip motions into textual representations by using deep neural networks. Convolutional neural networks are used in the methodology to extract visual features, recurrent neural networks are used to simulate temporal context, and the Connectionist Temporal Classification loss function is used to align lip features with corresponding phonemes.\nThe study starts with a thorough investigation of data loading methods, which include alignment extraction and video preparation. A well selected dataset with video clips and matching phonetic alignments is presented. We select relevant face regions, convert frames to grayscale, then standardize the resulting data so that it can be fed into a neural network.\nThe neural network architecture is presented in depth, displaying a series of bidirectional LSTM layers for temporal context understanding after 3D convolutional layers for spatial feature extraction. Careful consideration of input shapes, layer combinations, and parameter selections forms the foundation of the model's design. To train the model, we align predicted phoneme sequences with ground truth alignments using the CTC loss.\nDynamic learning rate scheduling and a unique callback mechanism for training visualization of predictions are integrated into the training process. After training on a sizable dataset, the model exhibits remarkable convergence and proves its capacity to understand intricate temporal correlations.\nThrough the use of both quantitative and qualitative evaluations, the results are thoroughly assessed. We visually check the model's lip reading abilities and assess its performance using common speech recognition criteria. It is explored how different model topologies and hyperparameters affect performance, offering guidance for future research.\nThe trained model is tested on external video samples to show off its practical application. Its accuracy and resilience in lip-reading spoken phrases are demonstrated.\nBy providing a deep learning framework for precise and effective speech recognition, this research adds to the rapidly changing field of lip reading devices. The results offer opportunities for additional development and implementation in various fields, such as assistive technologies, audio-visual communication systems, and human-computer interaction.","PeriodicalId":14228,"journal":{"name":"International Journal of Scientific Research in Science, Engineering and Technology","volume":"14 6","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Survey on Silentinterpreter : Analysis of Lip Movement and Extracting Speech using Deep Learning\",\"authors\":\"Ameen Hafeez, Rohith M K, Sakshi Prashant, Sinchana Hegde, Prof. Shwetha K S\",\"doi\":\"10.32628/ijsrset2411219\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Lip reading is a complex but interesting path for the growth of speech recognition algorithms. 
It is the ability of deciphering spoken words by evaluating visual cues from lip movements. In this study, we suggest a unique method for lip reading that converts lip motions into textual representations by using deep neural networks. Convolutional neural networks are used in the methodology to extract visual features, recurrent neural networks are used to simulate temporal context, and the Connectionist Temporal Classification loss function is used to align lip features with corresponding phonemes.\\nThe study starts with a thorough investigation of data loading methods, which include alignment extraction and video preparation. A well selected dataset with video clips and matching phonetic alignments is presented. We select relevant face regions, convert frames to grayscale, then standardize the resulting data so that it can be fed into a neural network.\\nThe neural network architecture is presented in depth, displaying a series of bidirectional LSTM layers for temporal context understanding after 3D convolutional layers for spatial feature extraction. Careful consideration of input shapes, layer combinations, and parameter selections forms the foundation of the model's design. To train the model, we align predicted phoneme sequences with ground truth alignments using the CTC loss.\\nDynamic learning rate scheduling and a unique callback mechanism for training visualization of predictions are integrated into the training process. After training on a sizable dataset, the model exhibits remarkable convergence and proves its capacity to understand intricate temporal correlations.\\nThrough the use of both quantitative and qualitative evaluations, the results are thoroughly assessed. We visually check the model's lip reading abilities and assess its performance using common speech recognition criteria. It is explored how different model topologies and hyperparameters affect performance, offering guidance for future research.\\nThe trained model is tested on external video samples to show off its practical application. Its accuracy and resilience in lip-reading spoken phrases are demonstrated.\\nBy providing a deep learning framework for precise and effective speech recognition, this research adds to the rapidly changing field of lip reading devices. 
The results offer opportunities for additional development and implementation in various fields, such as assistive technologies, audio-visual communication systems, and human-computer interaction.\",\"PeriodicalId\":14228,\"journal\":{\"name\":\"International Journal of Scientific Research in Science, Engineering and Technology\",\"volume\":\"14 6\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-04-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Journal of Scientific Research in Science, Engineering and Technology\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.32628/ijsrset2411219\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Scientific Research in Science, Engineering and Technology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.32628/ijsrset2411219","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Lip reading is a complex but promising avenue for advancing speech recognition: the ability to decipher spoken words by evaluating visual cues from lip movements. In this study, we propose a method for lip reading that converts lip motions into textual representations using deep neural networks. The methodology uses convolutional neural networks to extract visual features, recurrent neural networks to model temporal context, and the Connectionist Temporal Classification (CTC) loss function to align lip features with the corresponding phonemes.
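The abstract does not include code, so as a minimal sketch of the key alignment idea, the following computes a CTC loss with TensorFlow's tf.nn.ctc_loss; the frame count, vocabulary size, blank index, and target length are illustrative assumptions, not values from the paper.

```python
import tensorflow as tf

# Illustrative sizes only (not from the paper): 75 video frames, a
# 40-symbol vocabulary with index 39 as the CTC blank, 20 target labels.
batch, frames, vocab, target_len = 2, 75, 40, 20

logits = tf.random.normal([batch, frames, vocab])  # stand-in for network outputs
labels = tf.random.uniform([batch, target_len], 0, vocab - 1, dtype=tf.int32)

# CTC marginalizes over every monotonic frame-to-label alignment, so the
# target sequence can be far shorter than the frame sequence.
loss = tf.nn.ctc_loss(
    labels=labels,
    logits=logits,
    label_length=tf.fill([batch], target_len),
    logit_length=tf.fill([batch], frames),
    logits_time_major=False,
    blank_index=vocab - 1,
)
print(loss.shape)  # (2,): one loss value per clip in the batch
```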
The study begins with a thorough treatment of data loading, covering alignment extraction and video preparation. A carefully selected dataset of video clips with matching phonetic alignments is presented. We crop the relevant face region, convert frames to grayscale, and standardize the resulting data so that it can be fed into a neural network.
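As a concrete illustration of this preprocessing pipeline, here is a short OpenCV/NumPy sketch; the crop coordinates and output resolution are placeholders, since the paper selects face regions but does not publish exact values.

```python
import cv2
import numpy as np

def preprocess_video(path, roi=(190, 236, 80, 220), size=(140, 46)):
    """Crop a fixed mouth region from each frame, convert to grayscale,
    and standardize. The ROI and output size here are assumptions."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        y0, y1, x0, x1 = roi
        frames.append(cv2.resize(gray[y0:y1, x0:x1], size))
    cap.release()
    clip = np.stack(frames).astype(np.float32)
    return (clip - clip.mean()) / (clip.std() + 1e-8)  # zero mean, unit variance
```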
The neural network architecture is presented in depth: 3D convolutional layers for spatial feature extraction are followed by a stack of bidirectional LSTM layers for temporal context modeling. Careful consideration of input shapes, layer combinations, and parameter selections forms the foundation of the model's design. To train the model, we align predicted phoneme sequences with ground-truth alignments using the CTC loss.
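A minimal Keras sketch of an architecture in this spirit follows: 3D convolutions feeding bidirectional LSTMs, with a per-frame softmax suitable for CTC training as in the earlier loss sketch. All layer sizes are assumptions, not the authors' exact configuration.

```python
from tensorflow.keras import layers, models

def build_model(frames=75, h=46, w=140, vocab=40):
    # Input: a clip of grayscale mouth crops, shape (time, height, width, 1).
    inp = layers.Input(shape=(frames, h, w, 1))
    x = layers.Conv3D(32, 3, padding="same", activation="relu")(inp)
    x = layers.MaxPool3D(pool_size=(1, 2, 2))(x)  # pool space, preserve time
    x = layers.Conv3D(64, 3, padding="same", activation="relu")(x)
    x = layers.MaxPool3D(pool_size=(1, 2, 2))(x)
    # Flatten the spatial dimensions so each timestep becomes a feature vector.
    x = layers.TimeDistributed(layers.Flatten())(x)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    # Per-frame distribution over phoneme/character classes, ready for CTC.
    out = layers.Dense(vocab, activation="softmax")(x)
    return models.Model(inp, out)

model = build_model()
model.summary()
```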
Dynamic learning rate scheduling and a custom callback that visualizes predictions during training are integrated into the training process. After training on a sizable dataset, the model converges reliably and demonstrates its capacity to capture intricate temporal correlations.
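The abstract does not specify the schedule or the callback's internals; the sketch below pairs Keras's LearningRateScheduler with a hypothetical callback that greedily decodes one validation batch per epoch so progress can be inspected qualitatively.

```python
import math
import tensorflow as tf

# Hypothetical schedule: hold the learning rate for 30 epochs, then decay
# exponentially; the decay rule is our assumption.
def schedule(epoch, lr):
    return lr if epoch < 30 else lr * math.exp(-0.1)

class SamplePrediction(tf.keras.callbacks.Callback):
    """Greedy-decode one validation batch after each epoch, a stand-in for
    the paper's prediction-visualization callback."""
    def __init__(self, dataset):
        super().__init__()
        self.dataset = dataset

    def on_epoch_end(self, epoch, logs=None):
        clips, _ = next(iter(self.dataset))
        probs = self.model.predict(clips, verbose=0)
        decoded, _ = tf.keras.backend.ctc_decode(
            probs, input_length=[probs.shape[1]] * probs.shape[0])
        print(f"epoch {epoch}: sample prediction {decoded[0][0].numpy()}")

# Usage: model.fit(train_ds, callbacks=[
#     tf.keras.callbacks.LearningRateScheduler(schedule), SamplePrediction(val_ds)])
```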
The results are assessed thoroughly through both quantitative and qualitative evaluations. We inspect the model's lip-reading output visually and measure its performance with standard speech recognition metrics. We also explore how different model topologies and hyperparameters affect performance, offering guidance for future research.
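The abstract names standard speech recognition metrics without listing them; word error rate (WER) and character error rate (CER) are the usual choices, so here is a self-contained sketch of both built on Levenshtein edit distance (the example strings are ours).

```python
def edit_distance(ref, hyp):
    # Levenshtein distance over token sequences, by dynamic programming.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def wer(ref, hyp):  # word error rate: word-level edits / reference word count
    return edit_distance(ref.split(), hyp.split()) / max(len(ref.split()), 1)

def cer(ref, hyp):  # character error rate: character-level edits / reference length
    return edit_distance(list(ref), list(hyp)) / max(len(ref), 1)

print(wer("set the lamp on now", "set the lamp on now"))  # 0.0
print(cer("place blue", "placed blue"))                   # 0.1 (one insertion)
```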
The trained model is tested on external video samples to demonstrate its practical applicability, showing its accuracy and robustness when lip-reading spoken phrases.
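A hypothetical end-to-end inference call on an external clip, reusing preprocess_video and build_model from the sketches above (the function names and file path are ours, not the paper's):

```python
import numpy as np
import tensorflow as tf

# Assumes the external clip matches the frame count the model was built for.
clip = preprocess_video("external_sample.mp4")  # (frames, 46, 140)
batch = clip[np.newaxis, ..., np.newaxis]       # -> (1, frames, 46, 140, 1)
probs = model.predict(batch, verbose=0)         # (1, frames, vocab)
decoded, _ = tf.keras.backend.ctc_decode(probs, input_length=[probs.shape[1]])
print(decoded[0][0].numpy())  # label indices; map through the vocabulary for text
```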
By providing a deep learning framework for precise and efficient speech recognition, this research contributes to the rapidly evolving field of lip-reading systems. The results open opportunities for further development and deployment in fields such as assistive technologies, audio-visual communication systems, and human-computer interaction.