Automated Dubbing and Facial Synchronization using Deep Learning
Saad A. Bazaz, AbdurRehman Subhani, Syed Z.A. Hadi
2022 2nd International Conference on Artificial Intelligence (ICAI), published 2022-03-30
DOI: 10.1109/ICAI55435.2022.9773697
Abstract
With the recent global boom in video content creation and consumption during the pandemic, language remains the main barrier to producing immersive content for global communities. To overcome it, content creators rely on manual dubbing, in which voice actors are hired to record a “voiceover” for the video. We aim to break down the language barrier and thus make “videos for everyone”. We propose an end-to-end architecture that uses deep learning models to automatically translate videos and produce synchronized dubbed voices in a specified target language. The architecture is modular, allowing the user to tweak each component or replace it with a better one. We present results from this architecture and describe possible future directions for scaling it to multiple languages and use cases. A sample of our results can be found here: https://youtu.be/eGB-gL6bDr4
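The modular design described in the abstract can be illustrated with a small sketch. The abstract does not list the individual components, so the four stages used here (transcribe, translate, synthesize, lip_sync) are assumptions chosen only for illustration, and every stage body is a string placeholder rather than a real model. The class name DubbingPipeline and its methods are likewise hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A stage receives the shared job state and returns an updated copy of it.
State = Dict[str, str]
Stage = Callable[[State], State]


@dataclass
class DubbingPipeline:
    """Ordered, named stages; any one stage can be swapped independently."""
    stages: List[Tuple[str, Stage]] = field(default_factory=list)

    def add(self, name: str, stage: Stage) -> "DubbingPipeline":
        self.stages.append((name, stage))
        return self

    def replace(self, name: str, stage: Stage) -> None:
        # Swap a single component while leaving the others untouched.
        for i, (existing, _) in enumerate(self.stages):
            if existing == name:
                self.stages[i] = (name, stage)
                return
        raise KeyError(name)

    def run(self, job: State) -> State:
        state = dict(job)  # do not mutate the caller's dictionary
        for _, stage in self.stages:
            state = stage(state)
        return state


# Placeholder stages (assumed for illustration; real ones would wrap models).
def transcribe(state: State) -> State:
    return {**state, "transcript": f"<transcript of {state['video']}>"}


def translate(state: State) -> State:
    text = f"<{state['target_lang']} translation of {state['transcript']}>"
    return {**state, "translation": text}


def synthesize(state: State) -> State:
    return {**state, "dubbed_audio": f"<speech for {state['translation']}>"}


def lip_sync(state: State) -> State:
    out = f"<{state['video']} re-synced to {state['dubbed_audio']}>"
    return {**state, "output_video": out}


def build_pipeline() -> DubbingPipeline:
    return (
        DubbingPipeline()
        .add("transcribe", transcribe)
        .add("translate", translate)
        .add("synthesize", synthesize)
        .add("lip_sync", lip_sync)
    )


if __name__ == "__main__":
    pipeline = build_pipeline()
    result = pipeline.run({"video": "talk.mp4", "target_lang": "ur"})
    print(result["output_video"])
```

Because each stage only reads and writes keys in a shared dictionary, replacing one component (for example, calling replace("translate", other_translator)) does not require changes to the stages around it, which is the property the abstract attributes to the modular approach.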