{"title":"Audio-to-Facial Landmarks Generator for Talking Face Video Synthesis","authors":"Dasol Jeong, Injae Lee, J. Paik","doi":"10.1109/ICEIC57457.2023.10049847","DOIUrl":null,"url":null,"abstract":"Audio driven talking face methods have been studied to process the accuracy lip synchronization. However, how to create movement of head poses and personalized facial features is a challenging problem. In order to solve this problem, it is necessary to identify the context based on the audio, create the head pose and lip motion, and synthesize the personalized face. We introduce a facial landmark generation method including audio-based head pose and lip motion using an audio transformer. The audio transformer extracts audio features containing contextual information and creates generalized head pose and lip motion landmarks. In order to synthesize personalized features on the generated landmarks, a talking face video is generated by applying the method learned through meta-learning. With just a few single images, even unknown faces can be spoken in the audio you want. In addition, the proposed method is applicable to various languages, and enables photo-realistic synthesis and fast inference.","PeriodicalId":373752,"journal":{"name":"2023 International Conference on Electronics, Information, and Communication (ICEIC)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 International Conference on Electronics, Information, and Communication (ICEIC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICEIC57457.2023.10049847","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Audio-driven talking face methods have been studied to achieve accurate lip synchronization. However, generating head-pose movement and personalized facial features remains a challenging problem. Solving it requires identifying the context from the audio, generating the corresponding head pose and lip motion, and synthesizing a personalized face. We introduce a facial landmark generation method that produces audio-based head pose and lip motion using an audio transformer. The audio transformer extracts audio features containing contextual information and generates generalized head-pose and lip-motion landmarks. To synthesize personalized features on the generated landmarks, a talking face video is produced by a model trained through meta-learning. With only a few images, even unseen faces can be made to speak arbitrary audio. In addition, the proposed method is applicable to various languages and enables photo-realistic synthesis with fast inference.
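To make the audio-to-landmark stage concrete, the sketch below shows one plausible shape for such a model: a transformer encoder over per-frame audio features that regresses 2-D facial landmarks frame by frame. This is a minimal illustration, not the paper's implementation; the use of PyTorch, the mel-spectrogram input, the 68-point landmark layout, and all module names and dimensions are assumptions.

```python
# Illustrative sketch only: architecture details (PyTorch, 80-bin mel input,
# 68 2-D landmarks, layer sizes) are assumed, not taken from the paper.
import torch
import torch.nn as nn

class AudioToLandmarks(nn.Module):
    """Maps a sequence of audio features to per-frame facial landmarks
    (head pose + lip motion), loosely following the paper's description."""
    def __init__(self, audio_dim=80, d_model=256, n_heads=4,
                 n_layers=4, n_landmarks=68):
        super().__init__()
        self.in_proj = nn.Linear(audio_dim, d_model)  # project audio frames
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        # Self-attention over the audio sequence supplies the contextual
        # information the abstract attributes to the audio transformer.
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # Each output frame: n_landmarks 2-D points, flattened.
        self.out_proj = nn.Linear(d_model, n_landmarks * 2)

    def forward(self, audio_feats):                   # (B, T, audio_dim)
        h = self.encoder(self.in_proj(audio_feats))   # (B, T, d_model)
        out = self.out_proj(h)                        # (B, T, n_landmarks*2)
        return out.view(out.size(0), out.size(1), -1, 2)  # (B, T, 68, 2)

# Example: 100 audio feature frames -> 100 landmark frames.
mel = torch.randn(1, 100, 80)
landmarks = AudioToLandmarks()(mel)
print(landmarks.shape)  # torch.Size([1, 100, 68, 2])
```

In the full pipeline these generalized landmarks would then condition a separate renderer, which the paper describes as being adapted to a new identity from a few reference images via meta-learning; that rendering stage is not sketched here.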