The Generation of Articulatory Animations Based on Keypoint Detection and Motion Transfer Combined with Image Style Transfer

Comput. · Pub date: 2023-07-28 · DOI: 10.3390/computers12080150
Xufeng Ling, Yun Zhu, W. Liu, Jingxin Liang, Jie Yang
{"title":"基于关键点检测和运动转移结合图像风格转移的发音动画生成","authors":"Xufeng Ling, Yun Zhu, W. Liu, Jingxin Liang, Jie Yang","doi":"10.3390/computers12080150","DOIUrl":null,"url":null,"abstract":"Knowing the correct positioning of the tongue and mouth for pronunciation is crucial for learning English pronunciation correctly. Articulatory animation is an effective way to address the above task and helpful to English learners. However, articulatory animations are all traditionally hand-drawn. Different situations require varying animation styles, so a comprehensive redraw of all the articulatory animations is necessary. To address this issue, we developed a method for the automatic generation of articulatory animations using a deep learning system. Our method leverages an automatic keypoint-based detection network, a motion transfer network, and a style transfer network to generate a series of articulatory animations that adhere to the desired style. By inputting a target-style articulation image, our system is capable of producing animations with the desired characteristics. We created a dataset of articulation images and animations from public sources, including the International Phonetic Association (IPA), to establish our articulation image animation dataset. We performed preprocessing on the articulation images by segmenting them into distinct areas each corresponding to a specific articulatory part, such as the tongue, upper jaw, lower jaw, soft palate, and vocal cords. We trained a deep neural network model capable of automatically detecting the keypoints in typical articulation images. Also, we trained a generative adversarial network (GAN) model that can generate end-to-end animation of different styles automatically from the characteristics of keypoints and the learned image style. To train a relatively robust model, we used four different style videos: one magnetic resonance imaging (MRI) articulatory video and three hand-drawn videos. For further applications, we combined the consonant and vowel animations together to generate a syllable animation and the animation of a word consisting of many syllables. Experiments show that this system can auto-generate articulatory animations according to input phonetic symbols and should be helpful to people for English articulation correction.","PeriodicalId":10526,"journal":{"name":"Comput.","volume":"10 1","pages":"150"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"The Generation of Articulatory Animations Based on Keypoint Detection and Motion Transfer Combined with Image Style Transfer\",\"authors\":\"Xufeng Ling, Yun Zhu, W. Liu, Jingxin Liang, Jie Yang\",\"doi\":\"10.3390/computers12080150\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Knowing the correct positioning of the tongue and mouth for pronunciation is crucial for learning English pronunciation correctly. Articulatory animation is an effective way to address the above task and helpful to English learners. However, articulatory animations are all traditionally hand-drawn. Different situations require varying animation styles, so a comprehensive redraw of all the articulatory animations is necessary. To address this issue, we developed a method for the automatic generation of articulatory animations using a deep learning system. 
Our method leverages an automatic keypoint-based detection network, a motion transfer network, and a style transfer network to generate a series of articulatory animations that adhere to the desired style. By inputting a target-style articulation image, our system is capable of producing animations with the desired characteristics. We created a dataset of articulation images and animations from public sources, including the International Phonetic Association (IPA), to establish our articulation image animation dataset. We performed preprocessing on the articulation images by segmenting them into distinct areas each corresponding to a specific articulatory part, such as the tongue, upper jaw, lower jaw, soft palate, and vocal cords. We trained a deep neural network model capable of automatically detecting the keypoints in typical articulation images. Also, we trained a generative adversarial network (GAN) model that can generate end-to-end animation of different styles automatically from the characteristics of keypoints and the learned image style. To train a relatively robust model, we used four different style videos: one magnetic resonance imaging (MRI) articulatory video and three hand-drawn videos. For further applications, we combined the consonant and vowel animations together to generate a syllable animation and the animation of a word consisting of many syllables. Experiments show that this system can auto-generate articulatory animations according to input phonetic symbols and should be helpful to people for English articulation correction.\",\"PeriodicalId\":10526,\"journal\":{\"name\":\"Comput.\",\"volume\":\"10 1\",\"pages\":\"150\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-07-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Comput.\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.3390/computers12080150\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Comput.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3390/computers12080150","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

Knowing the correct positioning of the tongue and mouth is crucial for learning English pronunciation. Articulatory animation is an effective way to address this task and is helpful to English learners. Traditionally, however, articulatory animations are hand-drawn, and because different situations call for different animation styles, adopting a new style requires redrawing every animation. To address this issue, we developed a method for automatically generating articulatory animations with a deep learning system. Our method leverages a keypoint detection network, a motion transfer network, and a style transfer network to generate a series of articulatory animations that adhere to a desired style: given a target-style articulation image as input, the system produces animations with the desired characteristics. We collected articulation images and animations from public sources, including the International Phonetic Association (IPA), to establish our articulation image and animation dataset. We preprocessed the articulation images by segmenting them into distinct areas, each corresponding to a specific articulatory part such as the tongue, upper jaw, lower jaw, soft palate, or vocal cords. We trained a deep neural network model that automatically detects the keypoints in typical articulation images, and a generative adversarial network (GAN) model that automatically generates end-to-end animations in different styles from the keypoint characteristics and the learned image style. To train a relatively robust model, we used videos in four different styles: one magnetic resonance imaging (MRI) articulatory video and three hand-drawn videos. For further applications, we concatenated consonant and vowel animations to generate syllable animations and animations of words consisting of multiple syllables. Experiments show that the system can automatically generate articulatory animations from input phonetic symbols and should help learners correct their English articulation.
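To make the pipeline concrete, here is a minimal sketch of the three stages the abstract describes: heatmap-based keypoint detection, motion transfer driven by keypoint offsets, and rendering from a target-style image. It assumes PyTorch; the module layouts, image sizes, and the ten-keypoint budget are illustrative assumptions, not the paper's architecture, and random tensors stand in for real video frames.

```python
# Hypothetical sketch (not the authors' code) of the keypoint-detection +
# motion-transfer + style-rendering pipeline described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_KEYPOINTS = 10  # illustrative budget: tongue, jaws, soft palate, vocal cords, ...


class KeypointDetector(nn.Module):
    """Predicts one heatmap per keypoint and reduces it to (x, y) via soft-argmax."""

    def __init__(self, num_kp=NUM_KEYPOINTS):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, num_kp, 3, padding=1),
        )

    def forward(self, img):                      # img: (B, 3, H, W)
        heat = self.backbone(img)                # (B, K, h, w)
        b, k, h, w = heat.shape
        prob = F.softmax(heat.view(b, k, -1), dim=-1).view(b, k, h, w)
        ys = torch.linspace(-1.0, 1.0, h, device=img.device)
        xs = torch.linspace(-1.0, 1.0, w, device=img.device)
        y = (prob.sum(dim=3) * ys).sum(dim=2)    # expected y per keypoint
        x = (prob.sum(dim=2) * xs).sum(dim=2)    # expected x per keypoint
        return torch.stack([x, y], dim=-1)       # (B, K, 2), coords in [-1, 1]


class MotionTransferGenerator(nn.Module):
    """Renders a frame from a style image plus driving-minus-source keypoint offsets."""

    def __init__(self, num_kp=NUM_KEYPOINTS):
        super().__init__()
        self.kp_embed = nn.Linear(num_kp * 2, 16 * 16)
        self.decoder = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, style_img, kp_offset):     # style_img: (B, 3, 256, 256)
        b = style_img.size(0)
        motion = self.kp_embed(kp_offset.flatten(1)).view(b, 1, 16, 16)
        motion = F.interpolate(motion, size=style_img.shape[-2:],
                               mode="bilinear", align_corners=False)
        return self.decoder(torch.cat([style_img, motion], dim=1))


# Inference loop: drive a single target-style articulation image with the
# keypoint motion of a source-style video (random tensors as placeholders).
detector = KeypointDetector().eval()
generator = MotionTransferGenerator().eval()
style_img = torch.rand(1, 3, 256, 256)                     # target-style image
driving = [torch.rand(1, 3, 256, 256) for _ in range(8)]   # driving video frames

with torch.no_grad():
    kp_style = detector(style_img)
    frames = [generator(style_img, detector(f) - kp_style) for f in driving]
print(len(frames), frames[0].shape)  # 8 frames of shape (1, 3, 256, 256)
```

In the paper the animation generator is trained adversarially as a GAN; here the networks are untrained and merely exercise the data flow end to end.

The final assembly step, joining consonant and vowel clips into syllable and word animations, can be sketched the same way. The cross-fade length and the phoneme-to-clip mapping below are assumptions for illustration; the paper does not specify how phoneme boundaries are smoothed.

```python
# Hypothetical sketch of assembling per-phoneme animations into a word
# animation, with a short linear cross-fade at each join so articulator
# positions do not jump between phonemes.
import numpy as np

def crossfade_concat(clips, blend=4):
    """Concatenate frame arrays of shape (T, H, W, 3), blending `blend` frames per join."""
    out = clips[0]
    for nxt in clips[1:]:
        k = min(blend, len(out), len(nxt))
        alphas = np.linspace(0.0, 1.0, k)[:, None, None, None]
        seam = (1 - alphas) * out[-k:] + alphas * nxt[:k]
        out = np.concatenate([out[:-k], seam, nxt[k:]], axis=0)
    return out

# Toy phoneme -> animation lookup ("ae" stands in for /æ/; random frames
# stand in for generated ones).
rng = np.random.default_rng(0)
phoneme_clips = {p: rng.random((12, 64, 64, 3)) for p in ["k", "ae", "t"]}

# "cat" = /k/ + /æ/ + /t/: consonant + vowel + consonant as one word animation.
word = crossfade_concat([phoneme_clips[p] for p in ["k", "ae", "t"]])
print(word.shape)  # (28, 64, 64, 3): 3 * 12 frames minus 2 joins * 4 blended frames
```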