The Generation of Articulatory Animations Based on Keypoint Detection and Motion Transfer Combined with Image Style Transfer

Comput. · Pub date: 2023-07-28 · DOI: 10.3390/computers12080150
Xufeng Ling, Yun Zhu, W. Liu, Jingxin Liang, Jie Yang
{"title":"基于关键点检测和运动转移结合图像风格转移的发音动画生成","authors":"Xufeng Ling, Yun Zhu, W. Liu, Jingxin Liang, Jie Yang","doi":"10.3390/computers12080150","DOIUrl":null,"url":null,"abstract":"Knowing the correct positioning of the tongue and mouth for pronunciation is crucial for learning English pronunciation correctly. Articulatory animation is an effective way to address the above task and helpful to English learners. However, articulatory animations are all traditionally hand-drawn. Different situations require varying animation styles, so a comprehensive redraw of all the articulatory animations is necessary. To address this issue, we developed a method for the automatic generation of articulatory animations using a deep learning system. Our method leverages an automatic keypoint-based detection network, a motion transfer network, and a style transfer network to generate a series of articulatory animations that adhere to the desired style. By inputting a target-style articulation image, our system is capable of producing animations with the desired characteristics. We created a dataset of articulation images and animations from public sources, including the International Phonetic Association (IPA), to establish our articulation image animation dataset. We performed preprocessing on the articulation images by segmenting them into distinct areas each corresponding to a specific articulatory part, such as the tongue, upper jaw, lower jaw, soft palate, and vocal cords. We trained a deep neural network model capable of automatically detecting the keypoints in typical articulation images. Also, we trained a generative adversarial network (GAN) model that can generate end-to-end animation of different styles automatically from the characteristics of keypoints and the learned image style. To train a relatively robust model, we used four different style videos: one magnetic resonance imaging (MRI) articulatory video and three hand-drawn videos. For further applications, we combined the consonant and vowel animations together to generate a syllable animation and the animation of a word consisting of many syllables. Experiments show that this system can auto-generate articulatory animations according to input phonetic symbols and should be helpful to people for English articulation correction.","PeriodicalId":10526,"journal":{"name":"Comput.","volume":"10 1","pages":"150"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"The Generation of Articulatory Animations Based on Keypoint Detection and Motion Transfer Combined with Image Style Transfer\",\"authors\":\"Xufeng Ling, Yun Zhu, W. Liu, Jingxin Liang, Jie Yang\",\"doi\":\"10.3390/computers12080150\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Knowing the correct positioning of the tongue and mouth for pronunciation is crucial for learning English pronunciation correctly. Articulatory animation is an effective way to address the above task and helpful to English learners. However, articulatory animations are all traditionally hand-drawn. Different situations require varying animation styles, so a comprehensive redraw of all the articulatory animations is necessary. To address this issue, we developed a method for the automatic generation of articulatory animations using a deep learning system. 
Our method leverages an automatic keypoint-based detection network, a motion transfer network, and a style transfer network to generate a series of articulatory animations that adhere to the desired style. By inputting a target-style articulation image, our system is capable of producing animations with the desired characteristics. We created a dataset of articulation images and animations from public sources, including the International Phonetic Association (IPA), to establish our articulation image animation dataset. We performed preprocessing on the articulation images by segmenting them into distinct areas each corresponding to a specific articulatory part, such as the tongue, upper jaw, lower jaw, soft palate, and vocal cords. We trained a deep neural network model capable of automatically detecting the keypoints in typical articulation images. Also, we trained a generative adversarial network (GAN) model that can generate end-to-end animation of different styles automatically from the characteristics of keypoints and the learned image style. To train a relatively robust model, we used four different style videos: one magnetic resonance imaging (MRI) articulatory video and three hand-drawn videos. For further applications, we combined the consonant and vowel animations together to generate a syllable animation and the animation of a word consisting of many syllables. Experiments show that this system can auto-generate articulatory animations according to input phonetic symbols and should be helpful to people for English articulation correction.\",\"PeriodicalId\":10526,\"journal\":{\"name\":\"Comput.\",\"volume\":\"10 1\",\"pages\":\"150\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-07-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Comput.\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.3390/computers12080150\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Comput.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3390/computers12080150","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

Knowing the correct positioning of the tongue and mouth is crucial for learning English pronunciation. Articulatory animation is an effective way to address this task and is helpful to English learners. Traditionally, however, articulatory animations are hand-drawn, and because different situations call for different animation styles, adopting a new style requires redrawing every animation. To address this issue, we developed a method for automatically generating articulatory animations with a deep learning system. Our method leverages a keypoint detection network, a motion transfer network, and a style transfer network to generate a series of articulatory animations that adhere to a desired style: given a target-style articulation image as input, the system produces animations with the desired characteristics. We collected articulation images and animations from public sources, including the International Phonetic Association (IPA), to establish our articulation image and animation dataset. We preprocessed the articulation images by segmenting them into distinct areas, each corresponding to a specific articulatory part such as the tongue, upper jaw, lower jaw, soft palate, or vocal cords. We trained a deep neural network model that automatically detects the keypoints in typical articulation images, and a generative adversarial network (GAN) model that automatically generates end-to-end animations in different styles from the keypoint characteristics and the learned image style. To train a relatively robust model, we used videos in four different styles: one magnetic resonance imaging (MRI) articulatory video and three hand-drawn videos. For further applications, we concatenated consonant and vowel animations to generate syllable animations and animations of words consisting of multiple syllables. Experiments show that the system can automatically generate articulatory animations from input phonetic symbols and should help learners correct their English articulation.
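To make the pipeline concrete, here is a minimal sketch of the three stages the abstract describes: heatmap-based keypoint detection, motion transfer driven by keypoint offsets, and rendering from a target-style image. It assumes PyTorch; the module layouts, image sizes, and the ten-keypoint budget are illustrative assumptions, not the paper's architecture, and random tensors stand in for real video frames.

```python
# Hypothetical sketch (not the authors' code) of the keypoint-detection +
# motion-transfer + style-rendering pipeline described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_KEYPOINTS = 10  # illustrative budget: tongue, jaws, soft palate, vocal cords, ...


class KeypointDetector(nn.Module):
    """Predicts one heatmap per keypoint and reduces it to (x, y) via soft-argmax."""

    def __init__(self, num_kp=NUM_KEYPOINTS):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, num_kp, 3, padding=1),
        )

    def forward(self, img):                      # img: (B, 3, H, W)
        heat = self.backbone(img)                # (B, K, h, w)
        b, k, h, w = heat.shape
        prob = F.softmax(heat.view(b, k, -1), dim=-1).view(b, k, h, w)
        ys = torch.linspace(-1.0, 1.0, h, device=img.device)
        xs = torch.linspace(-1.0, 1.0, w, device=img.device)
        y = (prob.sum(dim=3) * ys).sum(dim=2)    # expected y per keypoint
        x = (prob.sum(dim=2) * xs).sum(dim=2)    # expected x per keypoint
        return torch.stack([x, y], dim=-1)       # (B, K, 2), coords in [-1, 1]


class MotionTransferGenerator(nn.Module):
    """Renders a frame from a style image plus driving-minus-source keypoint offsets."""

    def __init__(self, num_kp=NUM_KEYPOINTS):
        super().__init__()
        self.kp_embed = nn.Linear(num_kp * 2, 16 * 16)
        self.decoder = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, style_img, kp_offset):     # style_img: (B, 3, 256, 256)
        b = style_img.size(0)
        motion = self.kp_embed(kp_offset.flatten(1)).view(b, 1, 16, 16)
        motion = F.interpolate(motion, size=style_img.shape[-2:],
                               mode="bilinear", align_corners=False)
        return self.decoder(torch.cat([style_img, motion], dim=1))


# Inference loop: drive a single target-style articulation image with the
# keypoint motion of a source-style video (random tensors as placeholders).
detector = KeypointDetector().eval()
generator = MotionTransferGenerator().eval()
style_img = torch.rand(1, 3, 256, 256)                     # target-style image
driving = [torch.rand(1, 3, 256, 256) for _ in range(8)]   # driving video frames

with torch.no_grad():
    kp_style = detector(style_img)
    frames = [generator(style_img, detector(f) - kp_style) for f in driving]
print(len(frames), frames[0].shape)  # 8 frames of shape (1, 3, 256, 256)
```

In the paper the animation generator is trained adversarially as a GAN; here the networks are untrained and merely exercise the data flow end to end.

The final assembly step, joining consonant and vowel clips into syllable and word animations, can be sketched the same way. The cross-fade length and the phoneme-to-clip mapping below are assumptions for illustration; the paper does not specify how phoneme boundaries are smoothed.

```python
# Hypothetical sketch of assembling per-phoneme animations into a word
# animation, with a short linear cross-fade at each join so articulator
# positions do not jump between phonemes.
import numpy as np

def crossfade_concat(clips, blend=4):
    """Concatenate frame arrays of shape (T, H, W, 3), blending `blend` frames per join."""
    out = clips[0]
    for nxt in clips[1:]:
        k = min(blend, len(out), len(nxt))
        alphas = np.linspace(0.0, 1.0, k)[:, None, None, None]
        seam = (1 - alphas) * out[-k:] + alphas * nxt[:k]
        out = np.concatenate([out[:-k], seam, nxt[k:]], axis=0)
    return out

# Toy phoneme -> animation lookup ("ae" stands in for /æ/; random frames
# stand in for generated ones).
rng = np.random.default_rng(0)
phoneme_clips = {p: rng.random((12, 64, 64, 3)) for p in ["k", "ae", "t"]}

# "cat" = /k/ + /æ/ + /t/: consonant + vowel + consonant as one word animation.
word = crossfade_concat([phoneme_clips[p] for p in ["k", "ae", "t"]])
print(word.shape)  # (28, 64, 64, 3): 3 * 12 frames minus 2 joins * 4 blended frames
```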