End-to-End Mandarin Speech Reconstruction Based on Ultrasound Tongue Images Using Deep Learning

IF 4.8 | Zone 2 (Medicine) | JCR Q2 (Engineering, Biomedical) | IEEE Transactions on Neural Systems and Rehabilitation Engineering | Pub Date: 2024-12-20 | DOI: 10.1109/TNSRE.2024.3520498
Fengji Li;Fei Shen;Ding Ma;Jie Zhou;Shaochuan Zhang;Li Wang;Fan Fan;Tao Liu;Xiaohong Chen;Tomoki Toda;Haijun Niu
Volume 33, pages 140-149. Full record: https://ieeexplore.ieee.org/document/10810495/
Citations: 0

Abstract

The loss of speech function following a laryngectomy usually causes severe physiological and psychological distress for laryngectomees. In clinical practice, most laryngectomees retain intact articulatory organs of the upper vocal tract, which makes speech rehabilitation based on articulatory motion information a promising way to restore speech. This study proposes a deep learning-based end-to-end method for reconstructing speech from ultrasound tongue images. First, ultrasound tongue images and speech data were collected simultaneously using a purpose-designed Mandarin corpus. Next, a speech reconstruction model was built on adversarial neural networks. The model comprises a pretrained feature extractor that processes the ultrasound images, an upsampling block that generates speech, and discriminators that enforce the similarity and fidelity of the reconstructed speech. Finally, the reconstructed speech was evaluated both objectively and subjectively. It showed high intelligibility for both Mandarin phonemes and tones: the character error rate of phonemes in automatic speech recognition was 0.2605, and the tone error rate obtained from dictation tests was 0.1784. Objective results showed high similarity between the reconstructed and ground-truth speech, and subjective perception results indicated an acceptable level of naturalness. The proposed method can therefore reconstruct tonal Mandarin speech from ultrasound tongue images. Future research should address the specific conditions of laryngectomees and improve model performance by enlarging the training datasets, investigating the impact of ultrasound tongue imaging parameters, and further refining the method.
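The abstract reports a character error rate and a tone error rate but does not spell out how they were scored. As a minimal sketch, assuming the conventional definition (total edit distance between reference and recognized token sequences, divided by the total reference length), the calculation could look like the following. The function names and the pinyin tokens are hypothetical illustrations, not the authors' evaluation code or data.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference token
                curr[j - 1] + 1,     # insert a hypothesis token
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses):
    """Total edits over total reference length, pooled across utterances."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_len = sum(len(r) for r in references)
    if total_len == 0:
        raise ValueError("references must contain at least one token")
    return total_edits / total_len


# Hypothetical example: tone-marked pinyin syllables as tokens.
refs = [["ni3", "hao3", "ma5"], ["xie4", "xie5"]]
hyps = [["ni3", "hao4", "ma5"], ["xie4"]]
rate = error_rate(refs, hyps)  # one substitution + one deletion over 5 tokens
```

The same function covers a tone error rate if each token is reduced to its tone label before scoring. Pooling edits across utterances, rather than averaging per-utterance rates, is a design choice assumed here; it keeps short utterances from dominating the result.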
Source journal: IEEE Transactions on Neural Systems and Rehabilitation Engineering
CiteScore: 8.60
Self-citation rate: 8.20%
Articles published: 479
Review time: 6-12 weeks
Journal scope: Rehabilitative and neural aspects of biomedical engineering, including functional electrical stimulation, acoustic dynamics, human performance measurement and analysis, nerve stimulation, electromyography, motor control and stimulation; and hardware and software applications for rehabilitation engineering and assistive devices.
Latest articles from this journal:
Enhancing Manual Wheelchair Propulsion: Incremental Assistance Levels of Pushrim-Activated Power-Assist Proportionally Reduce Physiological and Biomechanical Demands in Able-Bodied Participants
Improving Acceptance to Sensory Substitution: A study on the V2A-SS Learning Model based on Information Processing Learning Theory
The More, the Better? Evaluating the Role of EEG Preprocessing for Deep Learning Applications
Locomotion Joint Angle and Moment Estimation With Soft Wearable Sensors for Personalized Exosuit Control
LAST-PAIN: Learning Adaptive Spike Thresholds for Low Back Pain Biosignals Classification