Accurate synthesis of dysarthric speech for ASR data augmentation

Speech Communication · IF 2.4 · CAS Tier 3 (Computer Science) · Q2 (ACOUSTICS) · Pub Date: 2024-08-10 · DOI: 10.1016/j.specom.2024.103112
Citations: 0

Abstract

Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility through slow, uncoordinated control of speech production muscles. Automatic Speech Recognition (ASR) systems can help dysarthric talkers communicate more effectively. However, robust dysarthria-specific ASR requires a significant amount of training speech, which is not readily available for dysarthric talkers.

This paper presents a new dysarthric speech synthesis method for the purpose of ASR training data augmentation. Differences in prosodic and acoustic characteristics of dysarthric spontaneous speech at varying severity levels are important components for dysarthric speech modeling, synthesis, and augmentation. For dysarthric speech synthesis, a modified neural multi-talker TTS is implemented by adding a dysarthria severity level coefficient and a pause insertion model to synthesize dysarthric speech for varying severity levels.
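The two controls described above, a severity coefficient that conditions the TTS output and a pause insertion model, can be sketched roughly as follows. This is a minimal illustrative sketch only, not the paper's actual neural architecture: the function name, the duration-stretching rule, and the pause probabilities are all assumptions made for the example.

```python
import random

# Hypothetical front-end sketch: condition synthesis on a dysarthria
# severity coefficient (0.0 = typical speech, 1.0 = severe). In the
# paper these controls feed a neural multi-talker TTS; here we only
# illustrate their effect on a symbolic phone/duration sequence.
def apply_severity(phonemes, durations, severity,
                   pause_prob_per_word=0.6, seed=0):
    """Stretch phone durations and insert word-boundary pauses.

    phonemes:  list of phone symbols; '|' marks a word boundary
    durations: per-phone durations in seconds (same length as phonemes)
    severity:  dysarthria severity coefficient in [0, 1]
    """
    rng = random.Random(seed)
    out_phones, out_durs = [], []
    for ph, dur in zip(phonemes, durations):
        if ph == "|":
            # Pause insertion model: more severe dysarthria pauses
            # more often at word boundaries, with longer pauses.
            if rng.random() < pause_prob_per_word * severity:
                out_phones.append("sil")
                out_durs.append(0.1 + 0.4 * severity)  # assumed scaling
            continue
        out_phones.append(ph)
        # Severity coefficient slows speech by stretching durations.
        out_durs.append(dur * (1.0 + severity))
    return out_phones, out_durs
```

For example, `apply_severity(["hh", "ah", "|", "d"], [0.08, 0.10, 0.0, 0.09], severity=0.0)` leaves the sequence unchanged, while higher severity values lengthen phones and insert silences, mimicking the slow, pause-heavy patterning the abstract describes.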

To evaluate the effectiveness of the synthesized training data for ASR, dysarthria-specific speech recognition was used. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves a relative Word Error Rate (WER) improvement of 12.2% compared to the baseline, and that the addition of the severity level and pause insertion controls decreases WER by a further 6.5%, showing the effectiveness of adding these parameters. Overall results on the TORGO database demonstrate that using synthetic dysarthric speech to increase the amount of dysarthric-patterned training speech has a significant impact on dysarthric ASR systems. In addition, we conducted a subjective evaluation of the dysarthricness and similarity of the synthesized speech. Our subjective evaluation shows that the perceived dysarthricness of synthesized speech is similar to that of true dysarthric speech, especially at higher levels of dysarthria. Audio samples are available at https://mohammadelc.github.io/SpeechGroupUKY/
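The relative WER figures quoted above follow standard definitions; the sketch below (with made-up WER values, not the paper's reported scores) shows how word error rate and a relative improvement are typically computed.

```python
def wer(ref, hyp):
    """Word Error Rate: edit distance over words, divided by the
    number of reference words (counts substitutions, deletions,
    and insertions)."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)

def relative_wer_improvement(baseline, augmented):
    """Fraction of the baseline WER removed by the new system."""
    return (baseline - augmented) / baseline
```

For instance, a system that lowers WER from a hypothetical 0.500 to 0.439 has a relative improvement of (0.500 - 0.439) / 0.500 = 12.2%, matching the kind of relative gain the abstract reports.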

Source journal: Speech Communication (Engineering & Technology – Computer Science: Interdisciplinary Applications)
CiteScore: 6.80 · Self-citation rate: 6.20% · Articles per year: 94 · Average review time: 19.2 weeks
Journal overview: Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal's primary objectives are: • to present a forum for the advancement of human and human-machine speech communication science; • to stimulate cross-fertilization between different fields of this domain; • to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.
Latest articles in this journal:
- A corpus of audio-visual recordings of linguistically balanced, Danish sentences for speech-in-noise experiments
- Forms, factors and functions of phonetic convergence: Editorial
- Feasibility of acoustic features of vowel sounds in estimating the upper airway cross sectional area during wakefulness: A pilot study
- Zero-shot voice conversion based on feature disentanglement
- Multi-modal co-learning for silent speech recognition based on ultrasound tongue images