Accurate synthesis of dysarthric speech for ASR data augmentation

Speech Communication · IF 2.4 · CAS Tier 3 (Computer Science) · Q2 (ACOUSTICS) · Pub Date: 2024-08-10 · DOI: 10.1016/j.specom.2024.103112
Citations: 0

Abstract

Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility through slow, uncoordinated control of speech production muscles. Automatic Speech Recognition (ASR) systems can help dysarthric talkers communicate more effectively. However, robust dysarthria-specific ASR requires a significant amount of training speech, which is not readily available for dysarthric talkers.

This paper presents a new dysarthric speech synthesis method for the purpose of ASR training data augmentation. Differences in prosodic and acoustic characteristics of dysarthric spontaneous speech at varying severity levels are important components for dysarthric speech modeling, synthesis, and augmentation. For dysarthric speech synthesis, a modified neural multi-talker TTS is implemented by adding a dysarthria severity level coefficient and a pause insertion model to synthesize dysarthric speech for varying severity levels.
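The two controls described above, a severity coefficient that conditions the TTS output and a pause insertion model, can be sketched roughly as follows. This is a minimal illustrative sketch only, not the paper's actual neural architecture: the function name, the duration-stretching rule, and the pause probabilities are all assumptions made for the example.

```python
import random

# Hypothetical front-end sketch: condition synthesis on a dysarthria
# severity coefficient (0.0 = typical speech, 1.0 = severe). In the
# paper these controls feed a neural multi-talker TTS; here we only
# illustrate their effect on a symbolic phone/duration sequence.
def apply_severity(phonemes, durations, severity,
                   pause_prob_per_word=0.6, seed=0):
    """Stretch phone durations and insert word-boundary pauses.

    phonemes:  list of phone symbols; '|' marks a word boundary
    durations: per-phone durations in seconds (same length as phonemes)
    severity:  dysarthria severity coefficient in [0, 1]
    """
    rng = random.Random(seed)
    out_phones, out_durs = [], []
    for ph, dur in zip(phonemes, durations):
        if ph == "|":
            # Pause insertion model: more severe dysarthria pauses
            # more often at word boundaries, with longer pauses.
            if rng.random() < pause_prob_per_word * severity:
                out_phones.append("sil")
                out_durs.append(0.1 + 0.4 * severity)  # assumed scaling
            continue
        out_phones.append(ph)
        # Severity coefficient slows speech by stretching durations.
        out_durs.append(dur * (1.0 + severity))
    return out_phones, out_durs
```

For example, `apply_severity(["hh", "ah", "|", "d"], [0.08, 0.10, 0.0, 0.09], severity=0.0)` leaves the sequence unchanged, while higher severity values lengthen phones and insert silences, mimicking the slow, pause-heavy patterning the abstract describes.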

To evaluate the effectiveness of the synthesized training data for ASR, dysarthria-specific speech recognition was used. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves a relative Word Error Rate (WER) improvement of 12.2% compared to the baseline, and that the addition of the severity level and pause insertion controls decreases WER by a further 6.5%, showing the effectiveness of adding these parameters. Overall results on the TORGO database demonstrate that using synthetic dysarthric speech to increase the amount of dysarthric-patterned training speech has a significant impact on dysarthric ASR systems. In addition, we conducted a subjective evaluation of the dysarthricness and similarity of the synthesized speech. Our subjective evaluation shows that the perceived dysarthricness of synthesized speech is similar to that of true dysarthric speech, especially at higher levels of dysarthria. Audio samples are available at https://mohammadelc.github.io/SpeechGroupUKY/
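The relative WER figures quoted above follow standard definitions; the sketch below (with made-up WER values, not the paper's reported scores) shows how word error rate and a relative improvement are typically computed.

```python
def wer(ref, hyp):
    """Word Error Rate: edit distance over words, divided by the
    number of reference words (counts substitutions, deletions,
    and insertions)."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)

def relative_wer_improvement(baseline, augmented):
    """Fraction of the baseline WER removed by the new system."""
    return (baseline - augmented) / baseline
```

For instance, a system that lowers WER from a hypothetical 0.500 to 0.439 has a relative improvement of (0.500 - 0.439) / 0.500 = 12.2%, matching the kind of relative gain the abstract reports.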

Source journal: Speech Communication (Engineering & Technology – Computer Science: Interdisciplinary Applications)
CiteScore: 6.80 · Self-citation rate: 6.20% · Articles per year: 94 · Average review time: 19.2 weeks
Journal overview: Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal's primary objectives are: • to present a forum for the advancement of human and human-machine speech communication science; • to stimulate cross-fertilization between different fields of this domain; • to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.
Latest articles in this journal:
- A corpus of audio-visual recordings of linguistically balanced, Danish sentences for speech-in-noise experiments
- Forms, factors and functions of phonetic convergence: Editorial
- Feasibility of acoustic features of vowel sounds in estimating the upper airway cross sectional area during wakefulness: A pilot study
- Zero-shot voice conversion based on feature disentanglement
- Multi-modal co-learning for silent speech recognition based on ultrasound tongue images