Improving Dysarthric Speech Segmentation With Emulated and Synthetic Augmentation

IEEE Journal of Translational Engineering in Health and Medicine (JTEHM) | Impact factor: 4.4 | Zone 3 (Medicine) | JCR Q2 (Engineering, Biomedical) | Published: 2024-03-11 | DOI: 10.1109/JTEHM.2024.3375323

Saeid Alavi Naeini; Leif Simmatis; Deniz Jafari; Yana Yunusova; Babak Taati

Volume 12, pages 382-389. Full text: https://ieeexplore.ieee.org/document/10464345/ (PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10464345)

Citations: 0



Improving Dysarthric Speech Segmentation With Emulated and Synthetic Augmentation

Acoustic features extracted from speech can help with the diagnosis of neurological diseases and monitoring of symptoms over time. Temporal segmentation of audio signals into individual words is an important pre-processing step needed prior to extracting acoustic features. Machine learning techniques could be used to automate speech segmentation via automatic speech recognition (ASR) and sequence to sequence alignment. While state-of-the-art ASR models achieve good performance on healthy speech, their performance significantly drops when evaluated on dysarthric speech. Fine-tuning ASR models on impaired speech can improve performance in dysarthric individuals, but it requires representative clinical data, which is difficult to collect and may raise privacy concerns. This study explores the feasibility of using two augmentation methods to increase ASR performance on dysarthric speech: 1) healthy individuals varying their speaking rate and loudness (as is often used in assessments of pathological speech); 2) synthetic speech with variations in speaking rate and accent (to ensure more diverse vocal representations and fairness). Experimental evaluations showed that fine-tuning a pre-trained ASR model with data from these two sources outperformed a model fine-tuned only on real clinical data and matched the performance of a model fine-tuned on the combination of real clinical data and synthetic speech. When evaluated on held-out acoustic data from 24 individuals with various neurological diseases, the best performing model achieved an average word error rate of 5.7% and a mean correct count accuracy of 94.4%. In segmenting the data into individual words, a mean intersection-over-union of 89.2% was obtained against manual parsing (ground truth). It can be concluded that emulated and synthetic augmentations can significantly reduce the need for real clinical data of dysarthric speech when fine-tuning ASR models and, in turn, for speech segmentation.
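The abstract reports a word error rate for recognition and a mean intersection-over-union for word boundaries, but does not spell out how either was computed. The sketch below uses the standard textbook definitions and may differ from the authors' implementation. The function names, the `(start, end)` interval format in seconds, and the pairing of predicted and manual segments by position are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds; an assumed format


def word_error_rate(reference: List[str], hypothesis: List[str]) -> float:
    """Word-level edit distance divided by the number of reference words."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hypothesis[:j].
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i] + [0] * len(hypothesis)
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(reference)


def interval_iou(a: Interval, b: Interval) -> float:
    """Intersection-over-union of two time intervals."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(predicted: Sequence[Interval], manual: Sequence[Interval]) -> float:
    """Average IoU over segments paired by position (an assumption)."""
    if not manual or len(predicted) != len(manual):
        raise ValueError("need equally sized, non-empty segment lists")
    return sum(interval_iou(p, m) for p, m in zip(predicted, manual)) / len(manual)
```

For example, `word_error_rate("a b c d".split(), "a x c".split())` counts one substitution and one deletion against four reference words and returns 0.5. Positional pairing only makes sense when the recognizer finds the same number of words as the manual annotation; a real evaluation would need a rule for unmatched words.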


Source journal

IEEE Journal of Translational Engineering in Health and Medicine (JTEHM), Engineering: Biomedical Engineering

CiteScore: 7.40

Self-citation rate: 2.90%

Articles published: (no value given)

Review time: 27 weeks

Journal description: The IEEE Journal of Translational Engineering in Health and Medicine is an open access product that bridges the engineering and clinical worlds, focusing on detailed descriptions of advanced technical solutions to a clinical need along with clinical results and healthcare relevance. The journal provides a platform for state-of-the-art technology directions in the interdisciplinary field of biomedical engineering, embracing engineering, life sciences and medicine. A unique aspect of the journal is its ability to foster collaboration between physicians and engineers in presenting broad and compelling real-world technological and engineering solutions that can be implemented in the interest of improving quality of patient care and treatment outcomes, thereby reducing costs and improving efficiency. The journal provides an active forum for clinical research and relevant state-of-the-art technology for members of all the IEEE societies that have an interest in biomedical engineering, and it reaches out directly to physicians and the medical community through the American Medical Association (AMA) and other clinical societies. The scope of the journal includes, but is not limited to, topics on: medical devices, healthcare delivery systems, global healthcare initiatives, and ICT-based services; technological relevance to healthcare cost reduction; technology affecting healthcare management, decision-making, and policy; advanced technical work that is applied to solving specific clinical needs.