Combined approach to dysarthric speaker verification using data augmentation and feature fusion

IF 2.4 | CAS Zone 3 (Computer Science) | JCR Q2 (Acoustics) | Speech Communication | Pub Date: 2024-04-06 | DOI: 10.1016/j.specom.2024.103070

Shinimol Salim, Syed Shahnawazuddin, Waquar Ahmad

Speech Communication, vol. 160, Article 103070. Published 2024-04-06. https://www.sciencedirect.com/science/article/pii/S0167639324000426

Citations: 0

Abstract

This study addresses the challenges of adapting automatic speaker verification (ASV) systems to individuals with dysarthria, a speech disorder affecting intelligibility and articulation. The scarcity of dysarthric speech data is a significant obstacle to developing an effective ASV system. To mitigate the effects of data paucity, an out-of-domain data augmentation approach was employed, based on the observation that dysarthric speech often exhibits longer phoneme durations. Motivated by this observation, the duration of healthy speech data was modified with various stretching factors and the result pooled into training, which significantly reduced the error rate. A second analysis, beyond average phoneme duration, revealed that dysarthric speech contains crucial high-frequency spectral information. However, Mel-frequency cepstral coefficients (MFCC) are inherently designed to down-sample spectral information in the higher-frequency regions, and the same is true of Mel-filterbank features. To address this shortcoming, linear-filterbank cepstral coefficients (LFCC) were used in combination with MFCC features. While MFCC effectively captures certain aspects of dysarthric speech, LFCC complements it by capturing high-frequency details essential for accurate dysarthric speaker verification. The proposed feature fusion reduces spectral information loss and further lowers error rates. To establish the value of combining MFCC and LFCC features in an ASV system for speakers with dysarthria, comprehensive experiments were conducted. The fusion of MFCC and LFCC features was compared with several other front-end acoustic features: Mel-filterbank features, linear filterbank features, wavelet filterbank features, linear prediction cepstral coefficients (LPCC), frequency-domain LPCC, and constant Q cepstral coefficients (CQCC).
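The MFCC and LFCC fusion described above can be illustrated with a minimal NumPy sketch. It is not the authors' front end: the abstract gives no configuration, so the filter count, cepstral order, FFT size, hop length, and the function names (`triangular_filterbank`, `cepstral_features`, `fused_features`) are all assumptions chosen for illustration. The only difference between the two feature streams here is whether the triangular filters are spaced evenly on the mel axis or on the linear Hz axis; the fused feature is their frame-level concatenation.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def triangular_filterbank(n_filters, n_fft, sr, scale="mel"):
    """Triangular filters whose edges are evenly spaced on the mel or linear Hz axis."""
    f_max = sr / 2.0
    if scale == "mel":
        edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(f_max), n_filters + 2))
    else:
        edges_hz = np.linspace(0.0, f_max, n_filters + 2)
    fft_freqs = np.linspace(0.0, f_max, n_fft // 2 + 1)
    bank = np.zeros((n_filters, fft_freqs.size))
    for i in range(n_filters):
        lo, mid, hi = edges_hz[i:i + 3]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def dct_matrix(n_ceps, n_filters):
    """Unnormalised DCT-II basis mapping log filterbank energies to cepstra."""
    k = np.arange(n_ceps)[:, None]
    n = np.arange(n_filters)[None, :]
    return np.cos(np.pi * k * (2 * n + 1) / (2.0 * n_filters))

def cepstral_features(signal, sr, scale, n_filters=26, n_ceps=13,
                      n_fft=512, hop=160):
    """Cepstral coefficients per frame; all default sizes are assumed values."""
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(n_fft)            # windowed frames
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2    # power spectrum
    bank = triangular_filterbank(n_filters, n_fft, sr, scale)
    log_energy = np.log(power @ bank.T + 1e-10)         # floor avoids log(0)
    return log_energy @ dct_matrix(n_ceps, n_filters).T

def fused_features(signal, sr):
    """Frame-level concatenation of mel-spaced and linearly spaced cepstra."""
    mfcc = cepstral_features(signal, sr, "mel")
    lfcc = cepstral_features(signal, sr, "linear")
    return np.hstack([mfcc, lfcc])
```

With a 16 kHz sampling rate and these assumed settings, about half of the linearly spaced filters peak above 4 kHz, while the mel-spaced bank places a smaller share there. That is the resolution difference the fusion is meant to compensate for.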
The approaches were evaluated using both i-vector- and x-vector-based representations, comparing systems developed using MFCC and LFCC features individually and in combination. The experimental results show substantial improvements over the baseline ASV system: a 25.78% reduction in equal error rate (EER) for i-vector models and a 23.66% reduction in EER for x-vector models. The effect of feature concatenation across dysarthria severity levels (low, medium, and high) was also studied, and the proposed approach was found to be highly effective in those cases as well.
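For readers unfamiliar with the metric, the sketch below shows one common way to compute an EER from verification scores, along with the arithmetic behind a percentage reduction. It assumes that higher scores mean "same speaker" and that the reported reductions are relative to the baseline EER; the abstract does not state either point, and the function names and example scores are hypothetical.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Return (EER, threshold), sweeping every observed score as a threshold."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    # False rejection rate: genuine trials scoring below the threshold.
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    # False acceptance rate: impostor trials scoring at or above the threshold.
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    # The EER is taken where the two error rates are closest to each other.
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2.0, thresholds[i]

def relative_reduction(baseline_eer, new_eer):
    """Percentage reduction of new_eer relative to baseline_eer (assumed definition)."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer

# Hypothetical scores, not data from the paper.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.1, 0.2, 0.3, 0.6]
eer, threshold = equal_error_rate(genuine, impostor)
```

Sweeping only the observed scores keeps the sketch short. Evaluation toolkits typically interpolate between operating points, which matters when the trial list is small.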


Source journal

Speech Communication (Engineering & Technology: Computer Science, Interdisciplinary Applications)

CiteScore: 6.80

Self-citation rate: 6.20%

Review time: 19.2 weeks

Journal description: Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal's primary objectives are: • to present a forum for the advancement of human and human-machine speech communication science; • to stimulate cross-fertilization between different fields of this domain; • to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.
