Combined approach to dysarthric speaker verification using data augmentation and feature fusion

IF 2.4 3区 计算机科学 Q2 ACOUSTICS Speech Communication Pub Date : 2024-04-06 DOI:10.1016/j.specom.2024.103070
Shinimol Salim , Syed Shahnawazuddin , Waquar Ahmad
{"title":"Combined approach to dysarthric speaker verification using data augmentation and feature fusion","authors":"Shinimol Salim ,&nbsp;Syed Shahnawazuddin ,&nbsp;Waquar Ahmad","doi":"10.1016/j.specom.2024.103070","DOIUrl":null,"url":null,"abstract":"<div><p>In this study, the challenges of adapting automatic speaker verification (ASV) systems to accommodate individuals with dysarthria, a speech disorder affecting intelligibility and articulation, are addressed. The scarcity of dysarthric speech data presents a significant obstacle in the development of an effective ASV system. To mitigate the detrimental effects of data paucity, an out-of-domain data augmentation approach was employed based on the observation that dysarthric speech often exhibits longer phoneme duration. Motivated by this observation, the duration of healthy speech data was modified with various stretching factors and then pooled into training, resulting in a significant reduction in the error rate. In addition to analyzing average phoneme duration, another analysis revealed that dysarthric speech contains crucial high-frequency spectral information. However, Mel-frequency cepstral coefficients (MFCC) are inherently designed to down-sample spectral information in the higher-frequency regions, and the same is true for Mel-filterbank features. To address this shortcoming, Linear-filterbank cepstral coefficients (LFCC) were used in combination with MFCC features. While MFCC effectively captures certain aspects of dysarthric speech, LFCC complements this by capturing high-frequency details essential for accurate dysarthric speaker verification. This proposed feature fusion effectively minimizes spectral information loss, further reducing error rates. To support the significance of combination of MFCC and LFCC features in an automatic speaker verification system for speakers with dysarthria, comprehensive experimentation was conducted. The fusion of MFCC and LFCC features was compared with several other front-end acoustic features, such as Mel-filterbank features, linear filterbank features, wavelet filterbank features, linear prediction cepstral coefficients (LPCC), frequency domain LPCC, and constant Q cepstral coefficients (CQCC). The approaches were evaluated using both <em>i</em>-vector and <em>x</em>-vector-based representation, comparing systems developed using MFCC and LFCC features individually and in combination. The experimental results presented in this paper demonstrate substantial improvements, with a 25.78% reduction in equal error rate (EER) for <em>i</em>-vector models and a 23.66% reduction in EER for <em>x</em>-vector models when compared to the baseline ASV system. Additionally, the effect of feature concatenation with variation in dysarthria severity levels (low, medium, and high) was studied, and the proposed approach was found to be highly effective in those cases as well.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"160 ","pages":"Article 103070"},"PeriodicalIF":2.4000,"publicationDate":"2024-04-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Speech Communication","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167639324000426","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ACOUSTICS","Score":null,"Total":0}
引用次数: 0

Abstract

In this study, the challenges of adapting automatic speaker verification (ASV) systems to accommodate individuals with dysarthria, a speech disorder affecting intelligibility and articulation, are addressed. The scarcity of dysarthric speech data presents a significant obstacle in the development of an effective ASV system. To mitigate the detrimental effects of data paucity, an out-of-domain data augmentation approach was employed based on the observation that dysarthric speech often exhibits longer phoneme duration. Motivated by this observation, the duration of healthy speech data was modified with various stretching factors and then pooled into training, resulting in a significant reduction in the error rate. In addition to analyzing average phoneme duration, another analysis revealed that dysarthric speech contains crucial high-frequency spectral information. However, Mel-frequency cepstral coefficients (MFCC) are inherently designed to down-sample spectral information in the higher-frequency regions, and the same is true for Mel-filterbank features. To address this shortcoming, Linear-filterbank cepstral coefficients (LFCC) were used in combination with MFCC features. While MFCC effectively captures certain aspects of dysarthric speech, LFCC complements this by capturing high-frequency details essential for accurate dysarthric speaker verification. This proposed feature fusion effectively minimizes spectral information loss, further reducing error rates. To support the significance of combination of MFCC and LFCC features in an automatic speaker verification system for speakers with dysarthria, comprehensive experimentation was conducted. The fusion of MFCC and LFCC features was compared with several other front-end acoustic features, such as Mel-filterbank features, linear filterbank features, wavelet filterbank features, linear prediction cepstral coefficients (LPCC), frequency domain LPCC, and constant Q cepstral coefficients (CQCC). The approaches were evaluated using both i-vector and x-vector-based representation, comparing systems developed using MFCC and LFCC features individually and in combination. The experimental results presented in this paper demonstrate substantial improvements, with a 25.78% reduction in equal error rate (EER) for i-vector models and a 23.66% reduction in EER for x-vector models when compared to the baseline ASV system. Additionally, the effect of feature concatenation with variation in dysarthria severity levels (low, medium, and high) was studied, and the proposed approach was found to be highly effective in those cases as well.

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
利用数据扩增和特征融合的组合方法验证发音障碍者
构音障碍是一种影响清晰度和发音的语言障碍,本研究探讨了如何调整自动说话者验证(ASV)系统以适应构音障碍患者的挑战。构音障碍语音数据的匮乏是开发有效 ASV 系统的一大障碍。为了减轻数据匮乏带来的不利影响,我们采用了一种域外数据增强方法,该方法基于对发音障碍语音通常表现出较长音素持续时间的观察。在这一观察结果的激励下,健康语音数据的持续时间被各种拉伸因子修改,然后汇集到训练中,从而显著降低了错误率。除了分析平均音素持续时间外,另一项分析显示,发音障碍语音包含重要的高频频谱信息。然而,Mel-frequency cepstral coefficients(MFCC)的固有设计会降低高频区域的频谱信息采样率,Mel-filterbank 特征也是如此。为了解决这一缺陷,我们将线性滤波器组共谱系数(LFCC)与 MFCC 特征结合使用。MFCC 能有效捕捉发音障碍语音的某些方面,而 LFCC 则能捕捉高频细节,从而对准确验证发音障碍说话者起到补充作用。这种拟议的特征融合有效地减少了频谱信息损失,进一步降低了错误率。为了证明 MFCC 和 LFCC 特征在构音障碍说话人自动验证系统中的重要性,我们进行了全面的实验。MFCC 和 LFCC 特征的融合与其他几种前端声学特征进行了比较,如 Mel 滤波库特征、线性滤波库特征、小波滤波库特征、线性预测前谱系数 (LPCC)、频域 LPCC 和常数 Q 前谱系数 (CQCC)。本文使用基于 i 向量和 x 向量的表示方法对这些方法进行了评估,并对使用 MFCC 和 LFCC 特征单独或组合开发的系统进行了比较。与基线 ASV 系统相比,本文介绍的实验结果表明,i-vector 模型的等效错误率 (EER) 降低了 25.78%,x-vector 模型的等效错误率降低了 23.66%。此外,还研究了构音障碍严重程度变化(低、中、高)对特征连接的影响,发现所提出的方法在这些情况下也非常有效。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
Speech Communication
Speech Communication 工程技术-计算机:跨学科应用
CiteScore
6.80
自引率
6.20%
发文量
94
审稿时长
19.2 weeks
期刊介绍: Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal''s primary objectives are: • to present a forum for the advancement of human and human-machine speech communication science; • to stimulate cross-fertilization between different fields of this domain; • to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.
期刊最新文献
Fixed frequency range empirical wavelet transform based acoustic and entropy features for speech emotion recognition AFP-Conformer: Asymptotic feature pyramid conformer for spoofing speech detection A robust temporal map of speech monitoring from planning to articulation The combined effects of bilingualism and musicianship on listeners’ perception of non-native lexical tones Evaluating the effects of continuous pitch and speech tempo modifications on perceptual speaker verification performance by familiar and unfamiliar listeners
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1