Recent works show that speech separation guided diarization (SSGD) is an increasingly promising direction, mainly thanks to recent progress in speech separation. SSGD performs diarization by first separating the speakers and then applying voice activity detection (VAD) on each separated stream. In this work we conduct an in-depth study of SSGD in the conversational telephone speech (CTS) domain, focusing mainly on low-latency streaming diarization applications. We consider three state-of-the-art speech separation (SSep) algorithms and study their performance in both online and offline scenarios, considering non-causal and causal implementations as well as continuous SSep (CSS) windowed inference. We compare different SSGD algorithms on two widely used CTS datasets, CALLHOME and Fisher Corpus (Part 1 and 2), and evaluate both separation and diarization performance. To improve performance, a novel, causal and computationally efficient leakage removal algorithm is proposed, which significantly decreases false alarms. We also explore, for the first time, fully end-to-end SSGD integration between SSep and VAD modules. Crucially, this enables fine-tuning on real-world data for which oracle speaker sources are not available. Our best model achieves 8.8% DER on CALLHOME, outperforming the current state-of-the-art end-to-end neural diarization model despite being trained on an order of magnitude less data and having significantly lower latency (0.1 s vs. 1 s). Finally, we show that the separated signals can also be readily used for automatic speech recognition, reaching performance close to that obtained with oracle sources in some configurations.
Title: End-to-end integration of speech separation and voice activity detection for low-latency diarization of telephone conversations
Authors: Giovanni Morrone, Samuele Cornell, Luca Serafini, Enrico Zovato, Alessio Brutti, Stefano Squartini
DOI: 10.1016/j.specom.2024.103081
Speech Communication, vol. 161, Article 103081 (published 2024-05-11)
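The SSGD pipeline summarized above (separate the speakers first, then run VAD on each stream) can be sketched in a few lines. This is a toy illustration only: it assumes the two streams are already separated and uses a simple frame-energy VAD with made-up frame length and threshold, whereas the paper's system uses neural separation and VAD modules.

```python
import math

def frame_energy_vad(stream, frame_len=160, threshold=1e-3):
    """Energy-based VAD: one boolean activity flag per non-overlapping frame."""
    flags = []
    for start in range(0, len(stream) - frame_len + 1, frame_len):
        frame = stream[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        flags.append(energy > threshold)
    return flags

def ssgd_diarize(separated_streams, frame_len=160, threshold=1e-3):
    """SSGD sketch: VAD on each separated stream; the stream index is the speaker label."""
    return {spk: frame_energy_vad(s, frame_len, threshold)
            for spk, s in enumerate(separated_streams)}

# Toy example at 8 kHz: speaker 0 talks in the first half, speaker 1 in the second.
sine = [0.5 * math.sin(2 * math.pi * 220 * t / 8000) for t in range(800)]
silence = [0.0] * 800
diar = ssgd_diarize([sine + silence, silence + sine])
```

The per-speaker frame flags map directly to diarization segments; the paper's leakage-removal step would additionally suppress frames where a stream merely echoes the other speaker.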
Pub Date: 2024-05-01 | DOI: 10.1016/j.specom.2024.103084
Ping Tang, Shanpeng Li, Yanan Shen, Qianxi Yu, Yan Feng
Children with cochlear implants (CIs) face challenges in tonal perception under noise. Nevertheless, our previous research demonstrated that seeing visual-articulatory cues (speakers' facial/head movements) helped these children perceive isolated tones better, particularly in noisy environments, with those implanted earlier gaining more benefits. However, tones in daily speech typically occur in sentence contexts, where visual cues are largely reduced compared to isolated contexts. It was thus unclear whether the visual benefits to tonal perception still hold in these challenging sentence contexts. Therefore, this study tested 64 children with CIs and 64 age-matched normal-hearing (NH) children. Target tones in sentence-medial position were presented in audio-only (AO) or audiovisual (AV) conditions, in quiet and noisy environments. Children selected the target tone in a picture-pointing task. The results showed that, while NH children did not show any perception difference between AO and AV conditions, children with CIs significantly improved their perceptual accuracy from AO to AV conditions. The degree of improvement was negatively correlated with their implantation ages. Therefore, children with CIs were able to use visual-articulatory cues to facilitate their tonal perception even in sentence contexts, and earlier auditory experience might be important in shaping this ability.
Title: Visual-articulatory cues facilitate children with CIs to better perceive Mandarin tones in sentences
Authors: Ping Tang, Shanpeng Li, Yanan Shen, Qianxi Yu, Yan Feng
DOI: 10.1016/j.specom.2024.103084
Speech Communication, vol. 160, Article 103084
Pub Date: 2024-05-01 | DOI: 10.1016/j.specom.2024.103082
Dina El Zarka, Anneliese Kelterer, Michele Gubian, Barbara Schuppler
This paper investigates the prosody of sentences elicited in three Information Structure (IS) conditions: all-new, theme-rheme and rhematic focus-background. The sentences were produced by 18 speakers of Egyptian Arabic (EA). This is the first quantitative study to provide a comprehensive analysis of holistic f0 contours (by means of GAMM) and configurations of f0, duration and intensity (by means of FPCA) associated with the three IS conditions, both across and within speakers. A significant difference between focus-background and the other information structure conditions was found, but also strong inter-speaker variation in terms of strategies and the degree to which these strategies were applied. The results suggest that post-focus register lowering and the duration of the stressed syllables of the focused and the utterance-final word are more consistent cues to focus than a higher peak of the focus accent. In addition, some independence of duration and intensity from f0 could be identified. These results thus support the assumption that, when focus is marked prosodically in EA, it is marked by prominence. Nevertheless, the fact that a considerable number of EA speakers did not apply prosodic marking and the fact that prosodic focus marking was gradient rather than categorical suggest that EA does not have a fully conventionalized prosodic focus construction.
Title: The prosody of theme, rheme and focus in Egyptian Arabic: A quantitative investigation of tunes, configurations and speaker variability
Authors: Dina El Zarka, Anneliese Kelterer, Michele Gubian, Barbara Schuppler
DOI: 10.1016/j.specom.2024.103082
Speech Communication, vol. 160, Article 103082 (open access)
Pub Date: 2024-05-01 | DOI: 10.1016/j.specom.2024.103071
Sanli Tian, Zehan Li, Zhaobiao Lyv, Gaofeng Cheng, Qing Xiao, Ta Li, Qingwei Zhao
Knowledge distillation (KD) is a popular model compression method that improves the performance of lightweight models by transferring knowledge from a teacher model to a student model. However, applying KD to connectionist temporal classification (CTC) based ASR models is challenging due to CTC's peaky posterior property. In this paper, we propose to address this issue by treating non-blank and blank frames differently, for two main reasons. First, the non-blank frames in the teacher model's posterior matrix and hidden representations provide more acoustic and linguistic information than the blank frames, but non-blank frames account for only a small fraction of all frames, leading to a severe learning imbalance problem. Second, the non-blank tokens in the teacher's blank-frame posteriors exhibit irregular probability distributions, negatively impacting the student model's learning. Thus, we propose to factorize the distillation of non-blank and blank frames and further combine them into a progressive KD framework, which contains three incremental stages that let the student model gradually build up its knowledge. The first stage involves a simple binary classification KD task, in which the student learns to distinguish between non-blank and blank frames, as the two types of frames are learned separately in subsequent stages. The second stage is a factorized representation-based KD, in which hidden representations are divided into non-blank and blank frames so that both can be distilled in a balanced manner. In the third stage, the student learns from the teacher's posterior matrix through our proposed method, factorized KL-divergence (FKL), which performs different operations on blank and non-blank frame posteriors to alleviate the imbalance issue and reduce the influence of irregular probability distributions.
Compared to the baseline, our proposed method achieves 22.5% relative CER reduction on the Aishell-1 dataset, 23.0% relative WER reduction on the Tedlium-2 dataset, and 17.6% relative WER reduction on the LibriSpeech dataset. To show the generalization of our method, we also evaluate our method on the hybrid CTC/Attention architecture as well as on scenarios with cross-model topology KD.
Title: Factorized and progressive knowledge distillation for CTC-based ASR models
Authors: Sanli Tian, Zehan Li, Zhaobiao Lyv, Gaofeng Cheng, Qing Xiao, Ta Li, Qingwei Zhao
DOI: 10.1016/j.specom.2024.103071
Speech Communication, vol. 160, Article 103071
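The factorization idea behind FKL can be illustrated with a minimal sketch, assuming frame-level posteriors as plain lists: frames whose teacher argmax is the blank token are pooled separately from non-blank frames, and the two averaged KL terms are recombined with separate weights. The weights `alpha`/`beta` and the exact per-group treatment are illustrative assumptions, not the paper's definition.

```python
import math

def kl_divergence(p, q, eps=1e-8):
    """Frame-level KL(p || q) with a small floor to avoid log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def factorized_kd_loss(teacher, student, blank=0, alpha=1.0, beta=0.1):
    """Sketch of factorized KD: average KL separately over blank-dominated and
    non-blank teacher frames, then recombine, so the many blank frames cannot
    drown out the informative non-blank ones."""
    blank_kls, nonblank_kls = [], []
    for t_frame, s_frame in zip(teacher, student):
        is_blank = max(range(len(t_frame)), key=t_frame.__getitem__) == blank
        (blank_kls if is_blank else nonblank_kls).append(kl_divergence(t_frame, s_frame))
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return alpha * mean(nonblank_kls) + beta * mean(blank_kls)

# Frame 0 is blank-dominated (peaky CTC posterior), frame 1 is not.
teacher = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05]]
mismatch_blank = factorized_kd_loss(teacher, [[0.8, 0.1, 0.1], [0.05, 0.9, 0.05]])
mismatch_nonblank = factorized_kd_loss(teacher, [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]])
```

With `alpha > beta`, an equal-sized mismatch on a non-blank frame costs more than on a blank frame, which is the imbalance correction the abstract describes.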
Pub Date: 2024-05-01 | DOI: 10.1016/j.specom.2024.103083
Benjamin Elie, Juraj Šimko, Alice Turk
This paper presents a model of speech articulation planning and generation based on General Tau Theory and Optimal Control Theory. Because General Tau Theory assumes that articulatory targets are always reached, the model accounts for speech variation via context-dependent articulatory targets. Targets are chosen via the optimization of a composite objective function. This function models three different task requirements: maximal intelligibility, minimal articulatory effort and minimal utterance duration. The paper shows that systematic phonetic variability can be reproduced by adjusting the weights assigned to each task requirement. Weights can be adjusted globally to simulate different speech styles, and can be adjusted locally to simulate different levels of prosodic prominence. The solution of the optimization procedure contains Tau equation parameter values for each articulatory movement, namely position of the articulator at the movement offset, movement duration, and a parameter which relates to the shape of the movement’s velocity profile. The paper presents simulations which illustrate the ability of the model to predict or reproduce several well-known characteristics of speech. These phenomena include close-to-symmetric velocity profiles for articulatory movement, variation related to speech rate, centralization of unstressed vowels, lengthening of stressed vowels, lenition of unstressed lingual stop consonants, and coarticulation of stop consonants.
Title: Optimization-based planning of speech articulation using general Tau Theory
Authors: Benjamin Elie, Juraj Šimko, Alice Turk
DOI: 10.1016/j.specom.2024.103083
Speech Communication, vol. 160, Article 103083 (open access)
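The weight-trading mechanism can be caricatured with a stand-in objective. The terms below are crude proxies invented for illustration (the actual model optimizes Tau-equation parameters against intelligibility, effort, and duration costs): raising the intelligibility weight, as under prosodic prominence, pulls the planned target toward its canonical value, while lowering it allows undershoot.

```python
def composite_cost(target, duration, canonical=1.0,
                   w_intel=1.0, w_effort=1.0, w_dur=1.0):
    """Hypothetical composite objective: intelligibility penalty grows with
    undershoot from the canonical target, effort with movement speed, and a
    third term penalizes duration itself."""
    intelligibility = (target - canonical) ** 2
    effort = (target / duration) ** 2  # crude proxy: faster/larger movements cost more
    return w_intel * intelligibility + w_effort * effort + w_dur * duration

def plan_movement(w_intel, w_effort, w_dur):
    """Grid-search the context-dependent target and duration minimizing the cost."""
    candidates = [(t / 20, d / 20) for t in range(1, 21) for d in range(1, 21)]
    return min(candidates,
               key=lambda td: composite_cost(td[0], td[1], w_intel=w_intel,
                                             w_effort=w_effort, w_dur=w_dur))

# High intelligibility weight (careful/prominent speech) vs. low (casual speech).
careful = plan_movement(w_intel=10.0, w_effort=1.0, w_dur=1.0)
casual = plan_movement(w_intel=0.5, w_effort=1.0, w_dur=1.0)
```

The careful setting plans a target near the canonical value; the casual setting accepts a strongly reduced one, mirroring phenomena such as vowel centralization under low prominence.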
Pub Date: 2024-04-21 | DOI: 10.1016/j.specom.2024.103072
Jiazhong Zeng, Jianxin Peng, Shuyin Xiang
The speech intelligibility index (SII) and speech transmission index (STI) are widely accepted objective metrics for assessing speech intelligibility. In previous work, the relationship between STI and Chinese speech intelligibility (CSI) scores was studied. In this paper, the relationship between SII and CSI scores in rooms, for elderly listeners aged 60–69 and over 70, is investigated using an auralization method under different background noise levels (40 dBA and 55 dBA) and different reverberation times. The results show that SII correlates well with the CSI scores of the elderly. To reach the same CSI score as young adults, the elderly need a larger SII value, and the required value increases with age. Since the hearing loss of the elderly is considered in the calculation of SII, the difference in required SII between elderly and young listeners is smaller than the corresponding difference in required STI at the same CSI score. This indicates that SII is a more consistent evaluation criterion across ages.
Title: Chinese speech intelligibility and speech intelligibility index for the elderly
Authors: Jiazhong Zeng, Jianxin Peng, Shuyin Xiang
DOI: 10.1016/j.specom.2024.103072
Speech Communication, vol. 160, Article 103072
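For readers unfamiliar with SII, its core is a band-importance-weighted audibility sum. The sketch below follows the simplified shape of the ANSI S3.5 computation (band SNR clipped to ±15 dB, mapped to [0, 1], weighted and summed) with made-up equal band weights; the full standard additionally models masking spread and hearing thresholds, the latter being why SII can account for elderly listeners as the abstract notes.

```python
def speech_intelligibility_index(snr_db_per_band, band_importance):
    """Simplified SII: per-band audibility is the band SNR clipped to
    [-15, +15] dB and mapped to [0, 1], then weighted by the band-importance
    function and summed."""
    assert abs(sum(band_importance) - 1.0) < 1e-6  # importance weights sum to 1
    sii = 0.0
    for snr, imp in zip(snr_db_per_band, band_importance):
        audibility = min(max((snr + 15.0) / 30.0, 0.0), 1.0)
        sii += imp * audibility
    return sii

# Four-band toy example with equal importance weights (real band-importance
# functions are tabulated per band in the standard).
quiet = speech_intelligibility_index([15, 15, 15, 15], [0.25] * 4)
noisy = speech_intelligibility_index([0, -5, 5, -15], [0.25] * 4)
```

An SII of 1.0 means fully audible speech in every important band; noise pushes the value toward 0, and the study above asks how large this value must be for elderly listeners to match young adults' CSI scores.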
Pub Date: 2024-04-06 | DOI: 10.1016/j.specom.2024.103070
Shinimol Salim, Syed Shahnawazuddin, Waquar Ahmad
In this study, the challenges of adapting automatic speaker verification (ASV) systems to accommodate individuals with dysarthria, a speech disorder affecting intelligibility and articulation, are addressed. The scarcity of dysarthric speech data presents a significant obstacle in the development of an effective ASV system. To mitigate the detrimental effects of data paucity, an out-of-domain data augmentation approach was employed, based on the observation that dysarthric speech often exhibits longer phoneme durations. Motivated by this observation, the duration of healthy speech data was modified with various stretching factors and the result pooled into training, yielding a significant reduction in the error rate. In addition to analyzing average phoneme duration, a further analysis revealed that dysarthric speech contains crucial high-frequency spectral information. However, Mel-frequency cepstral coefficients (MFCC) are inherently designed to down-sample spectral information in the higher-frequency regions, and the same is true for Mel-filterbank features. To address this shortcoming, linear-filterbank cepstral coefficients (LFCC) were used in combination with MFCC features. While MFCC effectively captures certain aspects of dysarthric speech, LFCC complements this by capturing high-frequency details essential for accurate dysarthric speaker verification. This proposed feature fusion effectively minimizes spectral information loss, further reducing error rates. To support the significance of the combination of MFCC and LFCC features in an automatic speaker verification system for speakers with dysarthria, comprehensive experimentation was conducted.
The approaches were evaluated using both i-vector and x-vector based representations, comparing systems developed using MFCC and LFCC features individually and in combination. The experimental results presented in this paper demonstrate substantial improvements, with a 25.78% reduction in equal error rate (EER) for i-vector models and a 23.66% reduction in EER for x-vector models when compared to the baseline ASV system. Additionally, the effect of feature concatenation across dysarthria severity levels (low, medium, and high) was studied, and the proposed approach was found to be highly effective in those cases as well.
Title: Combined approach to dysarthric speaker verification using data augmentation and feature fusion
Authors: Shinimol Salim, Syed Shahnawazuddin, Waquar Ahmad
DOI: 10.1016/j.specom.2024.103070
Speech Communication, vol. 160, Article 103070
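Two ingredients of the system described above, duration-based augmentation and MFCC+LFCC early fusion, can be sketched as follows. The linear-interpolation stretch and the plain per-frame concatenation are simplifying assumptions for illustration; the paper's features come from standard MFCC/LFCC extractors.

```python
def time_stretch(signal, factor):
    """Stretch a waveform by linear interpolation (factor > 1 lengthens it),
    mimicking the longer phoneme durations seen in dysarthric speech."""
    n_out = int(len(signal) * factor)
    out = []
    for i in range(n_out):
        pos = i / factor               # position in the original signal
        lo = int(pos)
        hi = min(lo + 1, len(signal) - 1)
        frac = pos - lo
        out.append((1 - frac) * signal[lo] + frac * signal[hi])
    return out

def fuse_features(mfcc_frame, lfcc_frame):
    """Early fusion: concatenate per-frame MFCC and LFCC vectors."""
    return list(mfcc_frame) + list(lfcc_frame)

stretched = time_stretch([0.0, 1.0, 0.0, -1.0], 2.0)   # twice as long
fused = fuse_features([1.0, 2.0], [3.0, 4.0])          # 4-dimensional frame
```

In practice the stretched healthy recordings are pooled with the dysarthric training data, and the fused frames feed the i-vector/x-vector extractors.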
Pub Date : 2024-04-01 DOI: 10.1016/j.specom.2024.103067
Nutan Singh, Priyanka Tripathi
Parkinson's Disease (PD) is a progressive neurodegenerative disorder that causes both motor and non-motor symptoms. Its symptoms develop slowly, making early identification difficult. Machine learning has significant potential to predict Parkinson's disease from features hidden in voice data. This work aimed to identify the most relevant features from a high-dimensional dataset, which helps classify Parkinson's Disease accurately with less computation time. Three individual voice-based datasets with various medical features were analyzed in this work. An Ensemble Feature Selection Algorithm (EFSA) based on filter, wrapper, and embedding algorithms, which picks highly relevant features for identifying Parkinson's Disease, is proposed and validated on the three voice-based datasets. These techniques can shorten training time, improve model accuracy, and minimize overfitting. We utilized different ML models such as K-Nearest Neighbors (KNN), Random Forest, Decision Tree, Support Vector Machine (SVM), Bagging Classifier, Multi-Layer Perceptron (MLP) Classifier, and Gradient Boosting. Each of these models was fine-tuned to ensure optimal performance within our specific context. Moreover, in addition to these established classifiers, we proposed an ensemble classifier based on majority voting over their predictions. Dataset-I achieves a classification accuracy of 97.6 %, F1-score of 97.9 %, precision of 98 % and recall of 98 %. Dataset-II achieves a classification accuracy of 90.2 %, F1-score of 90.2 %, precision of 90.2 %, and recall of 90.5 %. Dataset-III achieves 83.3 % accuracy, F1-score of 83.3 %, precision of 83.5 % and recall of 83.3 %. These results were obtained using 13 out of 23, 45 out of 754, and 17 out of 46 features from the respective datasets. The proposed EFSA model achieves higher accuracy and is more efficient than the other models on each dataset.
{"title":"An ensemble technique to predict Parkinson's disease using machine learning algorithms","authors":"Nutan Singh, Priyanka Tripathi","doi":"10.1016/j.specom.2024.103067","DOIUrl":"https://doi.org/10.1016/j.specom.2024.103067","url":null,"abstract":"<div><p>Parkinson's Disease (PD) is a progressive neurodegenerative disorder that causes both motor and non-motor symptoms. Its symptoms develop slowly, making early identification difficult. Machine learning has significant potential to predict Parkinson's disease from features hidden in voice data. This work aimed to identify the most relevant features from a high-dimensional dataset, which helps classify Parkinson's Disease accurately with less computation time. Three individual voice-based datasets with various medical features were analyzed in this work. An Ensemble Feature Selection Algorithm (EFSA) based on filter, wrapper, and embedding algorithms, which picks highly relevant features for identifying Parkinson's Disease, is proposed and validated on the three voice-based datasets. These techniques can shorten training time, improve model accuracy, and minimize overfitting. We utilized different ML models such as K-Nearest Neighbors (KNN), Random Forest, Decision Tree, Support Vector Machine (SVM), Bagging Classifier, Multi-Layer Perceptron (MLP) Classifier, and Gradient Boosting. Each of these models was fine-tuned to ensure optimal performance within our specific context. Moreover, in addition to these established classifiers, we proposed an ensemble classifier based on majority voting over their predictions. Dataset-I achieves a classification accuracy of 97.6 %, F<sub>1</sub>-score of 97.9 %, precision of 98 % and recall of 98 %. Dataset-II achieves a classification accuracy of 90.2 %, F<sub>1</sub>-score of 90.2 %, precision of 90.2 %, and recall of 90.5 %. Dataset-III achieves 83.3 % accuracy, F<sub>1</sub>-score of 83.3 %, precision of 83.5 % and recall of 83.3 %. These results were obtained using 13 out of 23, 45 out of 754, and 17 out of 46 features from the respective datasets. The proposed EFSA model achieves higher accuracy and is more efficient than the other models on each dataset.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"159 ","pages":"Article 103067"},"PeriodicalIF":3.2,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140547363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
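The abstract above describes its final ensemble step only briefly; one common reading is a hard majority vote over the individual classifiers' predictions. The sketch below (plain NumPy, with hypothetical toy labels) shows that aggregation rule in isolation; it illustrates majority voting in general, not the authors' exact combiner.

```python
import numpy as np

def majority_vote(predictions):
    """Hard-voting ensemble: each row of `predictions` holds one
    classifier's labels; the most frequent label per sample wins
    (ties resolve to the smallest label, as np.argmax keeps the first max)."""
    predictions = np.asarray(predictions)
    out = np.empty(predictions.shape[1], dtype=predictions.dtype)
    for j in range(predictions.shape[1]):
        labels, counts = np.unique(predictions[:, j], return_counts=True)
        out[j] = labels[np.argmax(counts)]
    return out

# Toy example: three classifiers voting on four samples (1 = PD, 0 = healthy)
votes = [[1, 0, 1, 1],
         [1, 1, 0, 1],
         [0, 0, 1, 1]]
print(majority_vote(votes))  # → [1 0 1 1]
```

With an odd number of base classifiers, as in this toy setup, every sample gets a strict majority and no tie-breaking is needed.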
This study investigates conversational feedback, that is, a listener's reaction in response to a speaker, a phenomenon which occurs in all natural interactions. Feedback depends on the main speaker's productions and in return supports the elaboration of the interaction. As a consequence, feedback production has a direct impact on the quality of the interaction.
This paper examines all types of feedback, from generic to specific feedback, the latter of which has received less attention in the literature. We also present a fine-grained labeling system introducing two sub-types of specific feedback: positive/negative and given/new. Following a literature review on linguistic and machine learning perspectives highlighting the main issues in feedback prediction, we present a model based on a set of multimodal features which predicts the possible position of feedback and its type. This computational model makes it possible to precisely identify the different features in the speaker's production (morpho-syntactic, prosodic and mimo-gestural) which play a role in triggering feedback from the listener; the model also evaluates their relative importance.
The main contribution of this study is twofold: we sought to improve 1/ the model's performance in comparison with other approaches relying on a small set of features, and 2/ the model's interpretability, in particular by investigating feature importance. By integrating all the different modalities as well as high-level features, our model is uniquely positioned to be applied to French corpora.
{"title":"A multimodal model for predicting feedback position and type during conversation","authors":"Auriane Boudin , Roxane Bertrand , Stéphane Rauzy , Magalie Ochs , Philippe Blache","doi":"10.1016/j.specom.2024.103066","DOIUrl":"https://doi.org/10.1016/j.specom.2024.103066","url":null,"abstract":"<div><p>This study investigates conversational feedback, that is, a listener's reaction in response to a speaker, a phenomenon which occurs in all natural interactions. Feedback depends on the main speaker's productions and in return supports the elaboration of the interaction. As a consequence, feedback production has a direct impact on the quality of the interaction.</p><p>This paper examines all types of feedback, from generic to specific feedback, the latter of which has received less attention in the literature. We also present a fine-grained labeling system introducing two sub-types of specific feedback: <em>positive/negative</em> and <em>given/new</em>. Following a literature review on linguistic and machine learning perspectives highlighting the main issues in feedback prediction, we present a model based on a set of multimodal features which predicts the possible position of feedback and its type. This computational model makes it possible to precisely identify the different features in the speaker's production (morpho-syntactic, prosodic and mimo-gestural) which play a role in triggering feedback from the listener; the model also evaluates their relative importance.</p><p>The main contribution of this study is twofold: we sought to improve 1/ the model's performance in comparison with other approaches relying on a small set of features, and 2/ the model's interpretability, in particular by investigating feature importance. 
By integrating all the different modalities as well as high-level features, our model is uniquely positioned to be applied to French corpora.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"159 ","pages":"Article 103066"},"PeriodicalIF":3.2,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639324000384/pdfft?md5=d3bb6a1d05cfbf539d30e718f252c2d8&pid=1-s2.0-S0167639324000384-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140331131","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
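The study above emphasizes evaluating the relative importance of multimodal features for interpretability. A common, model-agnostic way to estimate importance is permutation importance: shuffle one feature column and measure the drop in score. The abstract does not specify the authors' exact method, so the NumPy sketch below, with a toy stand-in model, is only an illustration of that general technique.

```python
import numpy as np

def permutation_importance(model, X, y, metric, n_repeats=10, seed=0):
    """Importance of feature j = baseline score minus the mean score
    obtained after randomly permuting column j of X."""
    rng = np.random.default_rng(seed)
    base = metric(y, model(X))
    imp = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        scores = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])   # break feature j only
            scores.append(metric(y, model(Xp)))
        imp[j] = base - np.mean(scores)
    return imp

# Toy setup: the label depends only on feature 0, so feature 0 should dominate
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
y = (X[:, 0] > 0).astype(int)
model = lambda X: (X[:, 0] > 0).astype(int)        # a "trained" stand-in
accuracy = lambda y, p: np.mean(y == p)
imp = permutation_importance(model, X, y, accuracy)
```

Here `imp[0]` is large (shuffling feature 0 destroys the prediction) while `imp[1]` and `imp[2]` are zero, mirroring how per-feature importance scores separate the features that actually trigger feedback from those that do not.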
Pub Date : 2024-04-01 DOI: 10.1016/j.specom.2024.103068
Szymon Drgas
In this article, a lightweight and interpretable speech intelligibility prediction network is proposed. It is based on the ESTOI metric with several extensions: a learned modulation filterbank, temporal attention, and taking into account the robustness of a given reference recording. The proposed network is differentiable, and therefore it can be applied as a loss function in speech enhancement systems. The method was evaluated using the Clarity Prediction Challenge dataset. Compared to MB-STOI, the best of the systems proposed in this paper reduced RMSE from 28.01 to 21.33. It also outperformed the best-performing systems from the Clarity Challenge, while its training does not require additional labels such as the speech enhancement system or talker identity. It also has small memory and computational requirements; therefore, it can potentially be used as a loss function to train a speech enhancement system. As it consumes fewer resources, the savings can be devoted to a larger speech enhancement neural network.
{"title":"Speech intelligibility prediction using generalized ESTOI with fine-tuned parameters","authors":"Szymon Drgas","doi":"10.1016/j.specom.2024.103068","DOIUrl":"https://doi.org/10.1016/j.specom.2024.103068","url":null,"abstract":"<div><p>In this article, a lightweight and interpretable speech intelligibility prediction network is proposed. It is based on the ESTOI metric with several extensions: a learned modulation filterbank, temporal attention, and taking into account the robustness of a given reference recording. The proposed network is differentiable, and therefore it can be applied as a loss function in speech enhancement systems. The method was evaluated using the Clarity Prediction Challenge dataset. Compared to MB-STOI, the best of the systems proposed in this paper reduced RMSE from 28.01 to 21.33. It also outperformed the best-performing systems from the Clarity Challenge, while its training does not require additional labels such as the speech enhancement system or talker identity. It also has small memory and computational requirements; therefore, it can potentially be used as a loss function to train a speech enhancement system. As it consumes fewer resources, the savings can be devoted to a larger speech enhancement neural network.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"159 ","pages":"Article 103068"},"PeriodicalIF":3.2,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140540077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
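ESTOI itself compares spectro-temporal modulation patterns of clean and degraded speech across many frequency bands. As a drastically simplified, single-band illustration of the underlying idea in the paper above (correlating short-time envelope segments of the two signals), here is a NumPy sketch; the frame and segment sizes are arbitrary toy choices, and this is not the proposed metric or its learned extensions.

```python
import numpy as np

def envelope_correlation_score(clean, degraded, frame=256, seg=10):
    """Average normalized correlation between clean and degraded
    short-time energy envelopes over sliding segments: a crude,
    mono-band stand-in for the intermediate-intelligibility idea in ESTOI."""
    def envelope(x):
        n = len(x) // frame
        return np.array([np.sum(x[i * frame:(i + 1) * frame] ** 2)
                         for i in range(n)])
    e_c, e_d = envelope(clean), envelope(degraded)
    scores = []
    for i in range(len(e_c) - seg + 1):
        a = e_c[i:i + seg] - e_c[i:i + seg].mean()
        b = e_d[i:i + seg] - e_d[i:i + seg].mean()
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        if denom > 0:                 # skip flat (zero-variance) segments
            scores.append(a @ b / denom)
    return float(np.mean(scores))

rng = np.random.default_rng(0)
clean = rng.standard_normal(8000) * np.sin(np.linspace(0, 20, 8000))  # modulated toy signal
identical = envelope_correlation_score(clean, clean)                  # perfect match → 1.0
noisy = envelope_correlation_score(clean, clean + 2.0 * rng.standard_normal(8000))
```

Because every step is a differentiable tensor operation (framing, squaring, means, inner products), the same structure can serve as a training loss, which is the property the paper exploits.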