Effect of individual characteristics on impressions of one’s own recorded voice
Hikaru Yanagida, Yusuke Ijima, Naohiro Tawara
Pub Date: 2025-11-27 | DOI: 10.1016/j.specom.2025.103335 | Speech Communication, vol. 176, Article 103335
This study aims to identify individual characteristics, such as age, gender, personality traits, and values, that influence the perception of one’s own recorded voice. Previous studies have shown that one’s own recorded voice is perceived differently from the voices of others and that these differences depend on individual characteristics, but only a limited set of characteristics has been examined. We conducted a large-scale subjective experiment with 141 Japanese participants covering a broad range of individual characteristics. Participants rated impressions of their own recorded voices and the voices of others, and we analyzed the relationship between each individual characteristic and the voice impressions. The results showed that characteristics not examined in previous studies, such as how frequently a person listens to their own recorded voice, influence the perception of one’s own recorded voice. We then used combinations of multiple individual characteristics, including those that were influential on their own, to predict impressions of one’s own recorded voice, and found that such combinations predicted the impressions better than any single characteristic.
{"title":"Effect of individual characteristics on impressions of one’s own recorded voice","authors":"Hikaru Yanagida , Yusuke Ijima , Naohiro Tawara","doi":"10.1016/j.specom.2025.103335","DOIUrl":"10.1016/j.specom.2025.103335","url":null,"abstract":"<div><div>This study aims to identify individual characteristics such as age, gender, personality traits, and values that influence the perception of one’s own recorded voice. While previous studies have shown that the perception of one’s own recorded voice is different from that of others, and that these differences are influenced by individual characteristics, only a limited number of individual characteristics were examined in past research. In our study, we conducted a large-scale subjective experiment with 141 Japanese participants using multiple individual characteristics. Participants evaluated impressions of their own recorded voices and the voices of others, and we analyzed the relationship between each of the individual characteristics and the voice impressions. Our findings showed that individual characteristics such as the frequency of listening to one’s own recorded voice (which had not been examined in the previous studies) influenced the perception of one’s own recorded voice. We further analyzed the use of combinations of multiple individual characteristics, including those that influenced impressions in a single use, to predict impressions of one’s own recorded voice and found that they were better predicted by the combination of multiple individual characteristics than by the use of a single individual characteristic.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"176 ","pages":"Article 103335"},"PeriodicalIF":3.0,"publicationDate":"2025-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145618539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Self-Supervised Learning for Speaker Recognition: A study and review
Theo Lepage, Reda Dehak
Pub Date: 2025-11-24 | DOI: 10.1016/j.specom.2025.103333 | Speech Communication, vol. 176, Article 103333
Deep learning models trained in a supervised setting have revolutionized audio and speech processing. However, their performance inherently depends on the quantity of human-annotated data, making them costly to scale and prone to poor generalization under unseen conditions. To address these challenges, Self-Supervised Learning (SSL) has emerged as a promising paradigm, leveraging vast amounts of unlabeled data to learn relevant representations. The application of SSL for Automatic Speech Recognition (ASR) has been extensively studied, but research on other downstream tasks, notably Speaker Recognition (SR), remains in its early stages. This work describes major SSL instance-invariance frameworks (e.g., SimCLR, MoCo, and DINO), initially developed for computer vision, along with their adaptation to SR. Various SSL methods for SR, proposed in the literature and built upon these frameworks, are also presented. An extensive review of these approaches is then conducted: (1) the effect of the main hyperparameters of SSL frameworks is investigated; (2) the role of SSL components is studied (e.g., data-augmentation, projector, positive sampling); and (3) SSL frameworks are evaluated on SR with in-domain and out-of-domain data, using a consistent experimental setup, and a comprehensive comparison of SSL methods from the literature is provided. Specifically, DINO achieves the best downstream performance and effectively models intra-speaker variability, although it is highly sensitive to hyperparameters and training conditions, while SimCLR and MoCo provide robust alternatives that effectively capture inter-speaker variability and are less prone to collapse. This work aims to highlight recent trends and advancements, identifying current challenges in the field.
{"title":"Self-Supervised Learning for Speaker Recognition: A study and review","authors":"Theo Lepage, Reda Dehak","doi":"10.1016/j.specom.2025.103333","DOIUrl":"10.1016/j.specom.2025.103333","url":null,"abstract":"<div><div>Deep learning models trained in a supervised setting have revolutionized audio and speech processing. However, their performance inherently depends on the quantity of human-annotated data, making them costly to scale and prone to poor generalization under unseen conditions. To address these challenges, Self-Supervised Learning (SSL) has emerged as a promising paradigm, leveraging vast amounts of unlabeled data to learn relevant representations. The application of SSL for Automatic Speech Recognition (ASR) has been extensively studied, but research on other downstream tasks, notably Speaker Recognition (SR), remains in its early stages. This work describes major SSL instance-invariance frameworks (e.g., SimCLR, MoCo, and DINO), initially developed for computer vision, along with their adaptation to SR. Various SSL methods for SR, proposed in the literature and built upon these frameworks, are also presented. An extensive review of these approaches is then conducted: (1) the effect of the main hyperparameters of SSL frameworks is investigated; (2) the role of SSL components is studied (e.g., data-augmentation, projector, positive sampling); and (3) SSL frameworks are evaluated on SR with in-domain and out-of-domain data, using a consistent experimental setup, and a comprehensive comparison of SSL methods from the literature is provided. Specifically, DINO achieves the best downstream performance and effectively models intra-speaker variability, although it is highly sensitive to hyperparameters and training conditions, while SimCLR and MoCo provide robust alternatives that effectively capture inter-speaker variability and are less prone to collapse. This work aims to highlight recent trends and advancements, identifying current challenges in the field.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"176 ","pages":"Article 103333"},"PeriodicalIF":3.0,"publicationDate":"2025-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145685355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adaptive weighting in a transformer framework for multimodal emotion recognition
Weijie Lu, Yunfeng Xu, Jintan Gu
Pub Date: 2025-11-24 | DOI: 10.1016/j.specom.2025.103332 | Speech Communication, vol. 176, Article 103332
Multimodal dialogue emotion recognition is rapidly emerging as a research hotspot with broad application prospects. In recent years, considerable effort has gone into fusing features from the different modalities, but detailed analysis of each modality's features remains insufficient, and the differing influence of each modality on the recognition result has not been fully considered. To address this, we propose a Transformer-based multimodal interaction model with an adaptive weighted fusion mechanism (TIAWFM). The model captures deep inter-modal correlations in multimodal emotion recognition, mitigating the limitations of unimodal representations. We find that incorporating the specific conversational context and dynamically allocating weights to each modality not only exploits the model's capacity more fully but also captures the emotional information embedded in the features more accurately. Extensive experiments on two benchmark multimodal datasets, IEMOCAP and MELD, show that TIAWFM dynamically integrates multimodal information effectively, yielding notable improvements in both the accuracy and robustness of emotion recognition.
{"title":"Adaptive weighting in a transformer framework for multimodal emotion recognition","authors":"Weijie Lu , Yunfeng Xu , Jintan Gu","doi":"10.1016/j.specom.2025.103332","DOIUrl":"10.1016/j.specom.2025.103332","url":null,"abstract":"<div><div>Multimodal Dialogue Emotion Recognition is rapidly emerging as a research hotspot with broad application prospects. In recent years, researchers have invested a lot of effort in the integration of modal feature information, but the detailed analysis of each modal feature information is still insufficient, and the difference in the influence of each modal feature information on the recognition results has not been fully considered. In order to solve this problem, we propose a Transformer-based multimodal interaction model with an adaptive weighted fusion mechanism (TIAWFM). The model effectively captures deep inter-modal correlations in multimodal emotion recognition tasks, mitigating the limitations of unimodal representations. We observe that incorporating specific conversational contexts and dynamically allocating weights to each modality not only fully leverages the model’s capabilities but also enables more accurate capture of emotional information embedded in the features. We conducted extensive experiments on two benchmark multimodal datasets, IEMOCAP and MELD. Experimental results demonstrate that TIAWFM exhibits significant advantages in dynamically integrating multimodal information, leading to notable improvements in both the accuracy and robustness of emotion recognition.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"176 ","pages":"Article 103332"},"PeriodicalIF":3.0,"publicationDate":"2025-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145594745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards unsupervised speech recognition without pronunciation models
Junrui Ni, Liming Wang, Yang Zhang, Kaizhi Qian, Heting Gao, Mark Hasegawa-Johnson, James Glass, Chang D. Yoo
Pub Date: 2025-11-15 | DOI: 10.1016/j.specom.2025.103330 | Speech Communication, vol. 176, Article 103330
Recent advancements in supervised automatic speech recognition (ASR) have achieved remarkable performance, largely due to the growing availability of large transcribed speech corpora. However, most languages lack sufficient paired speech and text data to effectively train these systems. In this article, we tackle the challenge of developing ASR systems without paired speech and text corpora, and further remove the reliance on a phoneme lexicon. We explore a new research direction, word-level unsupervised ASR, and experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling. Using a curated speech corpus containing a fixed number of English words, our system iteratively refines the word segmentation structure and achieves a word error rate between 20% and 23%, depending on the vocabulary size, without parallel transcripts, oracle word boundaries, or a pronunciation lexicon. The model surpasses previous unsupervised ASR models in the lexicon-free setting.
{"title":"Towards unsupervised speech recognition without pronunciation models","authors":"Junrui Ni , Liming Wang , Yang Zhang , Kaizhi Qian , Heting Gao , Mark Hasegawa-Johnson , James Glass , Chang D. Yoo","doi":"10.1016/j.specom.2025.103330","DOIUrl":"10.1016/j.specom.2025.103330","url":null,"abstract":"<div><div>Recent advancements in supervised automatic speech recognition (ASR) have achieved remarkable performance, largely due to the growing availability of large transcribed speech corpora. However, most languages lack sufficient paired speech and text data to effectively train these systems. In this article, we tackle the challenge of developing ASR systems without paired speech and text corpora by proposing the removal of reliance on a phoneme lexicon. We explore a new research direction: word-level unsupervised ASR, and experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling. Using a curated speech corpus containing a fixed number of English words, our system iteratively refines the word segmentation structure and achieves a word error rate of between 20%–23%, depending on the vocabulary size, without parallel transcripts, oracle word boundaries, or a pronunciation lexicon. This innovative model surpasses the performance of previous unsupervised ASR models under the lexicon-free setting.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"176 ","pages":"Article 103330"},"PeriodicalIF":3.0,"publicationDate":"2025-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145790253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The discriminative capacity of English segments in forensic speaker comparison
Paul Foulkes, Vincent Hughes, Kayleigh Peters, Jasmine Rouse
Pub Date: 2025-11-11 | DOI: 10.1016/j.specom.2025.103329 | Speech Communication, vol. 176, Article 103329
This study compares the relative performance of formant- and MFCC-based analyses of the same dataset, extending the work of Franco-Pedroso and Gonzalez-Rodriguez (2016). Using a corpus of read speech from 24 male English speakers, we extracted vowel formant data and segmentally-based MFCCs for all phonemes. Data were taken from three versions of the corpus: 10-minute and 3-minute samples with wholly automated segment labelling and data extraction (the 10U and 3U datasets), and 3-minute samples with manually corrected segment labelling and manual checking of formant tracking (the 3C dataset). The datasets were split in half and used for nine speaker discrimination tests: six tests using formants or MFCCs in each of the 10U, 3U and 3C datasets, and three fused systems combining formants and MFCCs for each dataset.
The formant-based tests revealed that the best performing segments were /ɪ/, /eɪ/, /aɪ/, /e/, /ʌ/ and /əː/. These vowels also performed well in MFCC-based tests, along with the three nasal consonants /m, n, ŋ/ and /k/. Relatively similar patterns were found for the three datasets. There was also a correlation with segment frequency: more frequent phonemes generally yielded better results. In addition, formant-based measures gave better EER and Cllr values than segmentally-based MFCCs. For formants, the best results came from the 10U dataset, while for MFCCs the best results came from the manually corrected 3C dataset. The effect of manual correction was starkest for consonants. Finally, the fused systems performed very well, with both formant- and MFCC-based systems producing EERs close to 0 in some cases. The best systems were those using the 3C dataset. The fused 10U system generally produced notably weaker LLRs, presumably because of the inevitably larger number of data labelling errors.
While the study is not forensically realistic, it has a number of implications for forensic speaker comparison. First, the best performing segments are those vowels in which formant separation is clear, and consonants (nasals) with formant structure. Second, manual correction of data is beneficial, especially for consonants. MFCCs are high-dimensional data relative to vowel formants taken at a segment’s midpoint, so misalignment of automated labelling and tracking is potentially more likely to have a deleterious effect on MFCCs. While the 10U dataset yielded the best scores for vowel formants, there is a danger that it overestimates the discriminatory power of those segments; a degree of manual correction is therefore worthwhile. Finally, although MFCC data yielded worse scores on a segment-by-segment basis, the fused system worked very well. Further research is therefore merited on MFCC-based analysis of segments as variables in speaker comparison, and more broadly in phonetic research.
{"title":"The discriminative capacity of English segments in forensic speaker comparison","authors":"Paul Foulkes, Vincent Hughes, Kayleigh Peters, Jasmine Rouse","doi":"10.1016/j.specom.2025.103329","DOIUrl":"10.1016/j.specom.2025.103329","url":null,"abstract":"<div><div>This study compares the relative performance of formant- and MFCC-based analyses of the same dataset, extending the work of Franco-Pedroso and Gonzalez-Rodriguez (2016). Using a corpus of read speech from 24 male English speakers we extracted vowel formant data and segmentally-based MFCCs of all phonemes. Data were taken from three versions of the corpus: 10 min and 3 min samples with wholly automated segment labelling and data extraction (the 10U and 3U datasets), and 3 min samples with manually corrected segment labelling and manual checking of formant tracking (the 3C dataset). The datasets were split in half and used for nine speaker discrimination tests: six tests using formants or MFCCs in each of the 10U, 3U and 3C datasets, and three fused systems combining formants and MFCCs for each dataset.</div><div>The formant-based tests revealed that the best performing segments were /ɪ/, /eɪ/, /aɪ/, /e/, /ʌ/ and /əː/. These vowels also performed well in MFCC-based tests, along with the three nasal consonants /m, n, ŋ/ and /k/. Relatively similar patterns were found for the three datasets. There was also a correlation with segment frequency: more frequent phonemes generally yielded better results. In addition, formant-based measures gave better EER and <em>C</em><sub>llr</sub> values than segmentally-based MFCCs. For formants, the best results came from the 10U dataset, while for MFCCs the best results came from the manually corrected 3C dataset. The effect of manual correction was starkest for consonants. Finally, the fused systems performed very well, with both formant- and MFCC-based systems producing EERs close to 0 in some cases. The best systems were those using the 3C dataset. The fused 10U system generally produced notably weaker LLRs, presumably because of the inevitably larger number of data labelling errors.</div><div>While the study is not forensically realistic, it has a number of implications for forensic speaker comparison. First, the best performing segments are those vowels in which formant separation is clear, and consonants (nasals) with formant structure. Second, manual correction of data is beneficial, especially for consonants. MFCCs are high dimensional data relative to vowel formants taken at a segment’s midpoint. Misalignment of automated labelling and tracking is thus potentially more likely to have a deleterious effect on MFCCs. While the 10U dataset yielded the best scores for vowel formants, there is a danger that it overestimates the discriminatory power of those segments. A degree of manual correction is therefore worthwhile. Finally, although MFCC data yielded worse scores on a segment by segment basis, the fused system worked very well. 
Further research is therefore merited on MFCC-based analysis of segments as variables in speaker comparison, and more broadly in phonetic research.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"176 ","pages":"Article 103329"},"PeriodicalIF":3.0,"publicationDate":"2025-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145790252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
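For readers unfamiliar with the two system metrics used above, the sketch below computes EER and Cllr from same-speaker and different-speaker log-likelihood ratios (natural-log LLRs). It is a generic implementation of the standard definitions, with synthetic scores; it is not the scoring pipeline used in the study.

```python
# Generic EER and Cllr computation from natural-log likelihood-ratio scores
# (illustrative sketch, not the study's scoring pipeline).
import numpy as np

def eer(ss_scores: np.ndarray, ds_scores: np.ndarray) -> float:
    # Sweep thresholds over all observed scores; EER is where miss rate ≈ false-alarm rate.
    thresholds = np.sort(np.concatenate([ss_scores, ds_scores]))
    miss = np.array([(ss_scores < t).mean() for t in thresholds])
    fa = np.array([(ds_scores >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(miss - fa))
    return (miss[idx] + fa[idx]) / 2

def cllr(ss_llrs: np.ndarray, ds_llrs: np.ndarray) -> float:
    # Cllr = 1/2 [ mean log2(1 + 1/LR) over same-speaker + mean log2(1 + LR) over different-speaker ]
    return 0.5 * (np.mean(np.log2(1 + np.exp(-ss_llrs))) +
                  np.mean(np.log2(1 + np.exp(ds_llrs))))

rng = np.random.default_rng(1)
ss = rng.normal(2.0, 1.0, 500)    # hypothetical same-speaker LLRs
ds = rng.normal(-2.0, 1.0, 5000)  # hypothetical different-speaker LLRs
print(f"EER = {eer(ss, ds):.3f}, Cllr = {cllr(ss, ds):.3f}")
```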
Ultrasound imaging in second language research: Systematic review and thematic analysis
Eija M.A. Aalto, Hana Ben Asker, Lucie Ménard, Walcir Cardoso, Catherine Laporte
Pub Date: 2025-11-01 | DOI: 10.1016/j.specom.2025.103324 | Speech Communication, vol. 175, Article 103324
Several publications have explored second language (L2) articulation through lingual ultrasound imaging. This systematic review and thematic analysis collates and evaluates these studies, focusing on methodologies, experimental setups, and findings. The review includes 31 works: 23 on ultrasound biofeedback and 8 on characterizing L2 articulation. English is the predominant language studied (82% as L1 or L2), with participants mainly young adults (2–60 participants per study). The 23 ultrasound biofeedback studies showed substantial variation in the number and length of sessions, and included 16 PICO studies (i.e., study designs with participants, intervention, controls/comparison group, and outcome) in which ultrasound biofeedback was compared with auditory feedback and/or control conditions. Data analysis in the biofeedback studies often included acoustic or perceptual assessments in addition to, or instead of, ultrasound data analysis. The results indicate that ultrasound biofeedback is effective for improving L2 articulation. However, the PICO studies revealed that while ultrasound biofeedback may offer certain advantages, these findings remain preliminary and warrant further investigation. Learner characteristics and target selection may affect biofeedback efficacy. Ultrasound also proved valuable for characterizing L2 articulation by revealing articulatory and coarticulatory patterns, particularly for English sounds such as /ɹ/, /l/, and various vowels. L2 characterization studies frequently used dynamic speech movement analysis. Moving forward, researchers are encouraged to use dynamic movement analysis in biofeedback studies as well, to deepen understanding of articulation processes. Expanding linguistic and demographic diversity in future research is essential to capturing language heterogeneity.
{"title":"Ultrasound imaging in second language research: Systematic review and thematic analysis","authors":"Eija M.A. Aalto , Hana Ben Asker , Lucie Ménard , Walcir Cardoso , Catherine Laporte","doi":"10.1016/j.specom.2025.103324","DOIUrl":"10.1016/j.specom.2025.103324","url":null,"abstract":"<div><div>Several publications have explored second language (L2) articulation through lingual ultrasound imaging technology. This systematic review and thematic analysis collate and evaluate these studies, focusing on methodologies, experimental setups, and findings. The review includes 31 works: 23 on ultrasound biofeedback and 8 on characterizing L2 articulation. English is the predominant language studied (82 % as L1 or L2), with participants mainly young adults (2–60 participants per study). The 23 ultrasound biofeedback studies showed significant variation in session numbers and length, including 16 PICO studies (i.e. study design with participants, intervention, controls/comparison group, outcome) where ultrasound biofeedback was compared to auditory feedback and/or control conditions. Data analysis of biofeedback studies often included acoustic or perceptual assessments in addition or instead of ultrasound data analysis. Analysis of results indicate that ultrasound biofeedback is effective for improving L2 articulation. However, the PICO studies revealed that while ultrasound biofeedback may offer certain advantages, these findings remain preliminary and warrant further investigation. Learner characteristics and target selection may affect biofeedback efficacy. Ultrasound also proved valuable for characterizing L2 articulation by showing articulatory and coarticulatory patterns, particularly in English sounds like /ɹ/, /l/, and various vowels. L2 characterization studies frequently used dynamic speech movement analysis. Moving forward, researchers are encouraged to use dynamic movement analysis also in biofeedback studies to deepen understanding of articulation processes. Expanding linguistic and demographic diversity in future research is essential to capturing language heterogeneity.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"175 ","pages":"Article 103324"},"PeriodicalIF":3.0,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145474243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Do all features matter? Layer-wise feature probing of self-supervised speech models for dysarthria severity classification
Paban Sapkota, Harsh Srivastava, Hemant Kumar Kathania, Shrikanth Narayanan, Sudarsana Reddy Kadiri
Pub Date: 2025-11-01 | DOI: 10.1016/j.specom.2025.103326 | Speech Communication, vol. 175, Article 103326
Estimating the severity of dysarthria, a speech disorder arising from neurological conditions, is important in medicine: it supports diagnosis, early detection, and personalized treatment. Significant progress has been made in leveraging self-supervised learning (SSL) models as feature extractors for various classification tasks, demonstrating their effectiveness. Building on this, this paper examines whether all features extracted from SSL models are necessary for optimal dysarthria severity classification from speech. We focus on layer-wise feature analysis of one base model, Wav2Vec2-base, and four large models, Wav2Vec2-large, HuBERT-large, Data2Vec-large, and WavLM-large, using a Convolutional Neural Network (CNN) as the classifier and mel-frequency cepstral coefficient (MFCC) features as the baseline. Experiments showed that the later transformer layers of the SSL models were more effective for dysarthria severity classification than the earlier layers, as the later layers better capture articulation and the complex temporal patterns refined in the mid layers. More specifically, embeddings from transformer encoder layer 23 of HuBERT-large yielded the best performance among the models examined, possibly due to HuBERT’s hierarchical learning from unsupervised clustering. To assess whether all dimensions are important, we examined the impact of varying the feature dimensionality and found that reducing it from 1024 to 32 dimensions led to further improvements in accuracy, indicating that not all features are necessary for effective severity classification. Additionally, feature fusion was conducted using the optimal reduced dimensions from the best-performing layer combined with varying numbers of MFCC dimensions, resulting in further gains. The highest accuracy of 70.44% was achieved by combining 32 selected dimensions from the HuBERT-large model with 21 MFCC dimensions; this fusion outperformed the HuBERT-large baseline by 6.36% and the MFCC baseline by 15.28% in absolute terms. Furthermore, combining the fused features with handcrafted features from the articulatory, prosodic, phonatory, and respiratory domains increased the classification accuracy to 73.53%, yielding a more robust representation for dysarthria severity classification. Probing analyses of articulatory and prosodic features supported the choice of the best-performing HuBERT layer, while the low correlation with handcrafted features highlighted their complementary contribution. Finally, comparative t-SNE visualizations further validated the effectiveness of the proposed feature fusion, demonstrating clearer class separability.
{"title":"Do all features matter? Layer-wise feature probing of self-supervised speech models for dysarthria severity classification","authors":"Paban Sapkota , Harsh Srivastava , Hemant Kumar Kathania , Shrikanth Narayanan , Sudarsana Reddy Kadiri","doi":"10.1016/j.specom.2025.103326","DOIUrl":"10.1016/j.specom.2025.103326","url":null,"abstract":"<div><div>Estimating the severity of dysarthria, a speech disorder from neurological conditions, is important in medicine. It helps with diagnosis, early detection, and personalized treatment. Significant progress has been made in leveraging SSL models as feature extractors for various classification tasks, demonstrating their effectiveness. Building on this, this paper examines whether using all features extracted from SSL models is necessary for optimal dysarthria severity classification from speech. We focused on layer-wise feature analysis of one base model, Wav2Vec2-base, and four large models, Wav2Vec2-large, HuBERT-large, Data2Vec-large, and WavLM-large, using a Convolutional Neural Network (CNN) as classifier with mel-frequency cepstral coefficients (MFCC) features as baseline. Experiments showed that the later transformer layers of the SSL models were more effective in the dysarthria severity classification, compared to the earlier layers. This is because the later transformer layers better capture articulation, and complex temporal patterns refined from the mid layers. More specifically, analysis revealed that embeddings from transformer encoder layer 23 of HuBERT-large yielded the best performance among all three models, possibly due to HuBERT’s hierarchical learning from unsupervised clustering. To further assess whether all dimensions are important, we examined the impact of varying feature dimensions. Our findings indicated that reducing the dimensions to 32 (from 1024 dimension) led to further improvements in accuracy. This indicates that not all features are necessary for effective severity classification. Additionally, feature fusion was conducted using the optimal reduced dimensions from the best-performing layer combined with varying dimensions of the MFCC features, resulting in further improvement in performance. The highest accuracy of 70.44% was achieved by combining 32 selected dimensions from the HuBERT-large model with 21 MFCC feature dimensions. The feature fusion of HuBERT-large (32) and MFCC (21) outperformed the HuBERT-large baseline by 6.36% and MFCC baseline by 15.28% in absolute. Furthermore, combining the fused features with handcrafted features from articulatory, prosodic, phonatory, and respiratory domains increased the classification accuracy to 73.53%, resulting in a more robust representation for dysarthria severity classification. Probing analyses of articulatory and prosodic features supported the choice of the best-performing HuBERT layer, while the low correlation with handcrafted features highlighted their complementary contribution. 
Finally, comparative t-SNE visualizations further validated the effectiveness of the proposed feature fusion, demonstrating clearer class separability.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"175 ","pages":"Article 103326"},"PeriodicalIF":3.0,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145474244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
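A minimal sketch of the layer-wise probing pipeline described above: extract hidden states from a HuBERT-large checkpoint, take one transformer layer, reduce it to 32 dimensions, and concatenate 21 MFCCs. The checkpoint name (facebook/hubert-large-ll60k), mean pooling over time, and PCA as the dimensionality reducer are assumptions for illustration, and the CNN classifier is omitted; this is not the authors' exact setup.

```python
# Illustrative layer-wise SSL feature extraction and MFCC fusion (assumptions:
# checkpoint name, mean pooling over time, PCA as the 1024 -> 32 reducer).
import numpy as np
import torch
import librosa
from sklearn.decomposition import PCA
from transformers import AutoFeatureExtractor, HubertModel

ckpt = "facebook/hubert-large-ll60k"          # assumed checkpoint, not necessarily the paper's
extractor = AutoFeatureExtractor.from_pretrained(ckpt)
model = HubertModel.from_pretrained(ckpt).eval()

def layer_embedding(wav: np.ndarray, sr: int, layer: int = 23) -> np.ndarray:
    inputs = extractor(wav, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the CNN front-end output; transformer layer k is hidden_states[k].
    return out.hidden_states[layer].squeeze(0).mean(dim=0).numpy()        # (1024,)

def mfcc_embedding(wav: np.ndarray, sr: int, n_mfcc: int = 21) -> np.ndarray:
    return librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=n_mfcc).mean(axis=1)  # (21,)

def fused_features(wavs: list, sr: int = 16000) -> np.ndarray:
    # Given a list of 16 kHz waveforms, build fused 32 + 21 dimensional features.
    ssl = np.stack([layer_embedding(w, sr) for w in wavs])                 # (N, 1024)
    ssl32 = PCA(n_components=32).fit_transform(ssl)                        # (N, 32)
    mfcc = np.stack([mfcc_embedding(w, sr) for w in wavs])                 # (N, 21)
    return np.concatenate([ssl32, mfcc], axis=1)                           # (N, 53)
```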
A survey of deep learning for complex speech spectrograms
Yuying Xie, Zheng-Hua Tan
Pub Date: 2025-11-01 | DOI: 10.1016/j.specom.2025.103319 | Speech Communication, vol. 175, Article 103319
Recent advancements in deep learning have significantly impacted the field of speech signal processing, particularly in the analysis and manipulation of complex spectrograms. This survey provides a comprehensive overview of the state-of-the-art techniques leveraging deep neural networks for processing complex spectrograms, which encapsulate both magnitude and phase information. We begin by introducing complex spectrograms and their associated features for various speech processing tasks. Next, we examine the key components and architectures of complex-valued neural networks, which are specifically designed to handle complex-valued data and have been applied to complex spectrogram processing. As recent studies have primarily focused on applying real-valued neural networks to complex spectrograms, we revisit these approaches and their architectural designs. We then discuss various training strategies and loss functions tailored for training neural networks to process and model complex spectrograms. The survey further examines key applications, including phase retrieval, speech enhancement, and speaker separation, where deep learning has achieved significant progress by leveraging complex spectrograms or their derived feature representations. Additionally, we examine the intersection of complex spectrograms with generative models. This survey aims to serve as a valuable resource for researchers and practitioners in the field of speech signal processing, deep learning and related fields.
{"title":"A survey of deep learning for complex speech spectrograms","authors":"Yuying Xie, Zheng-Hua Tan","doi":"10.1016/j.specom.2025.103319","DOIUrl":"10.1016/j.specom.2025.103319","url":null,"abstract":"<div><div>Recent advancements in deep learning have significantly impacted the field of speech signal processing, particularly in the analysis and manipulation of complex spectrograms. This survey provides a comprehensive overview of the state-of-the-art techniques leveraging deep neural networks for processing complex spectrograms, which encapsulate both magnitude and phase information. We begin by introducing complex spectrograms and their associated features for various speech processing tasks. Next, we examine the key components and architectures of complex-valued neural networks, which are specifically designed to handle complex-valued data and have been applied to complex spectrogram processing. As recent studies have primarily focused on applying real-valued neural networks to complex spectrograms, we revisit these approaches and their architectural designs. We then discuss various training strategies and loss functions tailored for training neural networks to process and model complex spectrograms. The survey further examines key applications, including phase retrieval, speech enhancement, and speaker separation, where deep learning has achieved significant progress by leveraging complex spectrograms or their derived feature representations. Additionally, we examine the intersection of complex spectrograms with generative models. This survey aims to serve as a valuable resource for researchers and practitioners in the field of speech signal processing, deep learning and related fields.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"175 ","pages":"Article 103319"},"PeriodicalIF":3.0,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145474242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Categorization of patients affected with neurogenerative dysarthria among Hindi-speaking population and analyzing factors causing reduced speech intelligibility at the human-machine interface
Raj Kumar, Manoj Tripathy, Niraj Kumar, R.S. Anand
Pub Date: 2025-11-01 | DOI: 10.1016/j.specom.2025.103328 | Speech Communication, vol. 175, Article 103328
Dysarthria, a motor speech disorder resulting from neurological disease, is recognized by various indicators, such as diminished intensity, uncontrolled pitch variation, varying speech rate, and hypo- or hypernasality, among other symptoms. It presents significant hurdles for dysarthric individuals when interacting with machines driven by automatic speech recognition (ASR) systems tailored to the speech of neurologically healthy people. This research investigates the voice characteristics that reduce intelligibility in human-machine interaction by examining the behaviour of an ASR system across varying degrees of dysarthria. The work presents a pilot study of dysarthria in the Hindi-speaking population by compiling a Hindi corpus, which is analysed for the distinct voice attributes present in dysarthric speech, focusing on parameters such as pitch perturbation, amplitude perturbation, articulation rate, and pause and phoneme rate derived from sustained phonation and continuous speech captured with a conventional close-talk microphone and a throat microphone. The dataset includes recordings from sixty participants with neurological conditions, each providing thirty sentences. Participants are categorized into four intelligibility groups for analysis with the Google Cloud Speech-to-Text system. The phonation analysis reveals greater disturbances in pitch and intensity variation as intelligibility decreases. Additionally, a sentence-level analysis explores the influence of inter-word pauses and word complexity across the intelligibility groups; the results show that individuals with severe dysarthria tend to speak more slowly and to misarticulate longer words. The study provides numerical ranges for pitch, amplitude, and time perturbation, which will be helpful for researchers developing dysarthric speech recognition (DSR) systems that use data augmentation to generate synthetic dysarthric data and mitigate data scarcity. The work also establishes a relationship between word complexity and intelligibility, which will support speech pathologists in designing customized speech training programs to improve intelligibility for individuals with dysarthria.
{"title":"Categorization of patients affected with neurogenerative dysarthria among Hindi-speaking population and analyzing factors causing reduced speech intelligibility at the human-machine interface","authors":"Raj Kumar , Manoj Tripathy , Niraj Kumar , R.S. Anand","doi":"10.1016/j.specom.2025.103328","DOIUrl":"10.1016/j.specom.2025.103328","url":null,"abstract":"<div><div>Dysarthria, a vocal movement disorder resulting from neurological disease, is recognized by various indicators, such as diminished intensity, uncontrolled pitch variations, varying speech rate, and hypo/hypernasality, among other symptoms. It presents significant hurdles for dysarthric individuals when interfacing with machines operated by automatic speech recognition (ASR) systems tailored for the speech of neurologically healthy people. This research delves into the voice characteristics contributing to decreased intelligibility within human-machine interaction by investigating the behaviour of ASR systems with varying degrees of dysarthria. The work presents a pilot study of dysarthria in Hindi-speaking population by compiling a Hindi corpus. The corpus scrutinizes the distinct voice attributes present in dysarthric speech, focusing on parameters like pitch perturbation, amplitude perturbation, articulation rate, pause and phoneme rate derived from sustained phonation and continuous speech data captured using a conventional close-talk and throat microphone. The speech dataset includes recordings from sixty participants with neurological conditions, each providing thirty sentences. Participants are categorized into four intelligibility groups for analysis using the Google Cloud Speech to Text conversion system. The phonation analysis reveals greater disturbances in pitch and intensity variation as intelligibility decreases. Additionally, a sentence-level analysis was conducted to explore the influence of inter-word pauses and word complexity across different intelligibility groups. The results show that individuals with severe dysarthria tend to speak more slowly and misarticulate longer words. The study provides numerical ranges for pitch, amplitude, and time perturbation, which will be helpful for researchers working in the field of DSR system development, which utilizes data augmentation to generate synthetic dysarthric data to mitigate data scarcity. The work establishes a relationship between word complexity and intelligibility, which will support speech pathologists in designing customized speech training programs to improve intelligibility for individuals with dysarthria.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"175 ","pages":"Article 103328"},"PeriodicalIF":3.0,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145528715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Noise-robust feature extraction for keyword spotting based on supervised adversarial domain adaptation training strategies
Yongqiang Chen, Qianhua He, Zunxian Liu, Mingru Yang, Wenwu Wang
Pub Date: 2025-11-01 | DOI: 10.1016/j.specom.2025.103323 | Speech Communication, vol. 175, Article 103323
Keyword spotting (KWS) suffers from the domain shift between training and testing in practical, complex conditions. To improve the robustness of KWS models in noisy environments, this paper proposes a novel domain-invariant feature extraction strategy called supervised probabilistic multi-domain adversarial training (SPMDAT). Based on supervised adversarial domain adaptation (SADA), SPMDAT makes better use of differently distributed (multi-condition) data by using a class-wise domain discriminator to estimate the domain index probability distribution. Experimental results on three different deep networks showed that SPMDAT improves KWS performance, compared to the multi-condition training (MCT) strategy, in three noisy situations: seen noise, unseen noise, and seen noise at ultra-low signal-to-noise ratio (SNR) levels. In particular, for KWT-1, the average relative improvements are 9.63%, 10.83%, and 28.16%, respectively. SPMDAT also achieves better results in the three test situations than two other SADA strategies adapted from unsupervised domain adaptation (UDA) methods. Since the three strategies are used only during training, all the improvements are achieved without increasing the computational complexity of the inference models. In addition, to better understand the practicability of the SADA-based strategies, experiments assess the impact of model size on performance. The results show that models with approximately 69K parameters already achieve performance improvements over MCT, suggesting the effectiveness of the strategies for small-footprint KWS models.
{"title":"Noise-robust feature extraction for keyword spotting based on supervised adversarial domain adaptation training strategies","authors":"Yongqiang Chen , Qianhua He , Zunxian Liu , Mingru Yang , Wenwu Wang","doi":"10.1016/j.specom.2025.103323","DOIUrl":"10.1016/j.specom.2025.103323","url":null,"abstract":"<div><div>Keyword spotting (KWS) suffers from the domain shift between training and testing in practical complex situations. To improve the robustness of KWS models in noisy environments, this paper proposes a novel domain-invariant feature extraction strategy called supervised probabilistic multi-domain adversarial training (SPMDAT). Based on supervised adversarial domain adaptation (SADA), SPMDAT makes better use of differently distributed data (multi-condition data) by using a class-wise domain discriminator to estimate the domain index probability distribution. Experimental results on three different deep networks showed that the SPMDAT could improve KWS performances for three noisy situations: seen noise, unseen noise, and seen noise with ultra-low signal-to-noise ratio (SNR) levels, compared to the multi-condition training (MCT) strategy. Especially, for KWT-1, the average relative improvements are 9.63%, 10.83%, and 28.16%, respectively. SPMDAT also achieves better results in the three test situations than the other two SADA strategies adapted from unsupervised domain adaptation (UDA) methods. Since the three strategies are only used in the training process, all the improvements are achieved without increasing the computational complexity of the inference models. In addition, to better understand the practicability of the SADA-based strategies, experiments are conducted to assess the impact of model parameters on the performance. The results show that models with approximately 69 K parameters already achieve performance improvements over MCT, suggesting the effectiveness of the strategies for small-footprint KWS models.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"175 ","pages":"Article 103323"},"PeriodicalIF":3.0,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145418180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}