
Latest Publications in Speech Communication

Effect of individual characteristics on impressions of one’s own recorded voice
IF 3 | Tier 3, Computer Science | Q2 ACOUSTICS | Pub Date: 2025-11-27 | DOI: 10.1016/j.specom.2025.103335
Hikaru Yanagida, Yusuke Ijima, Naohiro Tawara
This study aims to identify individual characteristics such as age, gender, personality traits, and values that influence the perception of one’s own recorded voice. While previous studies have shown that the perception of one’s own recorded voice is different from that of others, and that these differences are influenced by individual characteristics, only a limited number of individual characteristics were examined in past research. In our study, we conducted a large-scale subjective experiment with 141 Japanese participants using multiple individual characteristics. Participants evaluated impressions of their own recorded voices and the voices of others, and we analyzed the relationship between each of the individual characteristics and the voice impressions. Our findings showed that individual characteristics such as the frequency of listening to one’s own recorded voice (which had not been examined in previous studies) influenced the perception of one’s own recorded voice. We further analyzed combinations of multiple individual characteristics, including those that influenced impressions when used individually, to predict impressions of one’s own recorded voice, and found that impressions were better predicted by a combination of multiple individual characteristics than by a single individual characteristic.
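As an illustration of the prediction setup described in the abstract, the sketch below compares a single individual characteristic with a combination of characteristics as predictors of an impression rating. It is a minimal, assumption-laden example using synthetic data and generic regression, not the authors' analysis; the feature set, model, and effect sizes are invented for illustration.

```python
# Minimal sketch (not the authors' code): predicting self-voice impression ratings
# from individual characteristics, comparing a single predictor with a combination.
# The features and the synthetic data below are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 141  # number of participants reported in the abstract

# Hypothetical characteristics: age, listening frequency, one personality trait.
X = rng.normal(size=(n, 3))
# Hypothetical impression rating driven by a combination of characteristics plus noise.
y = 0.5 * X[:, 0] + 0.8 * X[:, 1] - 0.3 * X[:, 2] + rng.normal(scale=0.5, size=n)

single = cross_val_score(LinearRegression(), X[:, [1]], y, cv=5, scoring="r2").mean()
combined = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2").mean()
print(f"single characteristic R^2: {single:.2f}, combined characteristics R^2: {combined:.2f}")
```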
Citations: 0
Self-Supervised Learning for Speaker Recognition: A study and review
IF 3 | Tier 3, Computer Science | Q2 ACOUSTICS | Pub Date: 2025-11-24 | DOI: 10.1016/j.specom.2025.103333
Theo Lepage, Reda Dehak
Deep learning models trained in a supervised setting have revolutionized audio and speech processing. However, their performance inherently depends on the quantity of human-annotated data, making them costly to scale and prone to poor generalization under unseen conditions. To address these challenges, Self-Supervised Learning (SSL) has emerged as a promising paradigm, leveraging vast amounts of unlabeled data to learn relevant representations. The application of SSL for Automatic Speech Recognition (ASR) has been extensively studied, but research on other downstream tasks, notably Speaker Recognition (SR), remains in its early stages. This work describes major SSL instance-invariance frameworks (e.g., SimCLR, MoCo, and DINO), initially developed for computer vision, along with their adaptation to SR. Various SSL methods for SR, proposed in the literature and built upon these frameworks, are also presented. An extensive review of these approaches is then conducted: (1) the effect of the main hyperparameters of SSL frameworks is investigated; (2) the role of SSL components is studied (e.g., data-augmentation, projector, positive sampling); and (3) SSL frameworks are evaluated on SR with in-domain and out-of-domain data, using a consistent experimental setup, and a comprehensive comparison of SSL methods from the literature is provided. Specifically, DINO achieves the best downstream performance and effectively models intra-speaker variability, although it is highly sensitive to hyperparameters and training conditions, while SimCLR and MoCo provide robust alternatives that effectively capture inter-speaker variability and are less prone to collapse. This work aims to highlight recent trends and advancements, identifying current challenges in the field.
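As a concrete illustration of the instance-invariance frameworks the review covers, the sketch below implements a SimCLR-style NT-Xent contrastive loss over speaker embeddings of two augmented views of the same utterances. It is a minimal sketch of the general technique, not the paper's code; the batch size, embedding dimension, and temperature are illustrative assumptions.

```python
# Minimal NT-Xent (SimCLR-style) contrastive loss as adapted to speaker
# representation learning: two augmented segments of the same utterance form a
# positive pair, all other segments in the batch act as negatives.
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """z1, z2: (batch, dim) speaker embeddings of two augmented views."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)            # (2B, dim)
    sim = z @ z.t() / temperature             # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))         # exclude self-similarities
    batch = z1.size(0)
    # positives: view i of z1 matches view i of z2 (offset by batch) and vice versa
    targets = torch.cat([torch.arange(batch) + batch, torch.arange(batch)])
    return F.cross_entropy(sim, targets)

loss = nt_xent(torch.randn(8, 192), torch.randn(8, 192))  # toy embeddings
```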
Citations: 0
Adaptive weighting in a transformer framework for multimodal emotion recognition
IF 3 | Tier 3, Computer Science | Q2 ACOUSTICS | Pub Date: 2025-11-24 | DOI: 10.1016/j.specom.2025.103332
Weijie Lu, Yunfeng Xu, Jintan Gu
Multimodal Dialogue Emotion Recognition is rapidly emerging as a research hotspot with broad application prospects. In recent years, researchers have devoted considerable effort to integrating feature information across modalities, but detailed analysis of each modality’s feature information is still insufficient, and the differences in how each modality’s features influence the recognition results have not been fully considered. To address this problem, we propose a Transformer-based multimodal interaction model with an adaptive weighted fusion mechanism (TIAWFM). The model effectively captures deep inter-modal correlations in multimodal emotion recognition tasks, mitigating the limitations of unimodal representations. We observe that incorporating specific conversational contexts and dynamically allocating weights to each modality not only fully leverages the model’s capabilities but also enables more accurate capture of the emotional information embedded in the features. We conducted extensive experiments on two benchmark multimodal datasets, IEMOCAP and MELD. Experimental results demonstrate that TIAWFM exhibits significant advantages in dynamically integrating multimodal information, leading to notable improvements in both the accuracy and robustness of emotion recognition.
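One common way to realize the adaptive modality weighting described above is a small gating network whose softmax-normalized scores weight each modality's embedding before fusion. The sketch below shows that generic mechanism under assumed shapes; it is not claimed to reproduce the TIAWFM architecture.

```python
# Illustrative adaptive modality weighting (a generic gating scheme, not
# necessarily the TIAWFM mechanism): score each modality embedding, normalize
# the scores with softmax, and take the weighted sum as the fused representation.
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # shared scorer applied to each modality

    def forward(self, modality_feats: torch.Tensor) -> torch.Tensor:
        """modality_feats: (batch, n_modalities, dim), e.g. text, audio, visual."""
        weights = torch.softmax(self.score(modality_feats), dim=1)  # (B, M, 1)
        return (weights * modality_feats).sum(dim=1)                # (B, dim)

fused = AdaptiveFusion(dim=256)(torch.randn(4, 3, 256))  # -> (4, 256)
```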
Citations: 0
Towards unsupervised speech recognition without pronunciation models
IF 3 | Tier 3, Computer Science | Q2 ACOUSTICS | Pub Date: 2025-11-15 | DOI: 10.1016/j.specom.2025.103330
Junrui Ni, Liming Wang, Yang Zhang, Kaizhi Qian, Heting Gao, Mark Hasegawa-Johnson, James Glass, Chang D. Yoo
Recent advancements in supervised automatic speech recognition (ASR) have achieved remarkable performance, largely due to the growing availability of large transcribed speech corpora. However, most languages lack sufficient paired speech and text data to effectively train these systems. In this article, we tackle the challenge of developing ASR systems without paired speech and text corpora by proposing the removal of reliance on a phoneme lexicon. We explore a new research direction: word-level unsupervised ASR, and experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling. Using a curated speech corpus containing a fixed number of English words, our system iteratively refines the word segmentation structure and achieves a word error rate of between 20%–23%, depending on the vocabulary size, without parallel transcripts, oracle word boundaries, or a pronunciation lexicon. This innovative model surpasses the performance of previous unsupervised ASR models under the lexicon-free setting.
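The sketch below illustrates the masked token-infilling objective at the core of the approach: a contiguous span of discrete tokens (speech units or characters) is replaced by a mask token, and the loss is computed only on the masked positions. The vocabulary size, mask span, and the tiny stand-in model are assumptions, not the authors' configuration.

```python
# Minimal sketch of masked token-infilling over discrete token sequences.
# A span is masked and the model is trained to reconstruct only that span.
import torch
import torch.nn as nn

vocab, mask_id, seq_len, batch = 100, 0, 20, 4
tokens = torch.randint(1, vocab, (batch, seq_len))   # toy discrete speech/text tokens

mask = torch.zeros(batch, seq_len, dtype=torch.bool)
mask[:, 8:12] = True                                 # mask a contiguous span
inputs = tokens.masked_fill(mask, mask_id)           # replace the span with a mask token

model = nn.Sequential(nn.Embedding(vocab, 64), nn.Linear(64, vocab))  # stand-in infilling model
logits = model(inputs)                               # (batch, seq_len, vocab)
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])  # loss on masked positions only
```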
Citations: 0
The discriminative capacity of English segments in forensic speaker comparison
IF 3 | Tier 3, Computer Science | Q2 ACOUSTICS | Pub Date: 2025-11-11 | DOI: 10.1016/j.specom.2025.103329
Paul Foulkes, Vincent Hughes, Kayleigh Peters, Jasmine Rouse
This study compares the relative performance of formant- and MFCC-based analyses of the same dataset, extending the work of Franco-Pedroso and Gonzalez-Rodriguez (2016). Using a corpus of read speech from 24 male English speakers we extracted vowel formant data and segmentally-based MFCCs of all phonemes. Data were taken from three versions of the corpus: 10 min and 3 min samples with wholly automated segment labelling and data extraction (the 10U and 3U datasets), and 3 min samples with manually corrected segment labelling and manual checking of formant tracking (the 3C dataset). The datasets were split in half and used for nine speaker discrimination tests: six tests using formants or MFCCs in each of the 10U, 3U and 3C datasets, and three fused systems combining formants and MFCCs for each dataset.
The formant-based tests revealed that the best performing segments were /ɪ/, /eɪ/, /aɪ/, /e/, /ʌ/ and /əː/. These vowels also performed well in MFCC-based tests, along with the three nasal consonants /m, n, ŋ/ and /k/. Relatively similar patterns were found for the three datasets. There was also a correlation with segment frequency: more frequent phonemes generally yielded better results. In addition, formant-based measures gave better EER and Cllr values than segmentally-based MFCCs. For formants, the best results came from the 10U dataset, while for MFCCs the best results came from the manually corrected 3C dataset. The effect of manual correction was starkest for consonants. Finally, the fused systems performed very well, with both formant- and MFCC-based systems producing EERs close to 0 in some cases. The best systems were those using the 3C dataset. The fused 10U system generally produced notably weaker LLRs, presumably because of the inevitably larger number of data labelling errors.
While the study is not forensically realistic, it has a number of implications for forensic speaker comparison. First, the best performing segments are those vowels in which formant separation is clear, and consonants (nasals) with formant structure. Second, manual correction of data is beneficial, especially for consonants. MFCCs are high dimensional data relative to vowel formants taken at a segment’s midpoint. Misalignment of automated labelling and tracking is thus potentially more likely to have a deleterious effect on MFCCs. While the 10U dataset yielded the best scores for vowel formants, there is a danger that it overestimates the discriminatory power of those segments. A degree of manual correction is therefore worthwhile. Finally, although MFCC data yielded worse scores on a segment by segment basis, the fused system worked very well. Further research is therefore merited on MFCC-based analysis of segments as variables in speaker comparison, and more broadly in phonetic research.
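For readers unfamiliar with the Cllr metric reported above, the sketch below computes it from natural-log likelihood ratios for same-speaker and different-speaker trials; lower values indicate better-calibrated evidence. It is a generic implementation of the standard definition, not the authors' scripts, and the simulated scores are illustrative.

```python
# Minimal Cllr (log-likelihood-ratio cost) computation from natural-log LLRs.
import numpy as np

def cllr(llr_same: np.ndarray, llr_diff: np.ndarray) -> float:
    """llr_same: LLRs for same-speaker trials; llr_diff: LLRs for different-speaker trials."""
    c_same = np.mean(np.log2(1.0 + np.exp(-llr_same)))  # penalize weak/negative same-speaker LLRs
    c_diff = np.mean(np.log2(1.0 + np.exp(llr_diff)))   # penalize strong different-speaker LLRs
    return 0.5 * (c_same + c_diff)

# A well-calibrated system pushes same-speaker LLRs positive and different-speaker LLRs negative.
rng = np.random.default_rng(1)
print(cllr(rng.normal(2.0, 1.0, 500), rng.normal(-2.0, 1.0, 500)))  # well below 1.0
```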
Citations: 0
Ultrasound imaging in second language research: Systematic review and thematic analysis
IF 3 | Tier 3, Computer Science | Q2 ACOUSTICS | Pub Date: 2025-11-01 | DOI: 10.1016/j.specom.2025.103324
Eija M.A. Aalto, Hana Ben Asker, Lucie Ménard, Walcir Cardoso, Catherine Laporte
Several publications have explored second language (L2) articulation through lingual ultrasound imaging technology. This systematic review and thematic analysis collate and evaluate these studies, focusing on methodologies, experimental setups, and findings. The review includes 31 works: 23 on ultrasound biofeedback and 8 on characterizing L2 articulation. English is the predominant language studied (82 % as L1 or L2), with participants mainly young adults (2–60 participants per study). The 23 ultrasound biofeedback studies showed significant variation in session numbers and length, including 16 PICO studies (i.e. study design with participants, intervention, controls/comparison group, outcome) where ultrasound biofeedback was compared to auditory feedback and/or control conditions. Data analysis in the biofeedback studies often included acoustic or perceptual assessments in addition to or instead of ultrasound data analysis. Analysis of the results indicates that ultrasound biofeedback is effective for improving L2 articulation. However, the PICO studies revealed that while ultrasound biofeedback may offer certain advantages, these findings remain preliminary and warrant further investigation. Learner characteristics and target selection may affect biofeedback efficacy. Ultrasound also proved valuable for characterizing L2 articulation by showing articulatory and coarticulatory patterns, particularly in English sounds like /ɹ/, /l/, and various vowels. L2 characterization studies frequently used dynamic speech movement analysis. Moving forward, researchers are encouraged to also use dynamic movement analysis in biofeedback studies to deepen understanding of articulation processes. Expanding linguistic and demographic diversity in future research is essential to capturing language heterogeneity.
Citations: 0
Do all features matter? Layer-wise feature probing of self-supervised speech models for dysarthria severity classification
IF 3 | Tier 3, Computer Science | Q2 ACOUSTICS | Pub Date: 2025-11-01 | DOI: 10.1016/j.specom.2025.103326
Paban Sapkota, Harsh Srivastava, Hemant Kumar Kathania, Shrikanth Narayanan, Sudarsana Reddy Kadiri
Estimating the severity of dysarthria, a speech disorder from neurological conditions, is important in medicine. It helps with diagnosis, early detection, and personalized treatment. Significant progress has been made in leveraging self-supervised learning (SSL) models as feature extractors for various classification tasks, demonstrating their effectiveness. Building on this, this paper examines whether using all features extracted from SSL models is necessary for optimal dysarthria severity classification from speech. We focused on layer-wise feature analysis of one base model, Wav2Vec2-base, and four large models, Wav2Vec2-large, HuBERT-large, Data2Vec-large, and WavLM-large, using a Convolutional Neural Network (CNN) as the classifier, with mel-frequency cepstral coefficient (MFCC) features as the baseline. Experiments showed that the later transformer layers of the SSL models were more effective for dysarthria severity classification than the earlier layers. This is because the later transformer layers better capture articulation and the complex temporal patterns refined from the middle layers. More specifically, analysis revealed that embeddings from transformer encoder layer 23 of HuBERT-large yielded the best performance among all the models examined, possibly due to HuBERT’s hierarchical learning from unsupervised clustering. To further assess whether all dimensions are important, we examined the impact of varying feature dimensions. Our findings indicated that reducing the dimensions to 32 (from 1024 dimensions) led to further improvements in accuracy. This indicates that not all features are necessary for effective severity classification. Additionally, feature fusion was conducted using the optimal reduced dimensions from the best-performing layer combined with varying dimensions of the MFCC features, resulting in further improvement in performance. The highest accuracy of 70.44% was achieved by combining 32 selected dimensions from the HuBERT-large model with 21 MFCC feature dimensions. The feature fusion of HuBERT-large (32) and MFCC (21) outperformed the HuBERT-large baseline by 6.36% and the MFCC baseline by 15.28% in absolute terms. Furthermore, combining the fused features with handcrafted features from the articulatory, prosodic, phonatory, and respiratory domains increased the classification accuracy to 73.53%, resulting in a more robust representation for dysarthria severity classification. Probing analyses of articulatory and prosodic features supported the choice of the best-performing HuBERT layer, while the low correlation with handcrafted features highlighted their complementary contribution. Finally, comparative t-SNE visualizations further validated the effectiveness of the proposed feature fusion, demonstrating clearer class separability.
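A minimal version of the layer-wise probing recipe described above might look like the sketch below: mean-pooled per-layer embeddings are reduced to 32 dimensions and scored with a simple cross-validated classifier. The random placeholder features, the logistic-regression probe (the paper uses a CNN classifier), and the label scheme are assumptions for illustration only.

```python
# Layer-wise probing sketch under assumptions: `layer_embeddings` stands in for
# mean-pooled per-layer outputs of a "large" SSL encoder (25 layers x 1024 dims);
# real features would come from the pretrained model, not random data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_utts, n_layers, dim = 200, 25, 1024
labels = rng.integers(0, 4, n_utts)  # four assumed severity classes

layer_embeddings = rng.normal(size=(n_layers, n_utts, dim))  # placeholder features

for layer in range(n_layers):
    x = PCA(n_components=32).fit_transform(layer_embeddings[layer])  # 1024 -> 32 dims
    acc = cross_val_score(LogisticRegression(max_iter=1000), x, labels, cv=5).mean()
    print(f"layer {layer:02d}: accuracy {acc:.2f}")
```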
构音障碍是一种由神经系统疾病引起的语言障碍,估计其严重程度在医学上很重要。它有助于诊断、早期发现和个性化治疗。在利用SSL模型作为各种分类任务的特征提取器方面已经取得了重大进展,证明了它们的有效性。在此基础上,本文研究了是否需要使用从SSL模型中提取的所有特征来从语音中进行最佳构音障碍严重程度分类。我们专注于一个基本模型,Wav2Vec2-base,以及四个大型模型,Wav2Vec2-large, HuBERT-large, data2vec2 -large和WavLM-large的分层特征分析,使用卷积神经网络(CNN)作为分类器,以melf -frequency cepstral系数(MFCC)特征为基线。实验表明,与较早的层相比,SSL模型的后变压器层在构音障碍严重程度分类中更有效。这是因为后面的转换层更好地捕获了衔接,以及从中间层提炼出来的复杂时间模式。更具体地说,分析显示HuBERT-large的变压器编码器第23层的嵌入在所有三个模型中产生了最好的性能,这可能是由于HuBERT从无监督聚类中分层学习。为了进一步评估是否所有维度都很重要,我们检查了不同特征维度的影响。我们的研究结果表明,将维度减少到32(从1024个维度)可以进一步提高准确性。这表明并不是所有的特征都是有效的严重性分类所必需的。此外,将性能最佳层的最优降维与MFCC特征的不同维度相结合,进行特征融合,进一步提高了性能。将HuBERT-large模型中选择的32个维度与21个MFCC特征维度相结合,达到了70.44%的最高准确率。HuBERT-large(32)和MFCC(21)的特征融合绝对优于HuBERT-large基线的6.36%和MFCC基线的15.28%。此外,将融合特征与来自发音、韵律、语音和呼吸域的手工特征相结合,将分类准确率提高到73.53%,从而更稳健地表示构音障碍严重程度分类。对发音和韵律特征的探索性分析支持选择表现最好的休伯特层,而与手工特征的低相关性突出了它们的互补贡献。最后,对比t-SNE可视化进一步验证了所提出的特征融合的有效性,展示了更清晰的类可分性。
Citations: 0
A survey of deep learning for complex speech spectrograms
IF 3 | Tier 3, Computer Science | Q2 ACOUSTICS | Pub Date: 2025-11-01 | DOI: 10.1016/j.specom.2025.103319
Yuying Xie, Zheng-Hua Tan
Recent advancements in deep learning have significantly impacted the field of speech signal processing, particularly in the analysis and manipulation of complex spectrograms. This survey provides a comprehensive overview of the state-of-the-art techniques leveraging deep neural networks for processing complex spectrograms, which encapsulate both magnitude and phase information. We begin by introducing complex spectrograms and their associated features for various speech processing tasks. Next, we examine the key components and architectures of complex-valued neural networks, which are specifically designed to handle complex-valued data and have been applied to complex spectrogram processing. As recent studies have primarily focused on applying real-valued neural networks to complex spectrograms, we revisit these approaches and their architectural designs. We then discuss various training strategies and loss functions tailored for training neural networks to process and model complex spectrograms. The survey further examines key applications, including phase retrieval, speech enhancement, and speaker separation, where deep learning has achieved significant progress by leveraging complex spectrograms or their derived feature representations. Additionally, we examine the intersection of complex spectrograms with generative models. This survey aims to serve as a valuable resource for researchers and practitioners in the field of speech signal processing, deep learning and related fields.
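As background for the survey, the sketch below computes a complex spectrogram with a short-time Fourier transform and splits it into the magnitude/phase and real/imaginary representations that the reviewed networks operate on. The toy signal and STFT settings are illustrative.

```python
# Minimal complex-spectrogram example: the STFT returns complex values whose
# magnitude/phase or real/imaginary parts feed complex-spectrogram models.
import numpy as np
from scipy.signal import stft

fs = 16000
t = np.arange(0, 1.0, 1 / fs)
x = np.sin(2 * np.pi * 220 * t) + 0.1 * np.random.randn(t.size)  # toy "speech" signal

_, _, Z = stft(x, fs=fs, nperseg=400, noverlap=240)  # Z is complex: (freq bins, frames)
magnitude, phase = np.abs(Z), np.angle(Z)
real, imag = Z.real, Z.imag   # two input channels used by many complex-valued models
print(Z.shape, Z.dtype)       # e.g. (201, frames), complex128
```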
Citations: 0
Categorization of patients affected with neurogenerative dysarthria among Hindi-speaking population and analyzing factors causing reduced speech intelligibility at the human-machine interface
IF 3 | Tier 3, Computer Science | Q2 ACOUSTICS | Pub Date: 2025-11-01 | DOI: 10.1016/j.specom.2025.103328
Raj Kumar, Manoj Tripathy, Niraj Kumar, R.S. Anand
Dysarthria, a vocal movement disorder resulting from neurological disease, is recognized by various indicators, such as diminished intensity, uncontrolled pitch variations, varying speech rate, and hypo/hypernasality, among other symptoms. It presents significant hurdles for dysarthric individuals when interfacing with machines operated by automatic speech recognition (ASR) systems tailored for the speech of neurologically healthy people. This research delves into the voice characteristics contributing to decreased intelligibility within human-machine interaction by investigating the behaviour of ASR systems with varying degrees of dysarthria. The work presents a pilot study of dysarthria in the Hindi-speaking population by compiling a Hindi corpus. The analysis scrutinizes the distinct voice attributes present in dysarthric speech, focusing on parameters like pitch perturbation, amplitude perturbation, articulation rate, pause and phoneme rate derived from sustained phonation and continuous speech data captured using a conventional close-talk and throat microphone. The speech dataset includes recordings from sixty participants with neurological conditions, each providing thirty sentences. Participants are categorized into four intelligibility groups for analysis using the Google Cloud Speech-to-Text conversion system. The phonation analysis reveals greater disturbances in pitch and intensity variation as intelligibility decreases. Additionally, a sentence-level analysis was conducted to explore the influence of inter-word pauses and word complexity across the different intelligibility groups. The results show that individuals with severe dysarthria tend to speak more slowly and misarticulate longer words. The study provides numerical ranges for pitch, amplitude, and time perturbation, which will be helpful for researchers working on dysarthric speech recognition (DSR) system development, where data augmentation is used to generate synthetic dysarthric data to mitigate data scarcity. The work establishes a relationship between word complexity and intelligibility, which will support speech pathologists in designing customized speech training programs to improve intelligibility for individuals with dysarthria.
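The sketch below shows one standard way to compute local pitch and amplitude perturbation (jitter and shimmer) of the kind analyzed in the study, from per-cycle period and peak-amplitude sequences. It is an assumption-laden toy example, not the study's pipeline or thresholds.

```python
# Minimal local jitter/shimmer sketch: mean absolute difference of consecutive
# per-cycle values, expressed relative to the mean (in %).
import numpy as np

def local_perturbation(values: np.ndarray) -> float:
    """Mean absolute cycle-to-cycle difference relative to the mean, in %."""
    return 100.0 * np.mean(np.abs(np.diff(values))) / np.mean(values)

periods = 1.0 / np.random.normal(120.0, 2.0, 200)   # toy glottal periods (s), ~120 Hz voice
amplitudes = np.random.normal(0.5, 0.02, 200)       # toy per-cycle peak amplitudes

print(f"jitter (local): {local_perturbation(periods):.2f} %")
print(f"shimmer (local): {local_perturbation(amplitudes):.2f} %")
```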
Citations: 0
Noise-robust feature extraction for keyword spotting based on supervised adversarial domain adaptation training strategies
IF 3 | Tier 3, Computer Science | Q2 ACOUSTICS | Pub Date: 2025-11-01 | DOI: 10.1016/j.specom.2025.103323
Yongqiang Chen, Qianhua He, Zunxian Liu, Mingru Yang, Wenwu Wang
Keyword spotting (KWS) suffers from the domain shift between training and testing in practical complex situations. To improve the robustness of KWS models in noisy environments, this paper proposes a novel domain-invariant feature extraction strategy called supervised probabilistic multi-domain adversarial training (SPMDAT). Based on supervised adversarial domain adaptation (SADA), SPMDAT makes better use of differently distributed data (multi-condition data) by using a class-wise domain discriminator to estimate the domain index probability distribution. Experimental results on three different deep networks showed that the SPMDAT could improve KWS performances for three noisy situations: seen noise, unseen noise, and seen noise with ultra-low signal-to-noise ratio (SNR) levels, compared to the multi-condition training (MCT) strategy. In particular, for KWT-1, the average relative improvements are 9.63%, 10.83%, and 28.16%, respectively. SPMDAT also achieves better results in the three test situations than the other two SADA strategies adapted from unsupervised domain adaptation (UDA) methods. Since the three strategies are only used in the training process, all the improvements are achieved without increasing the computational complexity of the inference models. In addition, to better understand the practicability of the SADA-based strategies, experiments are conducted to assess the impact of model parameters on the performance. The results show that models with approximately 69 K parameters already achieve performance improvements over MCT, suggesting the effectiveness of the strategies for small-footprint KWS models.
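The gradient-reversal layer is the usual building block behind adversarial domain adaptation strategies like the one described above; SPMDAT's class-wise, probabilistic domain discriminator would sit on top of a component of this kind. The sketch below shows the generic mechanism in PyTorch, written from the general technique rather than the paper's code; the feature size and the five assumed noise domains are illustrative.

```python
# Generic gradient-reversal layer + domain discriminator (DANN-style sketch).
import torch
from torch.autograd import Function

class GradReverse(Function):
    @staticmethod
    def forward(ctx, x, lam: float):
        ctx.lam = lam
        return x.view_as(x)                      # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None      # reverse (and scale) the gradient

features = torch.randn(8, 64, requires_grad=True)          # stand-in KWS features
domain_logits = torch.nn.Linear(64, 5)(GradReverse.apply(features, 1.0))  # 5 assumed noise domains
loss = torch.nn.functional.cross_entropy(domain_logits, torch.randint(0, 5, (8,)))
loss.backward()  # the feature extractor receives reversed domain-classification gradients
```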
Citations: 0