
Computer Speech and Language: Latest Publications

Mispronunciation detection and diagnosis based on large language models
IF 3.4 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-10 | DOI: 10.1016/j.csl.2026.101942
Yanlu Xie, Huihang Zhong, Xuhui Lan, Wenwei Dong
This study explores the potential application of Large Language Models (LLMs) in Mispronunciation Detection and Diagnosis (MDD) systems, which cover pronunciation error detection, feedback, and diagnosis. Accurate detection of incorrect pronunciation, along with comprehensive and effective diagnosis, is key to guiding learners in corrective exercises. Traditional MDD research requires collecting data for specific tasks and training models similar to those used in speech recognition. Moreover, most previous research focuses on identifying the types of errors rather than providing specific pronunciation guidance, so the textual feedback on pronunciation errors remains relatively limited and not sufficiently rich. Recent breakthroughs in LLMs have created new opportunities for pronunciation learning through their ability to generate fluent, educationally valuable feedback, such as explaining error types, demonstrating correct pronunciation, and providing personalized practice guidance. This study explores the potential of multimodal speech models in end-to-end pronunciation error detection and feedback generation. Our experimental results show that comprehensive fine-tuning of the Whisper model using second language (L2) speech data can improve its ability to model L2 speech, thereby increasing the accuracy of mispronunciation detection. The feedback text generated by this model is comparable in quality to the current state-of-the-art (SOTA) LLM-based systems (G-Score of 0.52, compared to the SOTA's 0.54). In addition, this study proposes a pronunciation error feedback method based on pronunciation attribute features using LLMs. The LLMs effectively improve the accuracy and effectiveness of the feedback text by analyzing the pronunciation attribute features at the positions of incorrect phonemes. Evaluation of the LLM feedback by L2 learners indicates a significant improvement in comprehensibility and helpfulness when these pronunciation representations are used. These results confirm the potential of articulatory feature engineering and strategic model optimization in computer-assisted pronunciation training (CAPT) systems, enhancing learner engagement while reducing instructor workload.
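The abstract does not include the prompt format used for attribute-based feedback, so the following is only a minimal illustration of the general idea: detected phoneme substitutions and their articulatory attributes (place, manner, voicing) are rendered into a text prompt for an LLM. The attribute table, function name, and wording are hypothetical, not taken from the paper.

```python
# Hypothetical sketch: turning a detected phoneme substitution and its
# articulatory attributes into an LLM feedback prompt. The attribute table
# and phrasing are illustrative, not the paper's actual prompt.

ATTRIBUTES = {
    # phoneme -> (place, manner, voicing); a tiny illustrative subset
    "th": ("dental", "fricative", "voiceless"),
    "s":  ("alveolar", "fricative", "voiceless"),
    "l":  ("alveolar", "lateral approximant", "voiced"),
    "n":  ("alveolar", "nasal", "voiced"),
}

def build_feedback_prompt(word, target_phone, produced_phone):
    """Compose a prompt asking an LLM to explain an L2 pronunciation error."""
    tgt = ATTRIBUTES.get(target_phone, ("unknown", "unknown", "unknown"))
    got = ATTRIBUTES.get(produced_phone, ("unknown", "unknown", "unknown"))
    return (
        f"The learner said '{word}' but replaced /{target_phone}/ with /{produced_phone}/.\n"
        f"Target articulation: place={tgt[0]}, manner={tgt[1]}, voicing={tgt[2]}.\n"
        f"Produced articulation: place={got[0]}, manner={got[1]}, voicing={got[2]}.\n"
        "Explain the difference and give one short corrective exercise."
    )

print(build_feedback_prompt("think", "th", "s"))
```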
Citations: 0
Pitch-Aware multi-feature fusion for classifying statements, questions, and exclamations in low-resource languages
IF 3.4 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-07 | DOI: 10.1016/j.csl.2026.101941
Ayub Othman Abdulrahman
Automatic classification of statements, questions, and exclamations is important for dialogue systems, speech analytics, language documentation, and other human-computer interaction tasks. Speech pitch and prosody are central cues for these categories, but pitch-based classification remains challenging due to speaker variability, recording conditions, and overlapping prosodic patterns across classes, especially in low-resource settings. We present an innovative multi-feature fusion architecture that combines pretrained wav2Vec 2.0 raw-waveform embeddings (transfer learning), 40-dimensional Mel-Frequency Cepstral Coefficient (MFCC) features, and Mel-spectrogram representations into an integrated framework. Our work explicitly depends on pitch-related cues (captured primarily by the waveform embeddings and the spectrogram branch) together with complementary MFCC spectral features, which jointly improve robustness. The model concatenates 128-dimensional representations from each branch and refines the fused vector with fully connected layers. This study leverages SQEBSP, a recently published pitch-annotated Kurdish speech dataset collected by the authors, comprising 12,660 utterances from 431 speakers, to evaluate statement, question, and exclamation classification. The proposed method achieves approximately 97% accuracy on the training/validation data and about 88% accuracy on a separate held-out test set comprising 20% of the dataset, substantially outperforming single-feature baselines (58.8–79.3%) and prior three-class systems (68.0%). Ablation experiments confirm that the pitch-related inputs contribute substantially to classification accuracy, while MFCC features provide complementary spectral/timbre information. Our research indicates that the combination of pretrained wav2Vec 2.0 representations with multi-feature fusion and supervised fine-tuning provides an efficient method for pitch-informed speech classification in low-resource scenarios.
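As a rough sketch of the fusion architecture described above (three branches reduced to 128-dimensional vectors, concatenated, and refined by fully connected layers), the following PyTorch snippet shows one plausible reading; the branch front-ends, the pooled input dimensions other than the 40 MFCCs, and the classifier head sizes are assumptions rather than the paper's exact design.

```python
# Minimal PyTorch sketch of the described fusion head: three branches, each
# reduced to a 128-d vector, concatenated and passed through fully connected
# layers for 3-way classification. Branch front-ends are simplified stand-ins;
# the 768-d wav2Vec 2.0 dimension and mean-pooled inputs are assumptions.
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, w2v_dim=768, mfcc_dim=40, mel_dim=80, n_classes=3):
        super().__init__()
        self.w2v_branch = nn.Linear(w2v_dim, 128)    # pooled wav2Vec 2.0 embedding
        self.mfcc_branch = nn.Linear(mfcc_dim, 128)  # utterance-level MFCC features
        self.mel_branch = nn.Linear(mel_dim, 128)    # pooled Mel-spectrogram
        self.head = nn.Sequential(
            nn.Linear(128 * 3, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, n_classes),
        )

    def forward(self, w2v_feat, mfcc_feat, mel_feat):
        z = torch.cat([
            torch.relu(self.w2v_branch(w2v_feat)),
            torch.relu(self.mfcc_branch(mfcc_feat)),
            torch.relu(self.mel_branch(mel_feat)),
        ], dim=-1)
        return self.head(z)  # logits for statement / question / exclamation

model = FusionClassifier()
logits = model(torch.randn(2, 768), torch.randn(2, 40), torch.randn(2, 80))
print(logits.shape)  # torch.Size([2, 3])
```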
Citations: 0
One-class neural network with hybrid pooling on dual-band frequency for spoofing speech detection
IF 3.4 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-06 | DOI: 10.1016/j.csl.2026.101937
Jianqiang Zhang, Yushui Geng, Peng Zhang, Fuqiang Wang, Xiaoming Wu
Effective detection of spoofing attacks in automatic speaker verification relies heavily on discriminative artifacts confined to specific spectral or temporal regions. While low-frequency band features have demonstrated utility, the artifact information inherent in high-frequency components remains substantially underexplored. To bridge this gap, we propose OCNet-HPDB, an end-to-end one-class neural network that processes both lower and upper spectral halves through a hybrid pooling mechanism, enabling holistic exploitation of spoofing cues across the entire frequency spectrum. The model further incorporates an advanced loss function, the Compactness-enhanced Threshold-based One-class Softmax, which encourages tighter clustering of target-class samples in the embedding space. With strategic repositioning of the squeeze-and-excitation block ahead of residual connections, our approach achieves notable performance improvements on the ASVspoof 2019 LA and ASVspoof 2021 LA datasets without employing data augmentation. On the ASVspoof 2019 LA evaluation, our system achieves a 0.29% equal error rate (EER) and a 0.0094 minimum tandem detection cost function (min t-DCF), representing relative reductions of 34.09% and 35.17% over the prior state-of-the-art single systems. For the more challenging ASVspoof 2021 LA benchmark, the model attains a 7.2% EER, corresponding to a 24.84% relative improvement.
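The compactness-enhanced, threshold-based one-class softmax itself is not specified in the abstract; as a reference point, the sketch below implements the standard one-class softmax (OC-Softmax) objective that such variants typically extend, with commonly used margin and scale values and an assumed 128-dimensional embedding rather than the paper's settings.

```python
# Reference sketch of the standard one-class softmax (OC-Softmax) loss that the
# paper's Compactness-enhanced Threshold-based variant builds on; the exact
# compactness/threshold terms are not given in the abstract, so this is a
# baseline only. Margins and alpha follow commonly used values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneClassSoftmax(nn.Module):
    def __init__(self, feat_dim=128, m_target=0.9, m_spoof=0.2, alpha=20.0):
        super().__init__()
        self.w = nn.Parameter(torch.randn(feat_dim))
        self.m_target, self.m_spoof, self.alpha = m_target, m_spoof, alpha

    def forward(self, emb, labels):
        # emb: (B, feat_dim); labels: 0 = bona fide (target), 1 = spoof
        score = F.normalize(emb, dim=1) @ F.normalize(self.w, dim=0)  # cosine score
        margin = torch.where(labels == 0,
                             self.m_target - score,   # push bona fide above m_target
                             score - self.m_spoof)    # push spoof below m_spoof
        return F.softplus(self.alpha * margin).mean() # log(1 + exp(alpha * margin))

loss_fn = OneClassSoftmax()
loss = loss_fn(torch.randn(8, 128), torch.randint(0, 2, (8,)))
print(loss.item())
```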
Citations: 0
Decoding phone pairs from MEG signals across speech modalities
IF 3.4 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-05 | DOI: 10.1016/j.csl.2026.101939
Xabier de Zuazo, Eva Navas, Ibon Saratxaga, Mathieu Bourguignon, Nicola Molinaro
Understanding the neural mechanisms underlying speech production is essential for both advancing cognitive neuroscience theory and developing practical communication technologies. In this study, we investigated magnetoencephalography (MEG) signals to perform binary phone-pair classification from brain activity during speech production and perception (passive listening and voice playback) tasks. Using a dataset comprising 18 participants, we performed pairwise phone classification, extending our analysis to 20 phonetic pairs. Multiple machine learning approaches, including regularized linear models and neural network architectures, were compared to determine their effectiveness in decoding phonetic information. Our results demonstrate significantly higher decoding accuracy during speech production (73.4%) compared to passive listening and playback modalities (approximately 51%), emphasizing the richer neural information available during overt speech. Among the models, the Elastic Net classifier consistently outperformed more complex neural networks, highlighting the effectiveness of traditional regularization techniques when applied to limited and high-dimensional MEG datasets. In addition, analysis of specific brain frequency bands revealed that low-frequency oscillations, particularly Delta (0.2 Hz to 3 Hz) and Theta (4 Hz to 7 Hz), contributed the most substantially to decoding accuracy, suggesting that these bands encode critical speech production-related neural processes. Despite using advanced denoising methods, it remains unclear whether decoding solely reflects neural activity or if residual muscular or movement artifacts also contribute, indicating the need for further methodological refinement. Overall, our findings underline the critical importance of examining overt speech production paradigms, which, despite their complexity, offer opportunities to improve brain-computer interfaces to help individuals with severe speech impairments.
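To make the decoding setup concrete, the sketch below shows one plausible pipeline on synthetic data: band-pass MEG trials to the Theta band, flatten sensors-by-time features, and fit an elastic-net-penalized logistic regression with cross-validation. The sampling rate, band edges, estimator choice, and evaluation protocol are assumptions, not the paper's exact configuration.

```python
# Hedged sketch of binary phone-pair decoding from band-limited MEG features:
# band-pass to a low-frequency band, flatten sensors x time, and fit an
# elastic-net-penalized logistic regression. Data here are synthetic.
import numpy as np
from scipy.signal import butter, filtfilt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def bandpass(x, lo, hi, fs, order=4):
    b, a = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    return filtfilt(b, a, x, axis=-1)

rng = np.random.default_rng(0)
fs = 250                                      # assumed sampling rate (Hz)
X_raw = rng.standard_normal((80, 20, 250))    # trials x sensors x samples (synthetic)
y = rng.integers(0, 2, 80)                    # phone A vs phone B labels

X_theta = bandpass(X_raw, 4.0, 7.0, fs)       # Theta band, as highlighted in the study
X = X_theta.reshape(len(X_theta), -1)         # flatten sensors x time per trial

clf = LogisticRegression(penalty="elasticnet", solver="saga",
                         l1_ratio=0.5, C=1.0, max_iter=5000)
print(cross_val_score(clf, X, y, cv=5).mean())  # ~0.5 (chance) on synthetic data
```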
Citations: 0
Speech acoustics to rt-MRI articulatory dynamics inversion with video diffusion model
IF 3.4 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-31 | DOI: 10.1016/j.csl.2025.101928
Xuan Shi, Tiantian Feng, Jay Park, Christina Hagedorn, Louis Goldstein, Shrikanth Narayanan
Inverting speech acoustics to articulatory dynamics presents a multidisciplinary challenge, spanning clinical, linguistic, and engineering domains, with applications including speech therapy and second language learning. Despite its significance, existing methods lack a systematic approach for generating articulatory dynamics of a more complete vocal tract from speech acoustics. Availability of spatio-temporally rich video covering the entire oro-pharynx and laryngeal region of the vocal tract during speech at high frame rates (83 frames/second) using real-time MRI (rt-MRI), alongside linguistic-theory guided computational frameworks, offers new possibilities to improve speech-to-articulatory inversion. In this work, we propose a novel system for inverting speech acoustics to articulatory dynamics using an rt-MRI driven video diffusion model. Additionally, we introduce a new evaluation method for video generative models: a linguistic knowledge-guided pixel intensity correlation within articulatory regions of interest (ROIs). Our results demonstrate the system's competitive performance in generalizing to unseen reference speech audio.
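The ROI-based pixel-intensity correlation metric can be illustrated as follows: average pixel intensity inside each articulatory region per frame, then correlate the resulting time series between generated and reference videos. The ROI masks, array shapes, and frame count in this sketch are placeholders, not the paper's actual regions.

```python
# Sketch of the ROI-based evaluation idea: compare mean pixel-intensity time
# series inside articulatory regions of interest between generated and
# reference rt-MRI videos using Pearson correlation.
import numpy as np

def roi_intensity_correlation(gen, ref, roi_masks):
    """gen, ref: (T, H, W) videos; roi_masks: dict name -> (H, W) boolean mask."""
    scores = {}
    for name, mask in roi_masks.items():
        g = gen[:, mask].mean(axis=1)   # per-frame mean intensity inside the ROI
        r = ref[:, mask].mean(axis=1)
        scores[name] = np.corrcoef(g, r)[0, 1]
    return scores

T, H, W = 83, 84, 84                     # e.g. one second of video at 83 frames/s
rng = np.random.default_rng(0)
gen, ref = rng.random((T, H, W)), rng.random((T, H, W))
masks = {"tongue_tip": np.zeros((H, W), bool), "velum": np.zeros((H, W), bool)}
masks["tongue_tip"][40:50, 30:40] = True  # placeholder rectangular ROIs
masks["velum"][20:28, 50:60] = True
print(roi_intensity_correlation(gen, ref, masks))
```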
Citations: 0
Entrainment detection using DNN
IF 3.4 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-26 | DOI: 10.1016/j.csl.2025.101930
Jay Kejriwal, Štefan Beňuš, Lina M. Rojas-Barahona
During conversation, speakers adjust their linguistic characteristics to become more similar to their partners. This complex phenomenon is known as entrainment, and speakers dynamically entrain as well as disentrain on different linguistic features. Researchers have utilized a range of computational methods to explore entrainment. Recent technological advancements have facilitated the use of deep learning, which offers a systematic quantification of acoustic entrainment dynamics. In this study, we investigate the capability of deep learning architectures to extract and leverage textual features for the efficient representation and learning of entrainment. By adjusting the architecture of an acoustic-based DNN entrainment model, we present an unsupervised deep learning framework that derives representations from textual features containing relevant information for identifying entrainment at three linguistic levels: lexical, syntactic, and semantic. To investigate the performance of each model within the proposed framework, various text-based and speech features were extracted. Entrainment was quantified using different distance measures in the representation space. The performance of the trained models was evaluated by distinguishing real and sham conversations using the proposed distances. Our results suggest that acoustic-based DNN models outperform text-based DNN models and that distance measures affect the models’ performance.
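One way to picture the distance-based evaluation is sketched below: compute a distance between partners' turn embeddings in a real conversation and in a sham pairing, and check whether real pairs are closer. The embeddings, the cosine distance, and the way the sham pairing is built here are illustrative assumptions, not the paper's learned representations or protocol.

```python
# Illustrative sketch: mean cosine distance between adjacent turn embeddings of
# two conversation partners, compared against a sham pairing. Real entrainment
# analysis would use learned representations and partners from other dialogues.
import numpy as np

def mean_adjacent_distance(turns_a, turns_b):
    """Mean cosine distance between temporally adjacent turn embeddings."""
    a = turns_a / np.linalg.norm(turns_a, axis=1, keepdims=True)
    b = turns_b / np.linalg.norm(turns_b, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(a * b, axis=1)))

rng = np.random.default_rng(0)
real_a = rng.standard_normal((50, 128))   # speaker A turn embeddings (synthetic)
real_b = rng.standard_normal((50, 128))   # speaker B turn embeddings (synthetic)
sham_b = rng.permutation(real_b)          # sham: partner turns taken out of order

d_real = mean_adjacent_distance(real_a, real_b)
d_sham = mean_adjacent_distance(real_a, sham_b)
print(d_real, d_sham)   # entrainment would predict d_real < d_sham on real data
```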
Citations: 0
Speech emotion recognition using multimodal LLMs and quality-controlled TTS-based data augmentation for Iberian languages
IF 3.4 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-23 | DOI: 10.1016/j.csl.2025.101927
Jaime Bellver-Soler, Anmol Guragain, Samuel Ramos-Varela, Ricardo Córdoba, Luis Fernando D’Haro
This work proposes the use of multimodal large language models for speech emotion recognition (SER) in low-resource settings for Iberian languages. Given the limited amount of annotated SER data for other languages, we also propose a pipeline for generating high-quality synthetic data using existing emotional text-to-speech (TTS) systems and their voice-cloning capabilities.
Specifically, we design a selective, quality-controlled TTS pipeline combining LLM-ensemble translation with self-verification and expressive voice cloning, followed by automatic ASR-WER, speaker-similarity, and emotion filters. This approach introduces a novel filtering strategy that ensures synthetic data reliability. The resulting data include MSP-MEA, a synthetic Spanish extension of MSP-Podcast.
Building on our previous multimodal SER framework, we compare the use of frozen LLMs as classifiers with an MLP baseline and evaluate classical versus TTS-based augmentation across five corpora (IEMOCAP, MEACorpus, EMS, VERBO, AhoEmo3). The best configuration (W2v-BERT-2 → attentive pooling → frozen Bloomz-7b1) improves mean F1 by 4.9 points over an MLP head baseline. Among augmentation techniques, Mix-up remains the most robust overall, while TTS achieves competitive performance, surpassing traditional data augmentation techniques on EMS and VERBO. These results indicate that carefully filtered TTS data can complement classical perturbations, providing a viable, dataset-dependent strategy for multilingual SER. Code, models, and datasets are publicly released.
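The quality-control gate can be pictured as a simple conjunction of thresholds over the three automatic checks named above (ASR-WER, speaker similarity, emotion match); the sketch below uses illustrative thresholds and a plain Levenshtein-based WER, not the paper's actual filters or values.

```python
# Hedged sketch of the quality gate for synthetic TTS clips: keep a clip only
# if the ASR transcript is close to the prompt (WER), the cloned voice is
# similar enough to the reference speaker, and the predicted emotion matches
# the target label. Thresholds are illustrative.
def wer(ref_words, hyp_words):
    """Word error rate via Levenshtein distance over word sequences."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(hyp_words) + 1)]
         for i in range(len(ref_words) + 1)]
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            sub = d[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / max(len(ref_words), 1)

def keep_clip(asr_text, prompt_text, spk_similarity, pred_emotion, target_emotion,
              max_wer=0.2, min_sim=0.75):
    return (wer(prompt_text.split(), asr_text.split()) <= max_wer
            and spk_similarity >= min_sim
            and pred_emotion == target_emotion)

print(keep_clip("estoy muy contento hoy", "estoy muy contento hoy",
                spk_similarity=0.82, pred_emotion="happy", target_emotion="happy"))
```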
Citations: 0
Artificial protozoa lotus effect algorithm enabled cognitive brain optimal model for sentiment analysis utilizing multimodal data
IF 3.4 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-22 | DOI: 10.1016/j.csl.2025.101929
Sanjeevkumar Angadi, Saili Hemant Sable, Tejaswini Zope, Rajani Amol Hemade, Vaibhavi Umesh Avachat
Understanding public sentiment derived from online data is a challenging research problem with numerous applications, including contextual analysis and opinion assessment on specific events. Traditionally, sentiment analysis has concentrated on a single modality, such as text or images. However, utilizing multimodal information such as images, text, and audio can enhance model accuracy. Despite this advantage, combining visual and textual features often leads to decreased performance, mainly because of the model's inability to efficiently capture the intricate relationships amongst diverse modalities. To confront these challenges, a new technique, the Artificial Protozoa Lotus Effect Algorithm-enabled Cognitive Brain Optimal Model (APLEA_CBO), has been developed for sentiment analysis using multimodal data. Initially, feature extraction is performed on the audio data to obtain a feature vector (outcome 1). Similarly, feature extraction is conducted on the input text to extract suitable features (outcome 2). Both feature sets are then processed for sentiment analysis using the Cognitive Brain Optimal Model (CBOM), which is developed by employing Recurrent Denoising Long Short-Term Memory (RD-LSTM). The CBOM is trained using the Artificial Protozoa Lotus Effect Algorithm (APLEA), which integrates Artificial Protozoa Optimization (APO) and the Lotus Effect Algorithm (LEA). The APLEA_CBO model achieves an FPR of 7.17%, a recall of 92.76%, a precision of 90.62%, and an accuracy of 90.60%.
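For reference, the reported evaluation metrics follow directly from confusion-matrix counts; the sketch below computes them from made-up counts, not the paper's data.

```python
# Standard classification metrics from confusion-matrix counts; the counts
# here are illustrative placeholders only.
def classification_metrics(tp, fp, tn, fn):
    return {
        "accuracy":  (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall":    tp / (tp + fn),
        "fpr":       fp / (fp + tn),
    }

print(classification_metrics(tp=90, fp=9, tn=116, fn=7))
```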
Citations: 0
Leveraging saliency-based pre-trained foundation model representations to uncover breathing patterns in speech
IF 3.4 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-20 | DOI: 10.1016/j.csl.2025.101926
Vikramjit Mitra, Anirban Chatterjee, Ke Zhai, Helen Weng, Ayuko Hill, Nicole Hay, Christopher Webb, Jamie Cheng, Erdrin Azemi
The process of human speech production involves coordinated respiratory action to elicit acoustic speech signals. Typically, speech is produced when air is forced from the lungs and is modulated by the vocal tract, and such actions are interspersed with moments of breathing in air (inhalation) to refill the lungs. Respiratory rate (RR) is a vital metric that is used to assess the overall health, fitness, and general well-being of an individual. Existing approaches to measuring RR (the number of breaths one takes in a minute) rely on specialized equipment or training. Studies have demonstrated that machine learning algorithms can be used to estimate RR using bio-sensor signals as input. Speech-based estimation of RR can offer an effective approach to measuring this vital metric without requiring any specialized equipment or sensors. This work investigates a machine learning based approach to estimate RR from speech segments obtained from subjects speaking into a close-talking microphone device. Data were collected from N=26 individuals, where the ground-truth RR was obtained from commercial-grade chest belts and then manually corrected for any errors. A convolutional long short-term memory network (Conv-LSTM) is proposed to estimate respiration time-series data from the speech signal. We demonstrate that pre-trained representations obtained from a foundation model, such as Wav2Vec2, can be used to estimate the respiration time series with low root-mean-squared error and high correlation coefficient compared with the baseline. The model-driven time series can be used to estimate RR with a low mean absolute error (MAE) of approximately 1.6 breaths/min.
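A simple way to see how a predicted respiration time series becomes an RR estimate is peak counting, sketched below on a synthetic signal; the output frame rate, peak-picking parameters, and the signal itself are assumptions rather than the paper's post-processing.

```python
# Sketch: turn a predicted respiration time series into a respiration-rate
# estimate by peak counting, then score it against a reference rate.
import numpy as np
from scipy.signal import find_peaks

def respiration_rate(signal, fs, min_breath_interval_s=2.0):
    peaks, _ = find_peaks(signal, distance=int(min_breath_interval_s * fs))
    duration_min = len(signal) / fs / 60.0
    return len(peaks) / duration_min          # breaths per minute

fs = 25                                        # assumed frame rate of the model output
t = np.arange(0, 60, 1 / fs)                   # one minute of predicted signal
true_rr = 15.0
pred = (np.sin(2 * np.pi * (true_rr / 60.0) * t)
        + 0.1 * np.random.default_rng(0).standard_normal(len(t)))

est = respiration_rate(pred, fs)
print(est, abs(est - true_rr))                 # estimate and absolute error (breaths/min)
```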
Citations: 0
Is self-supervised learning enough to fill in the gap? A study on speech inpainting
IF 3.4 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-16 | DOI: 10.1016/j.csl.2025.101922
Ihab Asaad, Maxime Jacquelin, Olivier Perrotin, Laurent Girin, Thomas Hueber
Speech inpainting consists in reconstructing corrupted or missing speech segments using surrounding context, a process that closely resembles the pretext tasks in Self-Supervised Learning (SSL) for speech encoders. This study investigates using SSL-trained speech encoders for inpainting without any additional training beyond the initial pretext task, and simply adding a decoder to generate a waveform. We compare this approach to supervised fine-tuning of speech encoders for a downstream task—here, inpainting. Practically, we integrate HuBERT as the SSL encoder and HiFi-GAN as the decoder in two configurations: (1) fine-tuning the decoder to align with the frozen pre-trained encoder’s output and (2) fine-tuning the encoder for an inpainting task based on a frozen decoder’s input. Evaluations are conducted under single- and multi-speaker conditions using in-domain datasets and out-of-domain datasets (including unseen speakers, diverse speaking styles, and noise). Both informed and blind inpainting scenarios are considered, where the position of the corrupted segment is either known or unknown. The proposed SSL-based methods are benchmarked against several baselines, including a text-informed method combining automatic speech recognition with zero-shot text-to-speech synthesis. Performance is assessed using objective metrics and perceptual evaluations. The results demonstrate that both approaches outperform baselines, successfully reconstructing speech segments up to 200 ms, and sometimes up to 400 ms. Notably, fine-tuning the SSL encoder achieves more accurate speech reconstruction in single-speaker settings, while a pre-trained encoder proves more effective for multi-speaker scenarios. This demonstrates that an SSL pretext task can transfer to speech inpainting, enabling successful speech reconstruction with a pre-trained encoder.
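The informed-inpainting data flow can be sketched schematically as follows: zero out the known-corrupted span, run the full context through a frozen encoder and a vocoder-style decoder, and splice the reconstructed samples back into the gap. The TinyEncoder and TinyDecoder modules below are lightweight stand-ins for HuBERT and HiFi-GAN, not the actual models or the paper's training setup.

```python
# Schematic sketch of informed speech inpainting with an SSL-style encoder and
# a vocoder-style decoder; the modules are toy stand-ins, not HuBERT/HiFi-GAN.
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Lightweight stand-in for a frozen HuBERT-style SSL encoder."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(1, 64, kernel_size=320, stride=320)

    def forward(self, wav):                      # wav: (B, T) waveform
        return self.conv(wav.unsqueeze(1))       # (B, 64, T // 320) frame features

class TinyDecoder(nn.Module):
    """Lightweight stand-in for a HiFi-GAN-style vocoder."""
    def __init__(self):
        super().__init__()
        self.deconv = nn.ConvTranspose1d(64, 1, kernel_size=320, stride=320)

    def forward(self, feat):
        return self.deconv(feat).squeeze(1)      # back to (B, T)

def inpaint_informed(wav, start, end, encoder, decoder):
    """Informed setting: the corrupted span [start, end) is known."""
    corrupted = wav.clone()
    corrupted[:, start:end] = 0.0                # mask the gap, keep surrounding context
    with torch.no_grad():
        recon = decoder(encoder(corrupted))      # resynthesize a full waveform
    out = wav.clone()
    out[:, start:end] = recon[:, start:end]      # splice only the reconstructed gap
    return out

wav = torch.randn(1, 16000)                      # 1 s of audio at 16 kHz
restored = inpaint_informed(wav, 6400, 9600, TinyEncoder(), TinyDecoder())  # 200 ms gap
print(restored.shape)                            # torch.Size([1, 16000])
```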
Citations: 0