
Latest publications in Computer Speech and Language

Audiovisual speech enhancement and voice activity detection using generative and regressive visual features
IF 3.4 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-14 | DOI: 10.1016/j.csl.2025.101924
Cheng Yu, Vahid Ahmadi Kalkhorani, Buye Xu, DeLiang Wang
We present an audiovisual speech enhancement (AVSE) system to address two related tasks: speech enhancement (SE) and voice activity detection (VAD). The system is based on a complex spectral mapping model and performs two-stage audiovisual fusion. The first stage is a signal-level fusion module, where a generative lip-to-speech conversion method produces time-frequency (T-F) features from lip movements. This allows the system to leverage noise-free T-F representations, which are crucial for improving speech intelligibility, particularly in challenging acoustic environments. The second stage is an embedding-level fusion module, where high-dimensional embedding features from a jointly trained visual encoder are integrated. Additionally, we propose a multitask learning framework that optimizes both SE and VAD tasks. The inclusion of a VAD decoder enables the system to distinguish speech from non-speech segments. We evaluate the system on multiple benchmark datasets, including COG-MHEAR, LRS3-AudioSet, and LRS3-CHiME3, and achieve state-of-the-art SE and speech recognition results, and significant robustness in VAD compared to the audio-only baseline. These results highlight the effectiveness of our system in realistic environments.
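The abstract stays at the architecture level; as a rough illustration of the multitask SE + VAD objective it describes, here is a minimal PyTorch sketch. The backbone, layer sizes, the L1 spectral-mapping loss, and the 0.1 VAD weight are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch (not the authors' code): a shared backbone standing in for the
# two-stage audiovisual fusion, an SE head predicting real/imaginary T-F outputs,
# and a small VAD head predicting frame-wise speech activity.
import torch
import torch.nn as nn

class AVSEMultitask(nn.Module):
    def __init__(self, feat_dim=257, hidden=256):
        super().__init__()
        self.backbone = nn.GRU(feat_dim * 2, hidden, batch_first=True)
        self.se_head = nn.Linear(hidden, feat_dim * 2)   # real + imaginary parts
        self.vad_head = nn.Linear(hidden, 1)             # frame-wise speech/non-speech

    def forward(self, noisy_tf, visual_tf):
        # noisy_tf, visual_tf: (batch, frames, feat_dim) feature streams
        x = torch.cat([noisy_tf, visual_tf], dim=-1)
        h, _ = self.backbone(x)
        return self.se_head(h), self.vad_head(h).squeeze(-1)

def multitask_loss(se_pred, se_target, vad_logits, vad_labels, vad_weight=0.1):
    se_loss = nn.functional.l1_loss(se_pred, se_target)  # spectral mapping term
    vad_loss = nn.functional.binary_cross_entropy_with_logits(vad_logits, vad_labels)
    return se_loss + vad_weight * vad_loss
```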
Citations: 0
Survey of end-to-end multi-speaker automatic speech recognition for monaural audio
IF 3.4 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-11 | DOI: 10.1016/j.csl.2025.101925
Xinlu He, Jacob Whitehill
Monaural multi-speaker automatic speech recognition (ASR) remains challenging due to data scarcity and the intrinsic difficulty of recognizing and attributing words to individual speakers, particularly in overlapping speech. Recent advances have driven the shift from cascade systems to end-to-end (E2E) architectures, which reduce error propagation and better exploit the synergy between speech content and speaker identity. Despite rapid progress in E2E multi-speaker ASR, the field lacks a comprehensive review of recent developments. This survey provides a systematic taxonomy of E2E neural approaches for multi-speaker ASR, highlighting recent advances and comparative analysis. Specifically, we analyze: (1) architectural paradigms (single-input-multiple-output (SIMO) vs. single-input-single-output (SISO)) for pre-segmented audio, analyzing their distinct characteristics and trade-offs; (2) recent architectural and algorithmic improvements based on these two paradigms, including multi-modal inputs; (3) extensions to long-form speech, including segmentation strategy and speaker-consistent hypothesis stitching. Further, we (4) evaluate and compare methods across standard benchmarks. We conclude with a discussion of open challenges and future research directions towards building robust and scalable multi-speaker ASR.
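For readers unfamiliar with the two paradigms in point (1), the sketch below contrasts them at the output-head level. Module names, dimensions, and the separator-token scheme are illustrative assumptions, not taken from the survey.

```python
# Illustrative contrast: SIMO emits one hypothesis stream per speaker (typically
# trained with a permutation-invariant assignment), while SISO serializes all
# speakers into a single token stream separated by a "speaker change" token.
import torch
import torch.nn as nn

class SIMOHead(nn.Module):
    def __init__(self, enc_dim=256, vocab=1000, num_speakers=2):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Linear(enc_dim, vocab) for _ in range(num_speakers)
        )

    def forward(self, enc_out):                 # enc_out: (batch, frames, enc_dim)
        # one logit stream per speaker
        return [branch(enc_out) for branch in self.branches]

class SISOHead(nn.Module):
    def __init__(self, enc_dim=256, vocab=1000, sc_token_id=999):
        super().__init__()
        self.proj = nn.Linear(enc_dim, vocab)
        self.sc_token_id = sc_token_id          # hypothetical separator token id

    def forward(self, enc_out):
        # a single logit stream; references are concatenated as
        # "speaker1 tokens <sc> speaker2 tokens" (serialized output training)
        return self.proj(enc_out)
```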
Citations: 0
Enhanced audio-visual speech enhancement with posterior sampling methods in recurrent variational autoencoders
IF 3.4 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-06 | DOI: 10.1016/j.csl.2025.101923
Z. Foroushi, R.M. Dansereau
Recovering intelligible speech in noise is essential for robust communication. This work presents an audio-visual speech enhancement framework based on a Recurrent Variational Autoencoder (AV-RVAE), where posterior inference is extended using sampling-based methods including the Metropolis-Adjusted Langevin Algorithm (MALA), Langevin Dynamics EM (LDEM), Hamiltonian Monte Carlo (HMC), Barker sampling, and a hybrid MALA+Barker variant. To isolate the contribution of visual cues, an audio-only baseline (A-RVAE) is trained and evaluated under identical data and inference conditions.
Performance is assessed using Scale-Invariant Signal-to-Distortion Ratio (SI-SDR), Perceptual Evaluation of Speech Quality (PESQ), and Short-Time Objective Intelligibility (STOI), along with anytime convergence curves (metric versus wall-clock time) and the Real-Time Factor (RTF; ratio of runtime to audio duration) to measure computational efficiency.
Experimental results show that the hybrid MALA+Barker sampler achieves the best overall performance, while LDEM and step-size-optimized MALA exhibit the lowest RTFs; overall, the MALA+Barker sampler offers the most favorable balance between efficiency and enhancement quality. Across all sampling strategies, the AV-RVAE consistently surpasses the audio-only baseline, particularly at low SNRs, confirming the benefit of visual fusion combined with advanced posterior sampling for robust speech enhancement in challenging acoustic environments.
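As a reference for the sampling machinery mentioned above, here is a minimal numpy sketch of a single MALA update for a generic log-density. The toy Gaussian target, step size, and dimensionality are arbitrary choices, not the paper's setup.

```python
# One Metropolis-Adjusted Langevin step targeting exp(log_p): a Langevin
# proposal followed by a Metropolis-Hastings accept/reject correction.
import numpy as np

def mala_step(z, log_p, grad_log_p, eps, rng):
    mean_fwd = z + 0.5 * eps ** 2 * grad_log_p(z)
    prop = mean_fwd + eps * rng.standard_normal(z.shape)
    mean_bwd = prop + 0.5 * eps ** 2 * grad_log_p(prop)
    # Gaussian proposal log-densities q(prop | z) and q(z | prop), std = eps
    log_q_fwd = -np.sum((prop - mean_fwd) ** 2) / (2 * eps ** 2)
    log_q_bwd = -np.sum((z - mean_bwd) ** 2) / (2 * eps ** 2)
    log_alpha = log_p(prop) - log_p(z) + log_q_bwd - log_q_fwd
    if np.log(rng.uniform()) < log_alpha:
        return prop, True
    return z, False

# toy usage: sample a standard Gaussian "latent" of dimension 16
rng = np.random.default_rng(0)
log_p = lambda z: -0.5 * np.sum(z ** 2)
grad_log_p = lambda z: -z
z = rng.standard_normal(16)
for _ in range(100):
    z, accepted = mala_step(z, log_p, grad_log_p, eps=0.5, rng=rng)
```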
Citations: 0
Do modern speech LLMs and re-scoring techniques improve bilingual ASR performance for Basque and Spanish in domain-specific contexts?
IF 3.4 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-27 | DOI: 10.1016/j.csl.2025.101905
Ander González-Docasal, Juan Camilo Vásquez-Correa, Haritz Arzelus, Aitor Álvarez, Santiago A. Moreno-Acevedo
This paper presents an extended evaluation of Vicomtech’s automatic speech recognition (ASR) systems developed for the Albayzín 2024 Bilingual Basque-Spanish Speech-to-Text (BBS-S2T) Challenge, a task focused on transcribing bilingual parliamentary recordings featuring frequent intra- and inter-sentential code-switching between Basque and Spanish. These recordings, drawn from Basque Parliament plenary sessions, pose significant challenges due to the abrupt language alternations, the limited availability of digital resources for Basque, and the absence of contextual and speaker information. The study incorporates additional analysis of state-of-the-art ASR architectures, namely Phi4-multimodal and CrisperWhisper, fine-tuned on the challenge dataset. Furthermore, the systems were evaluated on a complementary benchmark to assess model robustness. A detailed comparison of automatic hypothesis selection techniques, including both traditional n-gram and large language model (LLM)-based approaches, is also provided. Results demonstrate that optimal word error rate (WER) does not always correlate with the most accurate transcriptions, highlighting the complexity of evaluating ASR performance in code-switching scenarios.
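The hypothesis-selection comparison can be pictured with a small sketch of n-best rescoring: each candidate's ASR score is combined with an external language-model score (n-gram or LLM-based) and the best total wins. The interface and the 0.3 weight are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of n-best rescoring; `lm_score` is a stand-in for either an
# n-gram LM log-probability or an LLM-based score.
from typing import Callable, List, Tuple

def rescore_nbest(
    nbest: List[Tuple[str, float]],          # (hypothesis text, ASR log-score)
    lm_score: Callable[[str], float],        # external LM log-score
    lm_weight: float = 0.3,
) -> str:
    best_text, best_total = "", float("-inf")
    for text, asr_score in nbest:
        total = asr_score + lm_weight * lm_score(text)
        if total > best_total:
            best_text, best_total = text, total
    return best_text

# toy usage with a dummy LM that prefers shorter hypotheses
hyps = [("turn on the lights please", -12.3), ("turn on the lights", -13.1)]
print(rescore_nbest(hyps, lm_score=lambda t: -len(t.split())))
```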
Citations: 0
Keyword Mamba: Spoken keyword spotting with state space models
IF 3.4 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-27 | DOI: 10.1016/j.csl.2025.101909
Hanyu Ding, Wenlong Dong, Qirong Mao
Keyword spotting (KWS) is an essential task in speech processing. It is widely used in voice assistants and smart devices. Deep learning models like CNNs, RNNs, and Transformers have performed well in KWS. However, they often struggle to handle long-term patterns and stay efficient at the same time. In this work, we present Keyword Mamba, a new architecture for KWS. It uses a neural state space model (SSM) called Mamba. We apply Mamba along the time axis and also explore how it can replace the self-attention part in Transformer models. We test our model on the Google Speech Commands datasets. The results show that Keyword Mamba reaches strong accuracy with fewer parameters and lower computational cost. To our knowledge, this is the first time a state space model has been used for KWS. These results suggest that Mamba has strong potential in speech-related tasks.
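For context on the state space model underlying Mamba, the following numpy sketch runs the basic linear SSM recurrence. It omits Mamba's input-dependent (selective) parameters and hardware-aware scan, and all dimensions and the discretization scheme are arbitrary illustrative choices.

```python
# Not the Mamba implementation; a toy linear state space scan:
#   x_t = A_bar * x_{t-1} + B_bar @ u_t,   y_t = C @ x_t
# with diagonal A and a fixed discretization step `delta`.
import numpy as np

def ssm_scan(u, A_diag, B, C, delta=0.1):
    """u: (T, d_in); A_diag: (d_state,); B: (d_state, d_in); C: (d_out, d_state)."""
    A_bar = np.exp(delta * A_diag)          # per-state decay
    B_bar = delta * B                       # simple Euler-style discretization
    x = np.zeros_like(A_diag)
    ys = []
    for u_t in u:
        x = A_bar * x + B_bar @ u_t
        ys.append(C @ x)
    return np.stack(ys)

T, d_in, d_state, d_out = 50, 8, 16, 8
rng = np.random.default_rng(0)
y = ssm_scan(rng.standard_normal((T, d_in)),
             A_diag=-np.abs(rng.standard_normal(d_state)),
             B=rng.standard_normal((d_state, d_in)) * 0.1,
             C=rng.standard_normal((d_out, d_state)) * 0.1)
print(y.shape)  # (50, 8)
```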
Citations: 0
MS-Swinformer and DMTL: Multi-scale spatial fusion and dynamic multi-task learning for speech emotion recognition
IF 3.4 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-26 | DOI: 10.1016/j.csl.2025.101908
Defu Lan, Hai Cheng
Speech is a vital medium for communication and emotional expression, often embedding rich affective information in human interactions. Effectively uncovering and leveraging such emotional cues holds significant potential across domains such as mental health, education, and automotive safety. However, existing methods often suffer from incomplete audio feature extraction and imbalanced feature utilization. To address these challenges, this paper proposes a novel Speech Emotion Recognition (SER) framework based on Multi-Scale Spatial Fusion using Swin-Transformer (MS-Swinformer) and Dynamic Multi-Task Learning (DMTL). Specifically, we first design a multi-scale feature extraction module that captures localized patterns in both frequency and temporal dimensions via convolutional kernels of varying sizes. Next, we enhance the Swin-Transformer architecture by incorporating an adaptive window attention mechanism, which effectively models the hierarchical feature dependencies in long-duration speech signals, thereby improving the perception of both local and global contextual information. In addition, we introduce a dynamic multi-task learning strategy that jointly optimizes high-level semantic features extracted via Wav2Vec2 and low-level acoustic features derived from MFCCs. By dynamically adjusting task weights during training, our approach enables optimal fusion of multi-source information and mitigates the problem of feature utilization imbalance. Extensive experiments on the IEMOCAP and CASIA datasets demonstrate that our model achieves highly competitive performance compared to existing state-of-the-art methods.
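The multi-scale feature extraction idea can be sketched as parallel 2-D convolutions with different kernel sizes over the spectrogram; the kernel sizes and channel counts below are assumptions for illustration, not the paper's configuration.

```python
# Hedged sketch of a multi-scale front end: parallel Conv2d branches with
# different (time, freq) kernel sizes, concatenated along the channel axis.
import torch
import torch.nn as nn

class MultiScaleConv(nn.Module):
    def __init__(self, in_ch=1, out_ch=16, kernel_sizes=((3, 3), (5, 5), (3, 7))):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, k, padding=(k[0] // 2, k[1] // 2))
            for k in kernel_sizes
        )

    def forward(self, spec):                  # spec: (batch, 1, time, freq)
        return torch.cat([b(spec) for b in self.branches], dim=1)

x = torch.randn(2, 1, 100, 80)                # e.g. 100 frames x 80 frequency bins
print(MultiScaleConv()(x).shape)              # torch.Size([2, 48, 100, 80])
```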
Citations: 0
Toward robust replay attack detection in Automatic Speaker Verification: A study of spectrum estimation and channel magnitude response modeling
IF 3.4 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-26 | DOI: 10.1016/j.csl.2025.101906
Şule Bekiryazıcı, Cemal Hanilçi, Neyir Ozcan
Automatic Speaker Verification (ASV) systems are increasingly adopted for biometric authentication but remain highly vulnerable to spoofing, particularly replay attacks. Existing countermeasures (CMs) for replay attack detection rely predominantly on discrete Fourier transform (DFT)-based spectral features, which are sensitive to noise and channel distortions common in physical access (PA) scenarios. This work presents the first comprehensive study of Channel Magnitude Response (CMR) representations for replay detection, explicitly analyzing the impact of spectrum estimation and feature design. The contributions of this work are fourfold: (i) CMR estimation is generalized beyond MFCCs to LFCC and CQCC features, with LFCC-based CMRs offering superior discrimination; (ii) alternative spectrum estimators, linear prediction (LP) and multitaper (MT), are integrated into the CMR pipeline, yielding substantial gains over conventional DFT; (iii) robustness is investigated under silence-free (voiced-only) conditions, mitigating known biases in ASVspoof datasets; and (iv) a systematic evaluation of CMR is provided on the recently released ReplayDF corpus, a challenging benchmark combining replay and synthetic speech variability. Experiments on ASVspoof 2017, 2019, 2021, and ReplayDF using both baseline classifiers (ResNet18 and LCNN) and stronger models (Res2Net50 and SE-Res2Net50) show that the proposed approach consistently outperforms conventional features. In particular, LFCC–CMR features with LP spectra achieve an Equal Error Rate (EER) as low as 1.34% on ASVspoof 2019 (PA), representing considerable relative improvements over traditional methods. Moreover, CMR-based systems retain high performance even when silent segments are removed, unlike conventional approaches. These results establish CMR with principled spectral modeling as a robust and generalizable framework for replay attack detection, opening new directions for resilient spoofing countermeasures.
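For reference, below is a minimal sketch of LFCC extraction, the base feature behind the best-performing LFCC-CMR variant: power spectrum, linearly spaced triangular filter bank, log compression, DCT. Frame, filter, and coefficient counts are illustrative, and the CMR estimation itself is not reproduced here.

```python
# Hedged sketch of linear-frequency cepstral coefficient (LFCC) extraction.
import numpy as np
from scipy.fft import dct

def linear_filterbank(n_filters, n_fft, sr):
    # triangular filters with linearly spaced center frequencies
    edges = np.linspace(0, sr / 2, n_filters + 2)
    bins = np.floor((n_fft // 2 + 1) * edges / (sr / 2)).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fb[i, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:
            fb[i, c:r] = (r - np.arange(c, r)) / (r - c)
    return fb

def lfcc(frames, sr=16000, n_fft=512, n_filters=20, n_ceps=20):
    # frames: (num_frames, frame_len), already windowed
    spec = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2          # power spectrum
    fb_energies = spec @ linear_filterbank(n_filters, n_fft, sr).T
    log_e = np.log(fb_energies + 1e-10)
    return dct(log_e, type=2, axis=-1, norm="ortho")[:, :n_ceps]
```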
Citations: 0
A robust framework for noisy speech recognition using Frequency-Guided-Swin Transformer
IF 3.4 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-24 | DOI: 10.1016/j.csl.2025.101907
Noussaiba Djeffal, Djamel Addou, Hamza Kheddar, Sid Ahmed Selouani
Conventional automatic speech recognition (ASR) systems often struggle to generalize across diverse and noisy environments, where background interference significantly degrades recognition accuracy. This work presents a novel approach to noisy speech recognition by combining convolutional neural networks (CNN) and Swin Transformer with frequency-guided multi-head self-attention (FG-MSA) architectures. The proposed method addresses the challenge of recognizing speech in noisy environments, focusing on character-level transcription from noisy audio. The CNN efficiently extracts localized features, while the Swin Transformer, with its hierarchical structure and shifted window mechanism, captures both local and long-range dependencies. The FG-MSA mechanism is introduced to guide the attention mechanism toward frequency components that are most relevant for speech recognition, improving robustness in noisy conditions. The proposed method improves performance and efficiency for ASR, especially in noisy environments, and is evaluated on the Aurora-2 dataset and the noisy speech commands (NSC) dataset. The proposed CNN-FG-Swin Transformer achieved an average accuracy of 87.19% on the isolated Aurora-2 dataset, outperforming the baseline Swin Transformer by 2.49%. Across all datasets, the proposed model achieved an average accuracy of 87.01%, outperforming all the compared state-of-the-art methods. On the NSC dataset at -9 dB, it achieved a word error rate (WER) of 36.20%, outperforming the end-to-end capsule network models by 8% as well as the DNN (38.63%) and LSTM (69.09%) baselines, confirming its robustness in real-world conditions.
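Since the results above are reported as word error rate, a standard WER computation via Levenshtein distance over word sequences is shown below for reference; it is generic, not code from the paper.

```python
# WER = (substitutions + deletions + insertions) / number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("turn the lights on", "turn lights on"))  # 0.25 (one deletion)
```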
Citations: 0
Using Knowledge Induction strategies: LLMs can do better in knowledge-driven dialogue tasks
IF 3.4 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-19 | DOI: 10.1016/j.csl.2025.101903
Sisi Peng, Wenlin Zhang, Hao Zhang, Shunhang Li, Dan Qu
Large language models (LLMs) encode vast knowledge through pre-training, yet struggle with knowledge misalignment in knowledge-intensive dialogue tasks, manifested as knowledge scarcity and misuse. While existing solutions often rely on external knowledge bases or labor-intensive prompt engineering, they face limitations in scalability, generalization, and computational efficiency. To address these challenges, this paper introduces two Knowledge Induction (KI) strategies: Explicit Knowledge Induction (EKI) and Implicit Knowledge Induction (IKI), designed to systematically mine and leverage the internal knowledge of LLMs without external retrieval. EKI employs a structured two-phase prompting mechanism to elicit and apply explicit knowledge, while IKI integrates a knowledge-grounded Chain-of-Thought (K-CoT) to guide response generation through an implicit reasoning pathway. Both strategies enhance the model’s self-awareness of its knowledge reservoir and improve factual grounding through constrained generation. We evaluate our methods across multiple LLMs including GPT-4, LLaMA3 and ChatGLM3 on four dialogue benchmarks. Results show that KI strategies significantly outperform strong prompting baselines and closely approximate the performance of retrieval-augmented generation (RAG) systems, while reducing inference latency by up to 50%. Notably, a fine-tuned ChatGLM3 with KI achieves performance comparable to LLaMA3-70B. Additional analyses confirm that our approach also reduces hallucination rate and improves general truthfulness, demonstrating its potential for building efficient and reliable knowledge-driven dialogue systems.
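The two-phase "elicit knowledge, then answer with it" pattern that EKI describes can be sketched as follows; llm_generate is a hypothetical stand-in for whichever chat-completion call is available, and the prompt wording is illustrative rather than the paper's templates.

```python
# Hedged sketch of two-phase explicit knowledge induction for dialogue.
from typing import Callable

def eki_respond(dialogue_context: str, llm_generate: Callable[[str], str]) -> str:
    # Phase 1: elicit the model's own knowledge relevant to this dialogue.
    knowledge = llm_generate(
        "List the background facts you know that are relevant to the "
        f"following dialogue:\n{dialogue_context}\nFacts:"
    )
    # Phase 2: generate the response grounded in the elicited knowledge.
    return llm_generate(
        f"Dialogue:\n{dialogue_context}\n"
        f"Relevant knowledge:\n{knowledge}\n"
        "Using only the knowledge above where it applies, reply to the last turn:"
    )
```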
Citations: 0
VOCAL-denoiser: A novel focal-based Unet for a robust speech denoising
IF 3.4 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-19 | DOI: 10.1016/j.csl.2025.101904
Mohammed M. Nasef, Mohammed M. Nabil, Amr M. Sauber
Speech, a powerful information source for insights into language, emotion, and health, is often marred by noise, which hinders analysis and limits its potential. By removing unwanted sounds and boosting intelligibility, speech denoising paves the way for enhanced human-computer interaction and language processing. To overcome the challenges facing speech denoising, VOCAL*-Denoiser is proposed, a causal focal-based speech denoising model that simulates the human hearing system in distinguishing speech from noise. The proposed model consists of four components: encoder, bottleneck, decoder, and refinement. Magnitude spectrograms were employed as the model's input features. To enhance the model's generalization power and overcome the dataset shortage problem, a new dataset has been synthesized incorporating multiple languages along with various noise types, including extreme ones. Additionally, to mimic real-world noises, the dataset blends up to five overlapping noises at different Signal-to-Noise Ratios (SNRs). Experimental results prove that the proposed model generalizes well to diverse unseen noises with extreme SNR values and across multiple languages. Furthermore, the proposed model outputs very high-quality speech, demonstrating superior speech quality and intelligibility. Performance was validated using objective metrics as well as composite metrics that approximate the Mean Opinion Score. These evaluations confirm the model's ability to outperform other models in delivering robust speech denoising under challenging noise conditions.
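The noise-blending step can be illustrated with a minimal sketch of mixing clean speech with a noise signal at a target SNR; the actual dataset pipeline, languages, and SNR ranges are not specified here, so the function below is a generic assumption rather than the authors' tooling.

```python
# Scale `noise` so the speech-to-noise power ratio matches the requested SNR.
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    noise = np.resize(noise, speech.shape)            # loop/trim noise to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)                    # 1 s of 16 kHz "speech"
noisy = mix_at_snr(clean, rng.standard_normal(8000), snr_db=0.0)
```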
语音是语言、情感和健康等洞察力的强大信息源,但它经常被噪音所破坏,阻碍了分析,限制了其潜力。通过去除不需要的声音和提高可理解性,语音去噪为增强人机交互和语言处理铺平了道路。为了克服语音去噪面临的挑战,提出了基于因果焦点的语音去噪模型VOCAL*-Denoiser,该模型模拟了人类听觉系统区分语音和噪声的过程。该模型由四个部分组成:编码器、瓶颈、解码器和细化。采用幅度谱图作为模型输入特征。为了提高模型的泛化能力,克服数据集短缺的问题,本文合成了一个包含多种语言和极端噪声类型的新数据集。此外,为了模拟现实世界的噪声,该数据集以不同的信噪比(SNRs)混合了多达五个重叠的噪声。实验结果表明,该模型可以很好地泛化各种具有极端信噪比的不可见噪声,并且可以跨多种语言。此外,该模型输出了非常高质量的语音,表现出优异的语音质量和可理解性。使用客观指标和综合指标来近似平均意见得分来验证性能。这些评估证实了该模型在具有挑战性的噪声条件下提供鲁棒语音去噪方面优于其他模型的能力。
{"title":"VOCAL-denoiser: A novel focal-based Unet for a robust speech denoising","authors":"Mohammed M. Nasef ,&nbsp;Mohammed M. Nabil ,&nbsp;Amr M. Sauber","doi":"10.1016/j.csl.2025.101904","DOIUrl":"10.1016/j.csl.2025.101904","url":null,"abstract":"<div><div>Speech, a powerful information source for insights like language, emotion, and health, is often marred by noise, hindering analysis, and limiting its potential. By removing unwanted sounds and boosting intelligibility, speech denoising paves the way for enhanced human-computer interaction and language processing. To overcome the challenges facing speech denoising, VOCAL*-Denoiser is proposed, which is a causal focal-based speech denoising model simulating the humans’ hearing system in distinguishing speech from noise. The proposed model consists of four components: encoder, bottleneck, decoder, and refinement. Magnitude spectrograms were employed as the proposed model input features. To enhance the proposed model generalization power and overcome dataset shortage problem, a new dataset has been synthesized incorporating multilinguals along with various noise types including extreme ones. Additionally, to mimic real-world noises, the dataset blends up to five overlapping noises at different Signal-to-Noise Ratios (SNRs). Experimental results proves that the proposed model generalizes well to diverse unseen noises with extreme SNR values and across multiple languages. Furthermore, the proposed model outputs very high-quality speeches, demonstrating superior speech quality and intelligibility. Performance was validated using objective metrics as well as composite metrics to approximate Mean Opinion Score. These evaluations confirm the model’s ability to outperform other models in delivering robust speech denoising under challenging noise conditions.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"98 ","pages":"Article 101904"},"PeriodicalIF":3.4,"publicationDate":"2025-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145625010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0