Audiovisual speech enhancement and voice activity detection using generative and regressive visual features
Pub Date: 2025-12-14 | DOI: 10.1016/j.csl.2025.101924
Cheng Yu, Vahid Ahmadi Kalkhorani, Buye Xu, DeLiang Wang
We present an audiovisual speech enhancement (AVSE) system to address two related tasks: speech enhancement (SE) and voice activity detection (VAD). The system is based on a complex spectral mapping model and performs two-stage audiovisual fusion. The first stage is a signal-level fusion module, where a generative lip-to-speech conversion method produces time-frequency (T-F) features from lip movements. This allows the system to leverage noise-free T-F representations, which are crucial for improving speech intelligibility, particularly in challenging acoustic environments. The second stage is an embedding-level fusion module, where high-dimensional embedding features from a jointly trained visual encoder are integrated. Additionally, we propose a multitask learning framework that optimizes both SE and VAD tasks. The inclusion of a VAD decoder enables the system to distinguish speech from non-speech segments. We evaluate the system on multiple benchmark datasets, including COG-MHEAR, LRS3-AudioSet, and LRS3-CHiME3, achieving state-of-the-art SE and speech recognition results and substantially more robust VAD than the audio-only baseline. These results highlight the effectiveness of our system in realistic environments.
{"title":"Audiovisual speech enhancement and voice activity detection using generative and regressive visual features","authors":"Cheng Yu , Vahid Ahmadi Kalkhorani , Buye Xu , DeLiang Wang","doi":"10.1016/j.csl.2025.101924","DOIUrl":"10.1016/j.csl.2025.101924","url":null,"abstract":"<div><div>We present an audiovisual speech enhancement (AVSE) system to address two related tasks: speech enhancement (SE) and voice activity detection (VAD). The system is based on a complex spectral mapping model and performs two-stage audiovisual fusion. The first stage is a signal-level fusion module, where a generative lip-to-speech conversion method produces time-frequency (T-F) features from lip movements. This allows the system to leverage noise-free T-F representations, which are crucial for improving speech intelligibility, particularly in challenging acoustic environments. The second stage is an embedding-level fusion module, where high-dimensional embedding features from a jointly trained visual encoder are integrated. Additionally, we propose a multitask learning framework that optimizes both SE and VAD tasks. The inclusion of a VAD decoder enables the system to distinguish speech from non-speech segments. We evaluate the system on multiple benchmark datasets, including COG-MHEAR, LRS3-AudioSet, and LRS3-CHiME3, and achieve state-of-the-art SE and speech recognition results, and significant robustness in VAD compared to the audio-only baseline. These results highlight the effectiveness of our system in realistic environments.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"99 ","pages":"Article 101924"},"PeriodicalIF":3.4,"publicationDate":"2025-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145791094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Survey of end-to-end multi-speaker automatic speech recognition for monaural audio
Pub Date: 2025-12-11 | DOI: 10.1016/j.csl.2025.101925
Xinlu He, Jacob Whitehill
Monaural multi-speaker automatic speech recognition (ASR) remains challenging due to data scarcity and the intrinsic difficulty of recognizing and attributing words to individual speakers, particularly in overlapping speech. Recent advances have driven the shift from cascade systems to end-to-end (E2E) architectures, which reduce error propagation and better exploit the synergy between speech content and speaker identity. Despite rapid progress in E2E multi-speaker ASR, the field lacks a comprehensive review of recent developments. This survey provides a systematic taxonomy of E2E neural approaches for multi-speaker ASR, highlighting recent advances and comparative analysis. Specifically, we analyze: (1) architectural paradigms (single-input-multiple-output (SIMO) vs. single-input-single-output (SISO)) for pre-segmented audio, analyzing their distinct characteristics and trade-offs; (2) recent architectural and algorithmic improvements based on these two paradigms, including multi-modal inputs; (3) extensions to long-form speech, including segmentation strategy and speaker-consistent hypothesis stitching. Further, we (4) evaluate and compare methods across standard benchmarks. We conclude with a discussion of open challenges and future research directions towards building robust and scalable multi-speaker ASR.
{"title":"Survey of end-to-end multi-speaker automatic speech recognition for monaural audio","authors":"Xinlu He, Jacob Whitehill","doi":"10.1016/j.csl.2025.101925","DOIUrl":"10.1016/j.csl.2025.101925","url":null,"abstract":"<div><div>Monaural multi-speaker automatic speech recognition (ASR) remains challenging due to data scarcity and the intrinsic difficulty of recognizing and attributing words to individual speakers, particularly in overlapping speech. Recent advances have driven the shift from cascade systems to end-to-end (E2E) architectures, which reduce error propagation and better exploit the synergy between speech content and speaker identity. Despite rapid progress in E2E multi-speaker ASR, the field lacks a comprehensive review of recent developments. This survey provides a systematic taxonomy of E2E neural approaches for multi-speaker ASR, highlighting recent advances and comparative analysis. Specifically, we analyze: (1) architectural paradigms (single-input-multiple-output (SIMO) vs. single-input-single-output (SISO)) for pre-segmented audio, analyzing their distinct characteristics and trade-offs; (2) recent architectural and algorithmic improvements based on these two paradigms, including multi-modal inputs; (3) extensions to long-form speech, including segmentation strategy and speaker-consistent hypothesis stitching. Further, we (4) evaluate and compare methods across standard benchmarks. We conclude with a discussion of open challenges and future research directions towards building robust and scalable multi-speaker ASR.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"99 ","pages":"Article 101925"},"PeriodicalIF":3.4,"publicationDate":"2025-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145791095","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enhanced audio-visual speech enhancement with posterior sampling methods in recurrent variational autoencoders
Pub Date: 2025-12-06 | DOI: 10.1016/j.csl.2025.101923
Z. Foroushi, R.M. Dansereau
Recovering intelligible speech in noise is essential for robust communication. This work presents an audio-visual speech enhancement framework based on a Recurrent Variational Autoencoder (AV-RVAE), where posterior inference is extended using sampling-based methods including the Metropolis-Adjusted Langevin Algorithm (MALA), Langevin Dynamics EM (LDEM), Hamiltonian Monte Carlo (HMC), Barker sampling, and a hybrid MALA+Barker variant. To isolate the contribution of visual cues, an audio-only baseline (A-RVAE) is trained and evaluated under identical data and inference conditions.
Performance is assessed using Scale-Invariant Signal-to-Distortion Ratio (SI-SDR), Perceptual Evaluation of Speech Quality (PESQ), and Short-Time Objective Intelligibility (STOI), along with anytime convergence curves (metric versus wall-clock time) and the Real-Time Factor (RTF; ratio of runtime to audio duration) to measure computational efficiency.
Experimental results show that the hybrid MALA+Barker sampler achieves the best overall performance; while LDEM and step-size-optimized MALA exhibit the lowest RTFs, the MALA+Barker sampler offers the most favorable balance between efficiency and enhancement quality. Across all sampling strategies, the AV-RVAE consistently surpasses the audio-only baseline, particularly at low SNRs, confirming the benefit of visual fusion combined with advanced posterior sampling for robust speech enhancement in challenging acoustic environments.
{"title":"Enhanced audio-visual speech enhancement with posterior sampling methods in recurrent variational autoencoders","authors":"Z. Foroushi, R.M. Dansereau","doi":"10.1016/j.csl.2025.101923","DOIUrl":"10.1016/j.csl.2025.101923","url":null,"abstract":"<div><div>Recovering intelligible speech in noise is essential for robust communication. This work presents an audio-visual speech enhancement framework based on a Recurrent Variational Autoencoder (AV-RVAE), where posterior inference is extended using sampling-based methods including the Metropolis-Adjusted Langevin Algorithm (MALA), Langevin Dynamics EM (LDEM), Hamiltonian Monte Carlo (HMC), Barker sampling, and a hybrid MALA+Barker variant. To isolate the contribution of visual cues, an audio-only baseline (A-RVAE) is trained and evaluated under identical data and inference conditions.</div><div>Performance is assessed using Scale-Invariant Signal-to-Distortion Ratio (SI-SDR), Perceptual Evaluation of Speech Quality (PESQ), and Short-Time Objective Intelligibility (STOI), along with anytime convergence curves (metric versus wall-clock time) and the Real-Time Factor (RTF; ratio of runtime to audio duration) to measure computational efficiency.</div><div>Experimental results show that the hybrid MALA+Barker sampler achieves the best overall performance, while LDEM and step-size-optimized MALA exhibit the lowest RTFs, the MALA+Barker sampler offers the most favorable balance between efficiency and enhancement quality. Across all sampling strategies, the AV-RVAE consistently surpasses the audio-only baseline, particularly at low SNRs, confirming the benefit of visual fusion combined with advanced posterior sampling for robust speech enhancement in challenging acoustic environments.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"99 ","pages":"Article 101923"},"PeriodicalIF":3.4,"publicationDate":"2025-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145737567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Do modern speech LLMs and re-scoring techniques improve bilingual ASR performance for Basque and Spanish in domain-specific contexts?
Pub Date: 2025-11-27 | DOI: 10.1016/j.csl.2025.101905
Ander González-Docasal, Juan Camilo Vásquez-Correa, Haritz Arzelus, Aitor Álvarez, Santiago A. Moreno-Acevedo
This paper presents an extended evaluation of Vicomtech’s automatic speech recognition (ASR) systems developed for the Albayzín 2024 Bilingual Basque-Spanish Speech-to-Text (BBS-S2T) Challenge, a task focused on transcribing bilingual parliamentary recordings featuring frequent intra- and inter-sentential code-switching between Basque and Spanish. These recordings, drawn from Basque Parliament plenary sessions, pose significant challenges due to the abrupt language alternations, the limited availability of digital resources for Basque, and the absence of contextual and speaker information. The study incorporates additional analysis of state-of-the-art ASR architectures, namely Phi4-multimodal and CrisperWhisper, fine-tuned on the challenge dataset. Furthermore, the systems were evaluated on a complementary benchmark to assess model robustness. A detailed comparison of automatic hypothesis selection techniques, including both traditional n-gram and large language model (LLM)-based approaches, is also provided. Results demonstrate that optimal word error rate (WER) does not always correlate with the most accurate transcriptions, highlighting the complexity of evaluating ASR performance in code-switching scenarios.
{"title":"Do modern speech LLMs and re-scoring techniques improve bilingual ASR performance for Basque and Spanish in domain-specific contexts?","authors":"Ander González-Docasal , Juan Camilo Vásquez-Correa , Haritz Arzelus , Aitor Álvarez , Santiago A. Moreno-Acevedo","doi":"10.1016/j.csl.2025.101905","DOIUrl":"10.1016/j.csl.2025.101905","url":null,"abstract":"<div><div>This paper presents an extended evaluation of Vicomtech’s automatic speech recognition (ASR) systems developed for the Albayzín 2024 Bilingual Basque-Spanish Speech-to-Text (BBS-S2T) Challenge, a task focused on transcribing bilingual parliamentary recordings featuring frequent intra- and inter-sentential code-switching between Basque and Spanish. These recordings, drawn from Basque Parliament plenary sessions, pose significant challenges due to the abrupt language alternations, the limited availability of digital resources for Basque, and the absence of contextual and speaker information. The study incorporates additional analysis of state-of-the-art ASR architectures, namely Phi4-multimodal and CrisperWhisper, fine-tuned on the challenge dataset. Furthermore, the systems were evaluated on a complementary benchmark to assess model robustness. A detailed comparison of automatic hypothesis selection techniques, including both traditional <span><math><mi>n</mi></math></span>-gram and large language model (LLM)-based approaches, is also provided. Results demonstrate that optimal word error rate (WER) does not always correlate with the most accurate transcriptions, highlighting the complexity of evaluating ASR performance in code-switching scenarios.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"99 ","pages":"Article 101905"},"PeriodicalIF":3.4,"publicationDate":"2025-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145645755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Keyword Mamba: Spoken keyword spotting with state space models
Pub Date: 2025-11-27 | DOI: 10.1016/j.csl.2025.101909
Hanyu Ding, Wenlong Dong, Qirong Mao
Keyword spotting (KWS) is an essential task in speech processing. It is widely used in voice assistants and smart devices. Deep learning models like CNNs, RNNs, and Transformers have performed well in KWS. However, they often struggle to handle long-term patterns and stay efficient at the same time. In this work, we present Keyword Mamba, a new architecture for KWS. It uses a neural state space model (SSM) called Mamba. We apply Mamba along the time axis and also explore how it can replace the self-attention part in Transformer models. We test our model on the Google Speech Commands datasets. The results show that Keyword Mamba reaches strong accuracy with fewer parameters and lower computational cost. To our knowledge, this is the first time a state space model has been used for KWS. These results suggest that Mamba has strong potential in speech-related tasks.
{"title":"Keyword Mamba: Spoken keyword spotting with state space models","authors":"Hanyu Ding , Wenlong Dong , Qirong Mao","doi":"10.1016/j.csl.2025.101909","DOIUrl":"10.1016/j.csl.2025.101909","url":null,"abstract":"<div><div>Keyword spotting (KWS) is an essential task in speech processing. It is widely used in voice assistants and smart devices. Deep learning models like CNNs, RNNs, and Transformers have performed well in KWS. However, they often struggle to handle long-term patterns and stay efficient at the same time. In this work, we present Keyword Mamba, a new architecture for KWS. It uses a neural state space model (SSM) called Mamba. We apply Mamba along the time axis and also explore how it can replace the self-attention part in Transformer models. We test our model on the Google Speech Commands datasets. The results show that Keyword Mamba reaches strong accuracy with fewer parameters and lower computational cost. To our knowledge, this is the first time a state space model has been used for KWS. These results suggest that Mamba has strong potential in speech-related tasks.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"99 ","pages":"Article 101909"},"PeriodicalIF":3.4,"publicationDate":"2025-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145685246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MS-Swinformer and DMTL: Multi-scale spatial fusion and dynamic multi-task learning for speech emotion recognition
Pub Date: 2025-11-26 | DOI: 10.1016/j.csl.2025.101908
Defu Lan, Hai Cheng
Speech is a vital medium for communication and emotional expression, often embedding rich affective information in human interactions. Effectively uncovering and leveraging such emotional cues holds significant potential across domains such as mental health, education, and automotive safety. However, existing methods often suffer from incomplete audio feature extraction and imbalanced feature utilization. To address these challenges, this paper proposes a novel Speech Emotion Recognition (SER) framework based on Multi-Scale Spatial Fusion using Swin-Transformer (MS-Swinformer) and Dynamic Multi-Task Learning (DMTL). Specifically, we first design a multi-scale feature extraction module that captures localized patterns in both frequency and temporal dimensions via convolutional kernels of varying sizes. Next, we enhance the Swin-Transformer architecture by incorporating an adaptive window attention mechanism, which effectively models the hierarchical feature dependencies in long-duration speech signals, thereby improving the perception of both local and global contextual information. In addition, we introduce a dynamic multi-task learning strategy that jointly optimizes high-level semantic features extracted via Wav2Vec2 and low-level acoustic features derived from MFCCs. By dynamically adjusting task weights during training, our approach enables optimal fusion of multi-source information and mitigates the problem of feature utilization imbalance. Extensive experiments on the IEMOCAP and CASIA datasets demonstrate that our model achieves highly competitive performance compared to existing state-of-the-art methods.
{"title":"MS-Swinformer and DMTL: Multi-scale spatial fusion and dynamic multi-task learning for speech emotion recognition","authors":"Defu Lan, Hai Cheng","doi":"10.1016/j.csl.2025.101908","DOIUrl":"10.1016/j.csl.2025.101908","url":null,"abstract":"<div><div>Speech is a vital medium for communication and emotional expression, often embedding rich affective information in human interactions. Effectively uncovering and leveraging such emotional cues holds significant potential across domains such as mental health, education, and automotive safety. However, existing methods often suffer from incomplete audio feature extraction and imbalanced feature utilization. To address these challenges, this paper proposes a novel Speech Emotion Recognition (SER) framework based on Multi-Scale Spatial Fusion using Swin-Transformer (MS-Swinformer) and Dynamic Multi-Task Learning (DMTL). Specifically, we first design a multi-scale feature extraction module that captures localized patterns in both frequency and temporal dimensions via convolutional kernels of varying sizes. Next, we enhance the Swin-Transformer architecture by incorporating an adaptive window attention mechanism, which effectively models the hierarchical feature dependencies in long-duration speech signals, thereby improving the perception of both local and global contextual information. In addition, we introduce a dynamic multi-task learning strategy that jointly optimizes high-level semantic features extracted via Wav2Vec2 and low-level acoustic features derived from MFCCs. By dynamically adjusting task weights during training, our approach enables optimal fusion of multi-source information and mitigates the problem of feature utilization imbalance. Extensive experiments on the IEMOCAP and CASIA datasets demonstrate that our model achieves highly competitive performance compared to existing state-of-the-art methods.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"99 ","pages":"Article 101908"},"PeriodicalIF":3.4,"publicationDate":"2025-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145737568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Toward robust replay attack detection in Automatic Speaker Verification: A study of spectrum estimation and channel magnitude response modeling
Pub Date: 2025-11-26 | DOI: 10.1016/j.csl.2025.101906
Şule Bekiryazıcı, Cemal Hanilçi, Neyir Ozcan
Automatic Speaker Verification (ASV) systems are increasingly adopted for biometric authentication but remain highly vulnerable to spoofing, particularly replay attacks. Existing countermeasures (CMs) for replay attack detection rely predominantly on discrete Fourier transform (DFT)-based spectral features, which are sensitive to noise and channel distortions common in physical access (PA) scenarios. This work presents the first comprehensive study of Channel Magnitude Response (CMR) representations for replay detection, explicitly analyzing the impact of spectrum estimation and feature design. The contributions of this work are fourfold: (i) CMR estimation is generalized beyond MFCCs to LFCC and CQCC features, with LFCC-based CMRs offering superior discrimination; (ii) alternative spectrum estimators, linear prediction (LP) and multitaper (MT), are integrated into the CMR pipeline, yielding substantial gains over the conventional DFT; (iii) robustness is investigated under silence-free (voiced-only) conditions, mitigating known biases in ASVspoof datasets; and (iv) a systematic evaluation of CMR is provided on the recently released ReplayDF corpus, a challenging benchmark combining replay and synthetic speech variability. Experiments on ASVspoof 2017, 2019, 2021, and ReplayDF using both baseline classifiers (ResNet18 and LCNN) and stronger models (Res2Net50 and SE-Res2Net50) show that the proposed approach consistently outperforms conventional features. In particular, LFCC–CMR features with LP spectra achieve an Equal Error Rate (EER) as low as 1.34% on ASVspoof 2019 (PA), representing considerable relative improvements over traditional methods. Moreover, CMR-based systems retain high performance even when silent segments are removed, unlike conventional approaches. These results establish CMR with principled spectral modeling as a robust and generalizable framework for replay attack detection, opening new directions for resilient spoofing countermeasures.
{"title":"Toward robust replay attack detection in Automatic Speaker Verification: A study of spectrum estimation and channel magnitude response modeling","authors":"Şule Bekiryazıcı , Cemal Hanilçi , Neyir Ozcan","doi":"10.1016/j.csl.2025.101906","DOIUrl":"10.1016/j.csl.2025.101906","url":null,"abstract":"<div><div>Automatic Speaker Verification (ASV) systems are increasingly adopted for biometric authentication but remain highly vulnerable to spoofing, particularly replay attacks. Existing countermeasures (CMs) for replay attack detection rely predominantly on discrete Fourier transform (DFT)-based spectral features, which are sensitive to noise and channel distortions common in physical access (PA) scenarios. This work presents the first comprehensive study of Channel Magnitude Response (CMR) representations for replay detection, explicitly analyzing the impact of spectrum estimation and feature design. The contribution of this work are fourfold: (i) CMR estimation is generalized beyond MFCCs to LFCC and CQCC features, with LFCC-based CMRs offering superior discrimination; (ii) alternative spectrum estimators — linear prediction (LP) and multitaper (MT) — are integrated into the CMR pipeline, yielding substantial gains over conventional DFT (iii) robustness is investigated under silence-free (voiced-only) conditions, mitigating known biases in ASVspoof datasets and (iv) a systematic evaluation of CMR is provided on the recently released ReplayDF corpus, a challenging benchmark combining replay and synthetic speech variability. Experiments on ASVspoof 2017, 2019, 2021, and ReplayDF using both baseline classifiers (ResNet18 and LCNN) and stronger models (Res2Net50 and SE-Res2Net50) show that the proposed approach consistently outperforms conventional features. Particularly, LFCC–CMR features with LP spectra achieve an Equal Error Rate (EER) as low as 1.34% on ASVspoof 2019 (PA), representing considerable relative improvements over traditional methods. Moreover, CMR-based systems retain high performance even when silent segments are removed, unlike conventional approaches. These results establish CMR with principled spectral modeling as a robust and generalizable framework for replay attack detection, opening new directions for resilient spoofing countermeasures.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"98 ","pages":"Article 101906"},"PeriodicalIF":3.4,"publicationDate":"2025-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145624974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A robust framework for noisy speech recognition using Frequency-Guided-Swin Transformer
Pub Date: 2025-11-24 | DOI: 10.1016/j.csl.2025.101907
Noussaiba Djeffal, Djamel Addou, Hamza Kheddar, Sid Ahmed Selouani
Conventional automatic speech recognition (ASR) systems often struggle to generalize across diverse and noisy environments, where background interference significantly degrades recognition accuracy. This work presents a novel approach to noisy speech recognition that combines convolutional neural networks (CNN) and a Swin Transformer with frequency-guided multi-head self-attention (FG-MSA). The proposed method addresses the challenge of recognizing speech in noisy environments, focusing on character-level transcription from noisy audio. The CNN efficiently extracts localized features, while the Swin Transformer, with its hierarchical structure and shifted window mechanism, captures both local and long-range dependencies. The FG-MSA mechanism guides attention toward the frequency components most relevant for speech recognition, improving robustness in noisy conditions. The proposed method improves ASR performance and efficiency, especially in noisy environments, and is evaluated on the Aurora-2 dataset and the noisy speech commands (NSC) dataset. The proposed CNN-FG-Swin Transformer achieved an average accuracy of 87.19% on the isolated Aurora-2 dataset, outperforming the baseline Swin Transformer by 2.49%. Across all datasets, the proposed model achieved an average accuracy of 87.01%, outperforming all compared state-of-the-art methods. On the NSC dataset at -9 dB, it achieved a word error rate (WER) of 36.20%, outperforming end-to-end capsule network models by 8% as well as the DNN (38.63%) and LSTM (69.09%) baselines, confirming its robustness in real-world conditions.
{"title":"A robust framework for noisy speech recognition using Frequency-Guided-Swin Transformer","authors":"Noussaiba Djeffal , Djamel Addou , Hamza Kheddar , Sid Ahmed Selouani","doi":"10.1016/j.csl.2025.101907","DOIUrl":"10.1016/j.csl.2025.101907","url":null,"abstract":"<div><div>Conventional automatic speech recognition (ASR) systems often struggle to generalize across diverse and noisy environments, where background interference significantly degrades recognition accuracy. This work presents a novel approach to noisy speech recognition by combining convolutional neural networks (CNN) and Swin Transformer with frequency-guided multi-head self-Attention (FG-MSA) architectures. The proposed method addresses the challenge of recognizing speech in noisy environments, focusing on character-level transcription from noisy audio. The CNN efficiently extracts localized features, while the Swin Transformer, with its hierarchical structure and shifted window mechanism, captures both local and long-range dependencies. The FG-MSA mechanism is introduced to guide the attention mechanism toward frequency components that are most relevant for speech recognition, improving robustness in noisy conditions. The proposed method improves performance and efficiency for ASR, especially in noisy environments. Evaluated on the Aurora-2 dataset, and the noisy speech commands (NSC) dataset. The proposed CNN-FG-Swin Transformer achieved an average accuracy of 87.19% on the isolated Aurora-2 dataset, outperforming the baseline Swin Transformer by 2.49%. For all datasets, the proposed model achieved an average accuracy of 87.01%, outperforming all the compared state-of-the-art methods. On the NSC dataset at -9 dB, it achieved a word error rate (WER) of 36.20%, outperforming the end-to-end capsule network models by 8%, both DNN 38.63% and LSTM 69.09% baselines, confirming its robustness in real-world conditions.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"98 ","pages":"Article 101907"},"PeriodicalIF":3.4,"publicationDate":"2025-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145625011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Using Knowledge Induction strategies: LLMs can do better in knowledge-driven dialogue tasks
Pub Date: 2025-11-19 | DOI: 10.1016/j.csl.2025.101903
Sisi Peng, Wenlin Zhang, Hao Zhang, Shunhang Li, Dan Qu
Large language models (LLMs) encode vast knowledge through pre-training, yet struggle with knowledge misalignment in knowledge-intensive dialogue tasks, manifested as knowledge scarcity and misuse. While existing solutions often rely on external knowledge bases or labor-intensive prompt engineering, they face limitations in scalability, generalization, and computational efficiency. To address these challenges, this paper introduces two Knowledge Induction (KI) strategies: Explicit Knowledge Induction (EKI) and Implicit Knowledge Induction (IKI), designed to systematically mine and leverage the internal knowledge of LLMs without external retrieval. EKI employs a structured two-phase prompting mechanism to elicit and apply explicit knowledge, while IKI integrates a knowledge-grounded Chain-of-Thought (K-CoT) to guide response generation through an implicit reasoning pathway. Both strategies enhance the model’s self-awareness of its knowledge reservoir and improve factual grounding through constrained generation. We evaluate our methods across multiple LLMs including GPT-4, LLaMA3 and ChatGLM3 on four dialogue benchmarks. Results show that KI strategies significantly outperform strong prompting baselines and closely approximate the performance of retrieval-augmented generation (RAG) systems, while reducing inference latency by up to 50%. Notably, a fine-tuned ChatGLM3 with KI achieves performance comparable to LLaMA3-70B. Additional analyses confirm that our approach also reduces hallucination rate and improves general truthfulness, demonstrating its potential for building efficient and reliable knowledge-driven dialogue systems.
{"title":"Using Knowledge Induction strategies: LLMs can do better in knowledge-driven dialogue tasks","authors":"Sisi Peng , Wenlin Zhang , Hao Zhang , Shunhang Li , Dan Qu","doi":"10.1016/j.csl.2025.101903","DOIUrl":"10.1016/j.csl.2025.101903","url":null,"abstract":"<div><div>Large language models (LLMs) encode vast knowledge through pre-training, yet struggle with knowledge misalignment in knowledge-intensive dialogue tasks, manifested as knowledge scarcity and misuse. While existing solutions often rely on external knowledge bases or labor-intensive prompt engineering, they face limitations in scalability, generalization, and computational efficiency. To address these challenges, this paper introduces two Knowledge Induction (KI) strategies: Explicit Knowledge Induction (EKI) and Implicit Knowledge Induction (IKI), designed to systematically mine and leverage the internal knowledge of LLMs without external retrieval. EKI employs a structured two-phase prompting mechanism to elicit and apply explicit knowledge, while IKI integrates a knowledge-grounded Chain-of-Thought (K-CoT) to guide response generation through an implicit reasoning pathway. Both strategies enhance the model’s self-awareness of its knowledge reservoir and improve factual grounding through constrained generation. We evaluate our methods across multiple LLMs including GPT-4, LLaMA3 and ChatGLM3 on four dialogue benchmarks. Results show that KI strategies significantly outperform strong prompting baselines and closely approximate the performance of retrieval-augmented generation (RAG) systems, while reducing inference latency by up to 50%. Notably, a fine-tuned ChatGLM3 with KI achieves performance comparable to LLaMA3-70B. Additional analyses confirm that our approach also reduces hallucination rate and improves general truthfulness, demonstrating its potential for building efficient and reliable knowledge-driven dialogue systems.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"98 ","pages":"Article 101903"},"PeriodicalIF":3.4,"publicationDate":"2025-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145555164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
VOCAL-denoiser: A novel focal-based Unet for a robust speech denoising
Pub Date: 2025-11-19 | DOI: 10.1016/j.csl.2025.101904
Mohammed M. Nasef, Mohammed M. Nabil, Amr M. Sauber
Speech, a powerful source of information about language, emotion, and health, is often marred by noise, which hinders analysis and limits its potential. By removing unwanted sounds and boosting intelligibility, speech denoising paves the way for enhanced human-computer interaction and language processing. To overcome the challenges facing speech denoising, VOCAL-Denoiser is proposed: a causal, focal-based speech denoising model that simulates how the human hearing system distinguishes speech from noise. The proposed model consists of four components: encoder, bottleneck, decoder, and refinement. Magnitude spectrograms are employed as the model's input features. To enhance the model's generalization power and overcome the data shortage problem, a new dataset was synthesized incorporating multiple languages along with various noise types, including extreme ones. Additionally, to mimic real-world noise, the dataset blends up to five overlapping noises at different Signal-to-Noise Ratios (SNRs). Experimental results show that the proposed model generalizes well to diverse unseen noises with extreme SNR values and across multiple languages. Furthermore, the model produces very high-quality output, demonstrating superior speech quality and intelligibility. Performance was validated using objective metrics as well as composite metrics that approximate the Mean Opinion Score. These evaluations confirm the model's ability to outperform other models in delivering robust speech denoising under challenging noise conditions.
{"title":"VOCAL-denoiser: A novel focal-based Unet for a robust speech denoising","authors":"Mohammed M. Nasef , Mohammed M. Nabil , Amr M. Sauber","doi":"10.1016/j.csl.2025.101904","DOIUrl":"10.1016/j.csl.2025.101904","url":null,"abstract":"<div><div>Speech, a powerful information source for insights like language, emotion, and health, is often marred by noise, hindering analysis, and limiting its potential. By removing unwanted sounds and boosting intelligibility, speech denoising paves the way for enhanced human-computer interaction and language processing. To overcome the challenges facing speech denoising, VOCAL*-Denoiser is proposed, which is a causal focal-based speech denoising model simulating the humans’ hearing system in distinguishing speech from noise. The proposed model consists of four components: encoder, bottleneck, decoder, and refinement. Magnitude spectrograms were employed as the proposed model input features. To enhance the proposed model generalization power and overcome dataset shortage problem, a new dataset has been synthesized incorporating multilinguals along with various noise types including extreme ones. Additionally, to mimic real-world noises, the dataset blends up to five overlapping noises at different Signal-to-Noise Ratios (SNRs). Experimental results proves that the proposed model generalizes well to diverse unseen noises with extreme SNR values and across multiple languages. Furthermore, the proposed model outputs very high-quality speeches, demonstrating superior speech quality and intelligibility. Performance was validated using objective metrics as well as composite metrics to approximate Mean Opinion Score. These evaluations confirm the model’s ability to outperform other models in delivering robust speech denoising under challenging noise conditions.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"98 ","pages":"Article 101904"},"PeriodicalIF":3.4,"publicationDate":"2025-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145625010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}