
Latest articles in Computer Speech and Language

Entrainment detection using DNN
IF 3.4 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-12-26 · DOI: 10.1016/j.csl.2025.101930
Jay Kejriwal, Štefan Beňuš, Lina M. Rojas-Barahona
During conversation, speakers adjust their linguistic characteristics to become more similar to their partners. This complex phenomenon is known as entrainment, and speakers dynamically entrain as well as disentrain on different linguistic features. Researchers have utilized a range of computational methods to explore entrainment. Recent technological advancements have facilitated the use of deep learning, which offers a systematic quantification of acoustic entrainment dynamics. In this study, we investigate the capability of deep learning architectures to extract and leverage textual features for the efficient representation and learning of entrainment. By adjusting the architecture of an acoustic-based DNN entrainment model, we present an unsupervised deep learning framework that derives representations from textual features containing relevant information for identifying entrainment at three linguistic levels: lexical, syntactic, and semantic. To investigate the performance of each model within the proposed framework, various text-based and speech features were extracted. Entrainment was quantified using different distance measures in the representation space. The performance of the trained models was evaluated by distinguishing real and sham conversations using the proposed distances. Our results suggest that acoustic-based DNN models outperform text-based DNN models and that distance measures affect the models’ performance.
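As a rough, hedged illustration of the evaluation idea described in the abstract (not the authors' code), the sketch below scores entrainment as a cosine distance between partner turn embeddings and checks whether real pairings yield smaller distances than sham (shuffled-partner) pairings; the embedding shapes and the cosine-distance choice are assumptions.

```python
# Hedged sketch: quantify entrainment as a distance in a representation space
# and test whether real conversations show smaller partner distances than
# sham (shuffled-partner) conversations. The per-turn vectors are assumed to
# come from some DNN encoder (toy data here, not the paper's model).
import numpy as np

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

def mean_pair_distance(turns_a, turns_b):
    # Average distance between time-aligned turns of the two speakers.
    return float(np.mean([cosine_distance(a, b) for a, b in zip(turns_a, turns_b)]))

rng = np.random.default_rng(0)
turns_a = rng.normal(size=(20, 128))                   # speaker A turn embeddings
turns_b = turns_a + 0.1 * rng.normal(size=(20, 128))   # entrained partner (toy)

real = mean_pair_distance(turns_a, turns_b)
sham = mean_pair_distance(turns_a, rng.permutation(turns_b))  # shuffled pairing
print(f"real={real:.3f}  sham={sham:.3f}  entrained={real < sham}")
```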
Citations: 0
Speech emotion recognition using multimodal LLMs and quality-controlled TTS-based data augmentation for Iberian languages
IF 3.4 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-12-23 · DOI: 10.1016/j.csl.2025.101927
Jaime Bellver-Soler, Anmol Guragain, Samuel Ramos-Varela, Ricardo Córdoba, Luis Fernando D’Haro
This work proposes the use of multimodal large language models for speech emotion recognition (SER) in low-resource settings for Iberian languages. Given the scarcity of annotated SER data for other languages, we also propose a pipeline for generating high-quality synthetic data using existing emotional text-to-speech (TTS) systems and their voice-cloning capabilities.
Specifically, we design a selective, quality-controlled TTS pipeline combining LLM-ensemble translation with self-verification and expressive voice cloning, followed by automatic ASR-WER, speaker-similarity, and emotion filters. This approach introduces a novel filtering strategy that ensures synthetic data reliability. The resulting data include MSP-MEA, a synthetic Spanish extension of MSP-Podcast.
Building on our previous multimodal SER framework, we compare the use of frozen LLMs as classifiers with an MLP baseline and evaluate classical versus TTS-based augmentation across five corpora (IEMOCAP, MEACorpus, EMS, VERBO, AhoEmo3). The best configuration (W2v-BERT-2 → attentive pooling → frozen Bloomz-7b1) improves mean F1 by 4.9 points over an MLP head baseline. Among augmentation techniques, Mix-up remains the most robust overall, while TTS achieves competitive performance, surpassing traditional data augmentation techniques on EMS and VERBO. These results indicate that carefully filtered TTS data can complement classical perturbations, providing a viable, dataset-dependent strategy for multilingual SER. Code, models, and datasets are publicly released.
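The following is a minimal sketch of the kind of quality gate the abstract describes, keeping a synthetic clip only if it passes ASR-WER, speaker-similarity, and emotion checks; the thresholds and the helper functions `asr_wer`, `speaker_similarity`, and `predicted_emotion` are hypothetical placeholders, not the authors' pipeline.

```python
# Hedged sketch of a quality-controlled filter for synthetic TTS data:
# keep a clip only if ASR agrees with the target text, the cloned voice is
# close to the reference speaker, and the emotion classifier matches the
# intended label. Thresholds and helper functions are illustrative only.
from dataclasses import dataclass

@dataclass
class SyntheticClip:
    audio_path: str
    target_text: str
    reference_speaker: str
    target_emotion: str

def keep_clip(clip, asr_wer, speaker_similarity, predicted_emotion,
              max_wer=0.2, min_spk_sim=0.75):
    """Return True if the synthetic clip passes all three quality filters."""
    if asr_wer(clip.audio_path, clip.target_text) > max_wer:
        return False                        # ASR-WER filter
    if speaker_similarity(clip.audio_path, clip.reference_speaker) < min_spk_sim:
        return False                        # speaker-similarity filter
    return predicted_emotion(clip.audio_path) == clip.target_emotion  # emotion filter
```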
Citations: 0
Artificial protozoa lotus effect algorithm enabled cognitive brain optimal model for sentiment analysis utilizing multimodal data
IF 3.4 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-12-22 · DOI: 10.1016/j.csl.2025.101929
Sanjeevkumar Angadi, Saili Hemant Sable, Tejaswini Zope, Rajani Amol Hemade, Vaibhavi Umesh Avachat
Understanding public sentiment derived from online data is a challenging research problem with numerous applications, including contextual analysis and opinion assessment on specific events. Traditionally, sentiment analysis has concentrated on a single modality, such as text or images. However, utilizing multimodal information such as images, text, and audio can enhance model accuracy. Despite this advantage, combining visual and textual features often leads to decreased performance, mainly because models fail to efficiently capture the intricate relationships among diverse modalities. To address these challenges, a new technique, the Artificial Protozoa Lotus Effect Algorithm_Cognitive Brain Optimal Model (APLEA_CBO), has been developed for sentiment analysis using multimodal data. First, feature extraction is performed on the audio data to obtain the first feature vector (outcome-1). Similarly, feature extraction is conducted on the input text to obtain the second feature vector (outcome-2). Both feature sets are then processed for sentiment analysis using the Cognitive Brain Optimal Model (CBOM), which is built on a Recurrent Denoising Long Short-Term Memory (RD-LSTM) network. The CBOM is trained using the Artificial Protozoa Lotus Effect Algorithm (APLEA), an integration of Artificial Protozoa Optimization (APO) and the Lotus Effect Algorithm (LEA). The APLEA_CBO model achieves an FPR of 7.17%, a recall of 92.76%, a precision of 90.62%, and an accuracy of 90.60%.
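For context, the reported FPR, recall, precision, and accuracy follow the standard confusion-matrix definitions sketched below; the counts used in the example are illustrative and are not taken from the paper.

```python
# Hedged sketch: the reported metrics (FPR, recall, precision, accuracy) from
# confusion-matrix counts. Standard definitions; the counts are illustrative.
def classification_metrics(tp, fp, tn, fn):
    return {
        "fpr": fp / (fp + tn),                       # false positive rate
        "recall": tp / (tp + fn),                    # sensitivity
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

print(classification_metrics(tp=90, fp=9, tn=120, fn=7))
```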
Citations: 0
Leveraging saliency-based pre-trained foundation model representations to uncover breathing patterns in speech
IF 3.4 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-12-20 · DOI: 10.1016/j.csl.2025.101926
Vikramjit Mitra, Anirban Chatterjee, Ke Zhai, Helen Weng, Ayuko Hill, Nicole Hay, Christopher Webb, Jamie Cheng, Erdrin Azemi
Human speech production involves coordinated respiratory action to produce acoustic speech signals. Typically, speech is produced when air is forced from the lungs and modulated by the vocal tract, and these actions are interspersed with moments of inhalation to refill the lungs. Respiratory rate (RR) is a vital metric used to assess an individual's overall health, fitness, and general well-being. Existing approaches to measuring RR (the number of breaths taken per minute) require specialized equipment or training. Studies have demonstrated that machine learning algorithms can estimate RR using bio-sensor signals as input. Speech-based estimation of RR offers an effective way to measure this vital metric without requiring specialized equipment or sensors. This work investigates a machine learning approach to estimating RR from speech segments obtained from subjects speaking into a close-talking microphone. Data were collected from N=26 individuals, where the ground-truth RR was obtained through commercial-grade chest belts and then manually corrected for any errors. A convolutional long short-term memory network (Conv-LSTM) is proposed to estimate respiration time-series data from the speech signal. We demonstrate that pre-trained representations from a foundation model such as Wav2Vec2 can be used to estimate the respiration time series with low root-mean-squared error and high correlation coefficient compared with the baseline. The model-driven time series can be used to estimate RR with a low mean absolute error (MAE) of ≈ 1.6 breaths/min.
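A hedged sketch of the final step described above: once a model has predicted a respiration time series, RR can be estimated by counting inhalation peaks per minute. The synthetic signal, frame rate, and peak-picking settings are assumptions for illustration, not the paper's Conv-LSTM pipeline.

```python
# Hedged sketch: estimate RR (breaths/min) from a predicted respiration
# time-series by counting inhalation peaks. The toy signal and peak-picking
# settings below are illustrative assumptions.
import numpy as np
from scipy.signal import find_peaks

fs = 25.0                                    # frames per second of the predicted series
t = np.arange(0, 60.0, 1.0 / fs)             # one minute of (synthetic) respiration
resp = np.sin(2 * np.pi * (16 / 60.0) * t)   # toy signal at 16 breaths/min
resp += 0.1 * np.random.default_rng(0).normal(size=t.size)

# Require peaks to be at least ~1.5 s apart (caps the rate near 40 breaths/min).
peaks, _ = find_peaks(resp, distance=int(1.5 * fs), prominence=0.3)
rr = len(peaks) * 60.0 / (t[-1] - t[0])
print(f"Estimated RR: {rr:.1f} breaths/min")
```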
Citations: 0
Is self-supervised learning enough to fill in the gap? A study on speech inpainting
IF 3.4 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-12-16 · DOI: 10.1016/j.csl.2025.101922
Ihab Asaad, Maxime Jacquelin, Olivier Perrotin, Laurent Girin, Thomas Hueber
Speech inpainting consists of reconstructing corrupted or missing speech segments from the surrounding context, a process that closely resembles the pretext tasks used in Self-Supervised Learning (SSL) for speech encoders. This study investigates using SSL-trained speech encoders for inpainting without any additional training beyond the initial pretext task, simply adding a decoder to generate a waveform. We compare this approach to supervised fine-tuning of speech encoders for a downstream task (here, inpainting). Practically, we integrate HuBERT as the SSL encoder and HiFi-GAN as the decoder in two configurations: (1) fine-tuning the decoder to align with the frozen pre-trained encoder's output, and (2) fine-tuning the encoder for the inpainting task against a frozen decoder's input. Evaluations are conducted under single- and multi-speaker conditions using in-domain and out-of-domain datasets (including unseen speakers, diverse speaking styles, and noise). Both informed and blind inpainting scenarios are considered, where the position of the corrupted segment is either known or unknown. The proposed SSL-based methods are benchmarked against several baselines, including a text-informed method combining automatic speech recognition with zero-shot text-to-speech synthesis. Performance is assessed using objective metrics and perceptual evaluations. The results demonstrate that both approaches outperform the baselines, successfully reconstructing speech segments of up to 200 ms, and sometimes up to 400 ms. Notably, fine-tuning the SSL encoder achieves more accurate speech reconstruction in single-speaker settings, while a pre-trained encoder proves more effective for multi-speaker scenarios. This demonstrates that an SSL pretext task can transfer to speech inpainting, enabling successful speech reconstruction with a pre-trained encoder.
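The sketch below illustrates the first configuration in spirit (frozen SSL encoder, trainable decoder) using a public HuBERT checkpoint and a placeholder decoder in place of HiFi-GAN; the masking scheme, frame-to-sample mapping, and loss are assumptions, not the paper's setup.

```python
# Hedged sketch of the "frozen SSL encoder + trainable decoder" configuration:
# mask a span of the waveform, encode with a frozen HuBERT, and train a decoder
# to regenerate audio. The decoder is a placeholder module, not HiFi-GAN.
import torch
from transformers import HubertModel

encoder = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()
for p in encoder.parameters():
    p.requires_grad = False                  # frozen pretext-task encoder

decoder = torch.nn.Sequential(               # placeholder vocoder/decoder
    torch.nn.Linear(encoder.config.hidden_size, 512),
    torch.nn.GELU(),
    torch.nn.Linear(512, 320),               # 320 samples per 20 ms frame at 16 kHz
)

wav = torch.randn(1, 16000)                  # 1 s of (random) 16 kHz audio
corrupted = wav.clone()
corrupted[:, 6400:9600] = 0.0                # zero out a 200 ms segment

with torch.no_grad():
    hidden = encoder(corrupted).last_hidden_state    # (1, frames, hidden)
recon = decoder(hidden).flatten(1)                   # train against the clean wav
loss = torch.nn.functional.l1_loss(recon, wav[:, : recon.shape[1]])
```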
Citations: 0
Audiovisual speech enhancement and voice activity detection using generative and regressive visual features
IF 3.4 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-12-14 · DOI: 10.1016/j.csl.2025.101924
Cheng Yu, Vahid Ahmadi Kalkhorani, Buye Xu, DeLiang Wang
We present an audiovisual speech enhancement (AVSE) system to address two related tasks: speech enhancement (SE) and voice activity detection (VAD). The system is based on a complex spectral mapping model and performs two-stage audiovisual fusion. The first stage is a signal-level fusion module, where a generative lip-to-speech conversion method produces time-frequency (T-F) features from lip movements. This allows the system to leverage noise-free T-F representations, which are crucial for improving speech intelligibility, particularly in challenging acoustic environments. The second stage is an embedding-level fusion module, where high-dimensional embedding features from a jointly trained visual encoder are integrated. Additionally, we propose a multitask learning framework that optimizes both SE and VAD tasks. The inclusion of a VAD decoder enables the system to distinguish speech from non-speech segments. We evaluate the system on multiple benchmark datasets, including COG-MHEAR, LRS3-AudioSet, and LRS3-CHiME3, and achieve state-of-the-art SE and speech recognition results, and significant robustness in VAD compared to the audio-only baseline. These results highlight the effectiveness of our system in realistic environments.
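As a hedged sketch of the multitask idea, the toy model below fuses frame-aligned audio and visual features with a shared recurrent encoder and trains two heads, a T-F enhancement mask and a frame-level VAD classifier, under a combined loss; the layer choices and concatenation-based fusion are assumptions rather than the paper's two-stage architecture.

```python
# Hedged sketch: shared encoder over (audio, visual) features with two heads,
# a T-F mask for enhancement and a per-frame VAD logit, trained jointly.
import torch
import torch.nn as nn

class MultitaskAVSE(nn.Module):
    def __init__(self, audio_dim=257, visual_dim=512, hidden=256):
        super().__init__()
        self.fuse = nn.GRU(audio_dim + visual_dim, hidden, batch_first=True)
        self.se_head = nn.Linear(hidden, audio_dim)    # per-bin enhancement mask
        self.vad_head = nn.Linear(hidden, 1)           # per-frame speech logit

    def forward(self, audio_feats, visual_feats):
        h, _ = self.fuse(torch.cat([audio_feats, visual_feats], dim=-1))
        return torch.sigmoid(self.se_head(h)), self.vad_head(h).squeeze(-1)

model = MultitaskAVSE()
audio = torch.randn(2, 100, 257)        # (batch, frames, frequency bins)
visual = torch.randn(2, 100, 512)       # lip-derived embeddings, frame-aligned
mask, vad_logits = model(audio, visual)
clean_mag = torch.rand(2, 100, 257)
vad_labels = torch.randint(0, 2, (2, 100)).float()
loss = nn.functional.mse_loss(mask * audio.abs(), clean_mag) \
     + nn.functional.binary_cross_entropy_with_logits(vad_logits, vad_labels)
```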
Citations: 0
Survey of end-to-end multi-speaker automatic speech recognition for monaural audio
IF 3.4 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-12-11 · DOI: 10.1016/j.csl.2025.101925
Xinlu He, Jacob Whitehill
Monaural multi-speaker automatic speech recognition (ASR) remains challenging due to data scarcity and the intrinsic difficulty of recognizing and attributing words to individual speakers, particularly in overlapping speech. Recent advances have driven the shift from cascade systems to end-to-end (E2E) architectures, which reduce error propagation and better exploit the synergy between speech content and speaker identity. Despite rapid progress in E2E multi-speaker ASR, the field lacks a comprehensive review of recent developments. This survey provides a systematic taxonomy of E2E neural approaches for multi-speaker ASR, highlighting recent advances and comparative analysis. Specifically, we analyze: (1) architectural paradigms (single-input-multiple-output (SIMO) vs. single-input-single-output (SISO)) for pre-segmented audio, analyzing their distinct characteristics and trade-offs; (2) recent architectural and algorithmic improvements based on these two paradigms, including multi-modal inputs; (3) extensions to long-form speech, including segmentation strategy and speaker-consistent hypothesis stitching. Further, we (4) evaluate and compare methods across standard benchmarks. We conclude with a discussion of open challenges and future research directions towards building robust and scalable multi-speaker ASR.
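To make the SIMO/SISO distinction concrete, the hedged example below contrasts one-hypothesis-per-branch output with a single serialized stream separated by a speaker-change token; the token name and formatting are illustrative conventions, not notation from the survey.

```python
# Hedged illustration of the two output paradigms for overlapped speech:
# SIMO emits one hypothesis per output branch, while SISO emits a single
# stream with speaker-change markers. Token and texts are illustrative.
simo_output = [
    "yeah I think we should start",    # branch 1 -> speaker A
    "sorry go ahead",                  # branch 2 -> speaker B
]

SC = "<sc>"  # speaker-change token (name is an assumption)
siso_output = f"yeah I think we should start {SC} sorry go ahead"

def siso_to_streams(hypothesis, sc_token=SC):
    """Split a serialized single-stream hypothesis into per-speaker segments."""
    return [seg.strip() for seg in hypothesis.split(sc_token)]

assert siso_to_streams(siso_output) == simo_output
```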
Citations: 0
Enhanced audio-visual speech enhancement with posterior sampling methods in recurrent variational autoencoders
IF 3.4 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-12-06 · DOI: 10.1016/j.csl.2025.101923
Z. Foroushi, R.M. Dansereau
Recovering intelligible speech in noise is essential for robust communication. This work presents an audio-visual speech enhancement framework based on a Recurrent Variational Autoencoder (AV-RVAE), where posterior inference is extended using sampling-based methods including the Metropolis-Adjusted Langevin Algorithm (MALA), Langevin Dynamics EM (LDEM), Hamiltonian Monte Carlo (HMC), Barker sampling, and a hybrid MALA+Barker variant. To isolate the contribution of visual cues, an audio-only baseline (A-RVAE) is trained and evaluated under identical data and inference conditions.
Performance is assessed using Scale-Invariant Signal-to-Distortion Ratio (SI-SDR), Perceptual Evaluation of Speech Quality (PESQ), and Short-Time Objective Intelligibility (STOI), along with anytime convergence curves (metric versus wall-clock time) and the Real-Time Factor (RTF; ratio of runtime to audio duration) to measure computational efficiency.
Experimental results show that the hybrid MALA+Barker sampler achieves the best overall performance. While LDEM and step-size-optimized MALA exhibit the lowest RTFs, the MALA+Barker sampler offers the most favorable balance between efficiency and enhancement quality. Across all sampling strategies, the AV-RVAE consistently surpasses the audio-only baseline, particularly at low SNRs, confirming the benefit of visual fusion combined with advanced posterior sampling for robust speech enhancement in challenging acoustic environments.
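For reference, a single MALA update, the Langevin-gradient proposal followed by a Metropolis accept/reject correction, can be sketched as below; this is the textbook form for a generic log-density, not the paper's latent-space sampler.

```python
# Hedged sketch of one MALA step for sampling from a target log-density:
# a Langevin (gradient) proposal plus a Metropolis accept/reject correction.
import numpy as np

def mala_step(x, log_prob, grad_log_prob, step, rng):
    noise = rng.normal(size=x.shape)
    prop = x + 0.5 * step**2 * grad_log_prob(x) + step * noise
    # Log of the asymmetric Gaussian proposal density q(a | b).
    def log_q(a, b):
        diff = a - b - 0.5 * step**2 * grad_log_prob(b)
        return -np.sum(diff**2) / (2 * step**2)
    log_alpha = (log_prob(prop) + log_q(x, prop)) - (log_prob(x) + log_q(prop, x))
    if np.log(rng.uniform()) < log_alpha:
        return prop, True
    return x, False

# Toy usage: sample a standard Gaussian "latent".
rng = np.random.default_rng(0)
x = np.zeros(16)
for _ in range(1000):
    x, _ = mala_step(x, lambda z: -0.5 * z @ z, lambda z: -z, step=0.5, rng=rng)
```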
Citations: 0
Do modern speech LLMs and re-scoring techniques improve bilingual ASR performance for Basque and Spanish in domain-specific contexts?
IF 3.4 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-11-27 · DOI: 10.1016/j.csl.2025.101905
Ander González-Docasal, Juan Camilo Vásquez-Correa, Haritz Arzelus, Aitor Álvarez, Santiago A. Moreno-Acevedo
This paper presents an extended evaluation of Vicomtech’s automatic speech recognition (ASR) systems developed for the Albayzín 2024 Bilingual Basque-Spanish Speech-to-Text (BBS-S2T) Challenge, a task focused on transcribing bilingual parliamentary recordings featuring frequent intra- and inter-sentential code-switching between Basque and Spanish. These recordings, drawn from Basque Parliament plenary sessions, pose significant challenges due to the abrupt language alternations, the limited availability of digital resources for Basque, and the absence of contextual and speaker information. The study incorporates additional analysis of state-of-the-art ASR architectures, namely Phi4-multimodal and CrisperWhisper, fine-tuned on the challenge dataset. Furthermore, the systems were evaluated on a complementary benchmark to assess model robustness. A detailed comparison of automatic hypothesis selection techniques, including both traditional n-gram and large language model (LLM)-based approaches, is also provided. Results demonstrate that optimal word error rate (WER) does not always correlate with the most accurate transcriptions, highlighting the complexity of evaluating ASR performance in code-switching scenarios.
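A minimal sketch of n-best re-scoring as discussed above: each hypothesis's ASR score is interpolated with a language-model score and the best-scoring candidate is selected; the interpolation weight and the stand-in LM are assumptions, not the systems compared in the paper.

```python
# Hedged sketch of n-best hypothesis re-scoring: combine each hypothesis's
# ASR score with a language-model score and pick the best candidate.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    asr_logprob: float       # score from the ASR decoder

def rescore(nbest, lm_logprob, lm_weight=0.3):
    """Return the hypothesis maximizing an interpolated ASR + LM score."""
    return max(nbest, key=lambda h: (1 - lm_weight) * h.asr_logprob
                                    + lm_weight * lm_logprob(h.text))

# Toy usage with a stand-in LM that simply prefers shorter outputs.
nbest = [Hypothesis("eskerrik asko señor presidente", -12.1),
         Hypothesis("eskerrik asko senor presidente", -11.8)]
best = rescore(nbest, lm_logprob=lambda t: -0.5 * len(t.split()))
print(best.text)
```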
Citations: 0
Keyword Mamba: Spoken keyword spotting with state space models
IF 3.4 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-11-27 · DOI: 10.1016/j.csl.2025.101909
Hanyu Ding, Wenlong Dong, Qirong Mao
Keyword spotting (KWS) is an essential task in speech processing and is widely used in voice assistants and smart devices. Deep learning models such as CNNs, RNNs, and Transformers have performed well in KWS, but they often struggle to model long-range temporal patterns while remaining efficient. In this work, we present Keyword Mamba, a new architecture for KWS built on a neural state space model (SSM) called Mamba. We apply Mamba along the time axis and also explore how it can replace the self-attention component of Transformer models. We test our model on the Google Speech Commands datasets. The results show that Keyword Mamba achieves strong accuracy with fewer parameters and lower computational cost. To our knowledge, this is the first time a state space model has been used for KWS. These results suggest that Mamba has strong potential in speech-related tasks.
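The hedged sketch below shows the discretized linear state-space recurrence that SSM layers such as Mamba build on, x_k = A x_{k-1} + B u_k and y_k = C x_k, scanned over a feature sequence; Mamba's selective, input-dependent parameterization and fused scan are not shown, and all dimensions are illustrative.

```python
# Hedged sketch of the generic discrete linear SSM recurrence scanned over time.
# This shows only the plain scan, not Mamba's selective parameterization.
import numpy as np

def ssm_scan(A, B, C, u):
    """Run a discrete linear SSM over an input sequence u of shape (T, input_dim)."""
    T = u.shape[0]
    x = np.zeros(A.shape[0])
    y = np.zeros((T, C.shape[0]))
    for k in range(T):
        x = A @ x + B @ u[k]       # state update
        y[k] = C @ x               # readout
    return y

rng = np.random.default_rng(0)
state, din, dout, T = 8, 4, 2, 100
A = 0.9 * np.eye(state)            # stable toy dynamics
B = rng.normal(size=(state, din))
C = rng.normal(size=(dout, state))
features = rng.normal(size=(T, din))    # e.g., per-frame acoustic features
outputs = ssm_scan(A, B, C, features)   # (T, dout) sequence for a KWS classifier head
```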
Citations: 0