
Latest Publications in Computer Speech and Language

Entrainment detection using DNN
IF 3.4 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-07-01 | Epub Date: 2025-12-26 | DOI: 10.1016/j.csl.2025.101930
Jay Kejriwal, Štefan Beňuš, Lina M. Rojas-Barahona
During conversation, speakers adjust their linguistic characteristics to become more similar to their partners. This complex phenomenon is known as entrainment, and speakers dynamically entrain as well as disentrain on different linguistic features. Researchers have utilized a range of computational methods to explore entrainment. Recent technological advancements have facilitated the use of deep learning, which offers a systematic quantification of acoustic entrainment dynamics. In this study, we investigate the capability of deep learning architectures to extract and leverage textual features for the efficient representation and learning of entrainment. By adjusting the architecture of an acoustic-based DNN entrainment model, we present an unsupervised deep learning framework that derives representations from textual features containing relevant information for identifying entrainment at three linguistic levels: lexical, syntactic, and semantic. To investigate the performance of each model within the proposed framework, various text-based and speech features were extracted. Entrainment was quantified using different distance measures in the representation space. The performance of the trained models was evaluated by distinguishing real and sham conversations using the proposed distances. Our results suggest that acoustic-based DNN models outperform text-based DNN models and that distance measures affect the models’ performance.
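To make the distance-based evaluation concrete, the sketch below compares a real conversation pair with a sham pair in a shared representation space, where a smaller distance for the real pair is taken as evidence of entrainment. The 128-dimensional embeddings, the function name, and the choice of metrics are illustrative assumptions, not the paper's exact DNN representations or distance measures.

```python
import numpy as np

def entrainment_distance(emb_a, emb_b, metric="cosine"):
    """Distance between two partners' turn-level representations.
    Smaller values indicate stronger entrainment under this convention."""
    a, b = np.asarray(emb_a, dtype=float), np.asarray(emb_b, dtype=float)
    if metric == "cosine":
        return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    if metric == "l1":
        return float(np.abs(a - b).mean())
    return float(np.linalg.norm(a - b))  # Euclidean fallback

# Real pair vs. sham pair (partner drawn from a different conversation):
# a model "detects" entrainment when the real-pair distance is smaller.
rng = np.random.default_rng(0)
real_a, real_b = rng.normal(size=128), rng.normal(size=128)
sham_b = rng.normal(size=128)
print(entrainment_distance(real_a, real_b), entrainment_distance(real_a, sham_b))
```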
Citations: 0
Pitch-Aware multi-feature fusion for classifying statements, questions, and exclamations in low-resource languages
IF 3.4 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-07-01 | Epub Date: 2026-01-07 | DOI: 10.1016/j.csl.2026.101941
Ayub Othman Abdulrahman
Automatic classification of statements, questions, and exclamations is important for dialogue systems, speech analytics, language documentation, and other human-computer interaction tasks. Speech pitch and prosody are central cues for these categories, but pitch-based classification remains challenging due to speaker variability, recording conditions, and overlapping prosodic patterns across classes, especially in low-resource settings. We present an innovative multi-feature fusion architecture that combines pretrained wav2vec 2.0 raw-waveform embeddings (transfer learning), 40-dimensional Mel-Frequency Cepstral Coefficient (MFCC) features, and Mel-spectrogram representations into an integrated framework. Our work explicitly depends on pitch-related cues (captured primarily by the waveform embeddings and spectrogram branch) together with complementary MFCC spectral features, which jointly improve robustness. The model concatenates 128-dimensional representations from each branch and refines the fused vector with fully connected layers. This study leverages SQEBSP, a recently published pitch-annotated Kurdish speech dataset collected by the authors, comprising 12,660 utterances from 431 speakers, to evaluate statement, question, and exclamation classification. The proposed method achieves approximately 97% accuracy on the training/validation data, and about 88% accuracy on a separate held-out test set comprising 20% of the dataset, substantially outperforming single-feature baselines (58.8–79.3%) and prior three-class systems (68.0%). Ablation experiments confirm that the pitch-related inputs contribute substantially to classification accuracy, while MFCC features provide complementary spectral/timbre information. Our research indicates that the combination of pretrained wav2vec 2.0 representations with multi-feature fusion and supervised fine-tuning provides an efficient method for pitch-informed speech classification in low-resource scenarios.
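A minimal sketch of the fusion head described above: three branch representations (wav2vec 2.0, MFCC, Mel-spectrogram), each assumed to have been reduced to 128 dimensions upstream, are concatenated and refined by fully connected layers into a three-way classifier. The hidden size, dropout, and class ordering are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Late-fusion head over three 128-d branch embeddings
    (wav2vec 2.0, MFCC, Mel-spectrogram branches assumed upstream)."""
    def __init__(self, branch_dim=128, n_classes=3):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(3 * branch_dim, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, n_classes),
        )

    def forward(self, wav2vec_emb, mfcc_emb, melspec_emb):
        fused = torch.cat([wav2vec_emb, mfcc_emb, melspec_emb], dim=-1)
        return self.head(fused)

model = FusionClassifier()
logits = model(torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 128))
print(logits.shape)  # torch.Size([4, 3]): statement / question / exclamation
```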
Citations: 0
QuAVA: A privacy-aware architecture for conversational desktop Content Retrieval systems
IF 3.4 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-07-01 | Epub Date: 2026-02-02 | DOI: 10.1016/j.csl.2026.101950
Nikolaos Malamas, Andreas L. Symeonidis, John B. Theocharis
Question Answering (QA) and Content Retrieval (CR) systems have experienced a boost in performance in recent years by leveraging state-of-the-art Transformer models to process user expressions and to retrieve and extract the requested information. Despite the constant improvements in language understanding, very little effort has been put into the design of such systems for personal desktop use, where data are kept locally rather than sent to cloud services, and where decisions and outputs are transparent and explainable to the user. To that end, we present QuAVA, a conversational desktop content retrieval assistant designed on four pillars: privacy and security, explainability, low-resource requirements, and multi-source data fusion. QuAVA is a data- and privacy-preserving assistant that enables users to access their private data, such as files, emails, and message exchanges, conversationally and transparently. The proposed architecture automatically extracts and preprocesses content from various sources and organizes it in a 3-layered hierarchical structure, namely a topic, a subtopic, and a content layer, by employing ML algorithms for clustering and labeling. This way, users can navigate and access information via a set of conversation rules embedded in the assistant. We conduct a qualitative comparison of the QuAVA architecture with other well-established QA and CR architectures against the four pillars defined, as well as privacy tests, and conclude that QuAVA is, to our knowledge, the only virtual assistant that successfully satisfies them.
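The topic/subtopic/content hierarchy can be approximated with standard clustering, as in the toy sketch below; the TF-IDF features, KMeans clustering, and the example documents are assumptions made for illustration and are not necessarily the ML algorithms QuAVA actually employs.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["quarterly budget spreadsheet", "budget review email thread",
        "holiday photos from trip", "flight booking confirmation"]

# Layer 1: cluster all locally stored documents into coarse topics.
vec = TfidfVectorizer()
X = vec.fit_transform(docs)
topics = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Layer 2: within each topic, a second clustering pass would yield subtopics;
# Layer 3 keeps the raw content items themselves.
hierarchy = {}
for t, doc in zip(topics, docs):
    hierarchy.setdefault(int(t), []).append(doc)
print(hierarchy)
```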
Citations: 0
Enhanced audio-visual speech enhancement with posterior sampling methods in recurrent variational autoencoders
IF 3.4 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-07-01 | Epub Date: 2025-12-06 | DOI: 10.1016/j.csl.2025.101923
Z. Foroushi, R.M. Dansereau
Recovering intelligible speech in noise is essential for robust communication. This work presents an audio-visual speech enhancement framework based on a Recurrent Variational Autoencoder (AV-RVAE), where posterior inference is extended using sampling-based methods including the Metropolis-Adjusted Langevin Algorithm (MALA), Langevin Dynamics EM (LDEM), Hamiltonian Monte Carlo (HMC), Barker sampling, and a hybrid MALA+Barker variant. To isolate the contribution of visual cues, an audio-only baseline (A-RVAE) is trained and evaluated under identical data and inference conditions.
Performance is assessed using Scale-Invariant Signal-to-Distortion Ratio (SI-SDR), Perceptual Evaluation of Speech Quality (PESQ), and Short-Time Objective Intelligibility (STOI), along with anytime convergence curves (metric versus wall-clock time) and the Real-Time Factor (RTF; ratio of runtime to audio duration) to measure computational efficiency.
Experimental results show that the hybrid MALA+Barker sampler achieves the best overall performance; while LDEM and step-size-optimized MALA exhibit the lowest RTFs, the MALA+Barker sampler offers the most favorable balance between efficiency and enhancement quality. Across all sampling strategies, the AV-RVAE consistently surpasses the audio-only baseline, particularly at low SNRs, confirming the benefit of visual fusion combined with advanced posterior sampling for robust speech enhancement in challenging acoustic environments.
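For readers unfamiliar with MALA, the sketch below shows one Metropolis-Adjusted Langevin step on a latent vector given an arbitrary unnormalized log-posterior; in the paper that posterior would come from the AV-RVAE decoder and noise model, which are not reproduced here. The step size and the toy Gaussian target are illustrative assumptions.

```python
import torch

def mala_step(z, log_post, step=1e-2):
    """One Metropolis-Adjusted Langevin step on a latent z.
    log_post: callable returning a scalar (unnormalized) log-posterior."""
    z = z.detach().requires_grad_(True)
    lp = log_post(z)
    grad = torch.autograd.grad(lp, z)[0]
    mean_fwd = z + step * grad
    prop = mean_fwd + (2 * step) ** 0.5 * torch.randn_like(z)

    prop = prop.detach().requires_grad_(True)
    lp_prop = log_post(prop)
    grad_prop = torch.autograd.grad(lp_prop, prop)[0]
    mean_bwd = prop + step * grad_prop

    # Gaussian proposal densities: log q(z | prop) and log q(prop | z).
    log_q_fwd = -((prop - mean_fwd) ** 2).sum() / (4 * step)
    log_q_bwd = -((z - mean_bwd) ** 2).sum() / (4 * step)
    log_alpha = lp_prop + log_q_bwd - lp - log_q_fwd
    accept = bool(torch.rand(()).log() < log_alpha)
    return (prop if accept else z).detach(), accept

# Toy target: standard Gaussian posterior over a 16-d latent.
z = torch.zeros(16)
z, ok = mala_step(z, lambda v: -0.5 * (v ** 2).sum())
print(z.shape, ok)
```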
Citations: 0
Leveraging saliency-based pre-trained foundation model representations to uncover breathing patterns in speech
IF 3.4 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-07-01 | Epub Date: 2025-12-20 | DOI: 10.1016/j.csl.2025.101926
Vikramjit Mitra, Anirban Chatterjee, Ke Zhai, Helen Weng, Ayuko Hill, Nicole Hay, Christopher Webb, Jamie Cheng, Erdrin Azemi
The process of human speech production involves coordinated respiratory action to elicit acoustic speech signals. Typically, speech is produced when air is forced from the lungs and is modulated by the vocal tract, where such actions are interspersed by moments of breathing in air (inhalation) to refill the lungs again. Respiratory rate (RR) is a vital metric that is used to assess the overall health, fitness, and general well-being of an individual. Existing approaches to measure RR (the number of breaths one takes in a minute) rely on specialized equipment or training. Studies have demonstrated that machine learning algorithms can be used to estimate RR using bio-sensor signals as input. Speech-based estimation of RR can offer an effective approach to measure this vital metric without requiring any specialized equipment or sensors. This work investigates a machine learning-based approach to estimate RR from speech segments obtained from subjects speaking into a close-talking microphone device. Data were collected from N=26 individuals, where the ground-truth RR was obtained through commercial-grade chest belts and then manually corrected for any errors. A convolutional long short-term memory network (Conv-LSTM) is proposed to estimate respiration time-series data from the speech signal. We demonstrate that pre-trained representations obtained from a foundation model, such as Wav2Vec2, can be used to estimate the respiration time series with low root-mean-squared error and high correlation coefficient when compared with the baseline. The model-driven time series can be used to estimate RR with a low mean absolute error (MAE) of approximately 1.6 breaths/min.
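One simple way to turn a predicted respiration waveform into an RR estimate is peak counting, sketched below with a synthetic signal; the sampling rate, peak-distance constraint, and synthetic waveform are assumptions and do not reproduce the paper's Conv-LSTM output or post-processing.

```python
import numpy as np
from scipy.signal import find_peaks

def respiratory_rate(resp_signal, fs):
    """Estimate breaths/min from a (model-predicted) respiration waveform
    by counting inhalation peaks; fs is the waveform sampling rate in Hz."""
    # Require peaks to be at least ~1.5 s apart (i.e. < 40 breaths/min).
    peaks, _ = find_peaks(resp_signal, distance=int(1.5 * fs))
    duration_min = len(resp_signal) / fs / 60.0
    return len(peaks) / duration_min

# Synthetic 60 s waveform at 25 Hz with a 0.27 Hz breathing cycle (~16 bpm).
fs = 25
t = np.arange(0, 60, 1 / fs)
sig = np.sin(2 * np.pi * 0.27 * t) + 0.05 * np.random.randn(t.size)
print(round(respiratory_rate(sig, fs), 1))
```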
Citations: 0
Decoding phone pairs from MEG signals across speech modalities
IF 3.4 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-07-01 | Epub Date: 2026-01-05 | DOI: 10.1016/j.csl.2026.101939
Xabier de Zuazo, Eva Navas, Ibon Saratxaga, Mathieu Bourguignon, Nicola Molinaro
Understanding the neural mechanisms underlying speech production is essential for both advancing cognitive neuroscience theory and developing practical communication technologies. In this study, we investigated magnetoencephalography (MEG) signals to perform binary phone-pair classification from brain activity during speech production and perception (passive listening and voice playback) tasks. Using a dataset comprising 18 participants, we performed pairwise phone classification, extending our analysis to 20 phonetic pairs. Multiple machine learning approaches, including regularized linear models and neural network architectures, were compared to determine their effectiveness in decoding phonetic information. Our results demonstrate significantly higher decoding accuracy during speech production (73.4%) compared to passive listening and playback modalities (approximately 51%), emphasizing the richer neural information available during overt speech. Among the models, the Elastic Net classifier consistently outperformed more complex neural networks, highlighting the effectiveness of traditional regularization techniques when applied to limited and high-dimensional MEG datasets. Besides, analysis of specific brain frequency bands revealed that low-frequency oscillations, particularly Delta (0.2 Hz to 3 Hz) and Theta (4 Hz to 7 Hz), contributed the most substantially to decoding accuracy, suggesting that these bands encode critical speech production-related neural processes. Despite using advanced denoising methods, it remains unclear whether decoding solely reflects neural activity or if residual muscular or movement artifacts also contribute, indicating the need for further methodological refinement. Overall, our findings underline the critical importance of examining overt speech production paradigms, which, despite their complexity, offer opportunities to improve brain-computer interfaces to help individuals with severe speech impairments.
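As a rough, self-contained stand-in for the winning decoder, the snippet below fits an elastic-net-penalized logistic regression on standardized, flattened features, which is how scikit-learn exposes Elastic Net classification; the random data, dimensionality, and hyperparameters are placeholders and do not reflect the study's MEG preprocessing or configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for flattened MEG trials (n_trials x n_channels*n_times);
# real dimensionality would be far higher than in this toy example.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))
y = rng.integers(0, 2, size=200)          # binary phone-pair labels

clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="elasticnet", solver="saga",
                       l1_ratio=0.5, C=1.0, max_iter=5000),
)
print(cross_val_score(clf, X, y, cv=5).mean())
```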
Citations: 0
Artificial protozoa lotus effect algorithm enabled cognitive brain optimal model for sentiment analysis utilizing multimodal data
IF 3.4 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-07-01 | Epub Date: 2025-12-22 | DOI: 10.1016/j.csl.2025.101929
Sanjeevkumar Angadi, Saili Hemant Sable, Tejaswini Zope, Rajani Amol Hemade, Vaibhavi Umesh Avachat
Understanding public sentiment derived from online data is a challenging research problem with numerous applications, including contextual analysis and opinion assessment on specific events. Traditionally, sentiment analysis has concentrated on a single modality, such as text or images. However, utilizing multimodal information such as images, text, and audio can enhance model accuracy. Despite this advantage, combining visual and textual features often leads to decreased performance, mainly because models fail to efficiently capture the intricate relationships amongst diverse modalities. To confront these challenges, a new technique named the Artificial Protozoa Lotus Effect Algorithm Cognitive Brain Optimal Model (APLEA_CBO) has been developed for sentiment analysis using multimodal data. Initially, feature extraction is performed on audio data to obtain feature vector outcome-1. Similarly, feature extraction is conducted on the input text to extract suitable features, which are considered outcome-2. Both feature sets are then processed for sentiment analysis using the Cognitive Brain Optimal Model (CBOM), which is developed by employing Recurrent Denoising Long Short-Term Memory (RD-LSTM). The CBOM is trained using the Artificial Protozoa Lotus Effect Algorithm (APLEA), which integrates Artificial Protozoa Optimization (APO) and the Lotus Effect Algorithm (LEA). The APLEA_CBO model achieves an FPR of 7.17%, a recall of 92.76%, a precision of 90.62%, and an accuracy of 90.60%.
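For reference, the reported figures are standard confusion-matrix quantities; the snippet below computes accuracy, precision, recall, and FPR for a binary labeling, with toy labels used purely as placeholders (it does not reproduce the APLEA_CBO pipeline).

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and FPR from binary labels
    (1 = positive sentiment class)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "fpr": fp / (fp + tn),
    }

print(binary_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0]))
```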
Citations: 0
Emotion-guided cross-modal alignment for multimodal depression detection
IF 3.4 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-07-01 | Epub Date: 2026-01-29 | DOI: 10.1016/j.csl.2026.101951
Wenzhe Jia, Yuhang Wang, Yahui Kang
Depression detection from multimodal data is crucial for early intervention and mental health monitoring. Existing systems, however, face three challenges: (i) capturing subtle affective cues that distinguish depressive states from normal emotional variations, (ii) establishing reliable correspondence between heterogeneous speech and text modalities, and (iii) handling severe class imbalance in real-world corpora. To address these challenges, we propose a framework that integrates explicit emotion supervision, cross-modal alignment, and metric-oriented optimization for robust multimodal depression detection. Acoustic and lexical features are augmented with emotion-category embeddings derived from supervision signals to provide affective context, while semantic correspondence is reinforced through a contrastive alignment objective. To mitigate imbalance, we directly optimize macro-F1 with the Lovász loss. On the Emotional Audio-Textual Depression Corpus (EATD-Corpus), our framework achieves 87.40% ± 0.46% macro-F1 with dataset-provided emotions and 83.15% with predicted emotions, compared to 71.82% without emotion information. Cross-dataset evaluation on the Distress Analysis Interview Corpus – Wizard of Oz (DAIC-WOZ) shows consistent gains, including a 12.34% F1 improvement with emotion augmentation. This integrated approach—combining emotion supervision, cross-modal alignment, and metric-oriented optimization—represents a novel contribution to depression detection. Our framework provides a practical and robust solution for real-world multimodal depression detection.
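A contrastive alignment objective of this kind is typically close in spirit to a symmetric InfoNCE loss over paired speech and text embeddings, sketched below; the temperature, embedding size, and symmetric formulation are assumptions rather than the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(speech_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style objective: the i-th speech embedding should
    match the i-th text embedding of the same utterance and no other."""
    s = F.normalize(speech_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = s @ t.T / temperature                 # (B, B) similarity matrix
    targets = torch.arange(s.size(0), device=s.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```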
Citations: 0
On the use of DiaPer models and matching algorithm for RTVE speaker diarization 2024 dataset
IF 3.4 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-07-01 | Epub Date: 2026-02-06 | DOI: 10.1016/j.csl.2026.101948
Juan Ignacio Alvarez-Trejos, Sara Barahona, Laura Herrera-Alarcon, Jérémie Touati, Alicia Lozano-Diez
Speaker diarization in broadcast media presents significant challenges due to long-duration recordings, numerous speakers, and complex acoustic conditions. End-to-end neural diarization models like DiaPer (Diarization with Perceiver), which directly predict speaker activity from audio features without intermediate clustering steps, have shown promising results. However, their application to extended recordings remains computationally prohibitive due to quadratic complexity with respect to input length. This paper addresses these limitations by proposing a framework that applies DiaPer to short audio chunks and subsequently reconciles speaker identities across segments using a matching algorithm. We systematically analyze optimal chunk durations for DiaPer processing and introduce an enhanced chunk-matching algorithm leveraging state-of-the-art speaker embeddings, comparing Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network (ECAPA-TDNN), Residual Networks (ResNet), and Reshape Dimensions Network (ReDimNet) architectures. Our experimental evaluation on the challenging Radio Televisión Española (RTVE) datasets shows that ReDimNet embeddings consistently outperform alternatives, achieving substantial improvements in speaker identity consistency across segments. The proposed approach yields a Diarization Error Rate (DER) of 17.34% on the RTVE 2024 test set, which is competitive with state-of-the-art systems while achieving a 63.6% relative improvement over the baseline DiaPer model applied directly to complete audio recordings. This demonstrates that end-to-end neural approaches can be successfully extended to hour-long recordings while maintaining computational efficiency.
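Reconciling per-chunk speaker labels with already-seen global speakers can be cast as an assignment problem over embedding similarities, as in the sketch below; the cosine-similarity cost, the Hungarian assignment, and the 192-dimensional vectors are illustrative assumptions, not necessarily the paper's exact matching algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_chunk_speakers(global_centroids, chunk_embeddings):
    """Map local speaker labels of one chunk to already-seen global speakers
    by maximizing cosine similarity with the Hungarian algorithm."""
    G = global_centroids / np.linalg.norm(global_centroids, axis=1, keepdims=True)
    C = chunk_embeddings / np.linalg.norm(chunk_embeddings, axis=1, keepdims=True)
    cost = -C @ G.T                     # negate similarity to get an assignment cost
    rows, cols = linear_sum_assignment(cost)
    return dict(zip(rows.tolist(), cols.tolist()))  # local index -> global index

rng = np.random.default_rng(0)
global_centroids = rng.normal(size=(3, 192))          # speaker-embedding-sized vectors
chunk_embeddings = global_centroids[[2, 0, 1]] + 0.05 * rng.normal(size=(3, 192))
print(match_chunk_speakers(global_centroids, chunk_embeddings))  # {0: 2, 1: 0, 2: 1}
```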
Citations: 0
Keyword Mamba: Spoken keyword spotting with state space models
IF 3.4 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-07-01 | Epub Date: 2025-11-27 | DOI: 10.1016/j.csl.2025.101909
Hanyu Ding, Wenlong Dong, Qirong Mao
Keyword spotting (KWS) is an essential task in speech processing. It is widely used in voice assistants and smart devices. Deep learning models like CNNs, RNNs, and Transformers have performed well in KWS. However, they often struggle to handle long-term patterns and stay efficient at the same time. In this work, we present Keyword Mamba, a new architecture for KWS. It uses a neural state space model (SSM) called Mamba. We apply Mamba along the time axis and also explore how it can replace the self-attention part in Transformer models. We test our model on the Google Speech Commands datasets. The results show that Keyword Mamba reaches strong accuracy with fewer parameters and lower computational cost. To our knowledge, this is the first time a state space model has been used for KWS. These results suggest that Mamba has strong potential in speech-related tasks.
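A state space model in this setting is a learned linear recurrence scanned along time; the sketch below shows the basic discrete recurrence with fixed matrices, whereas selective-scan models such as Mamba make the parameters input-dependent. The dimensions and random parameters are placeholders, not Keyword Mamba's configuration.

```python
import numpy as np

def ssm_scan(u, A, B, C):
    """Minimal discrete state-space recurrence x_t = A x_{t-1} + B u_t,
    y_t = C x_t, scanned along the time axis of an acoustic feature sequence."""
    T, _ = u.shape
    N = A.shape[0]
    x = np.zeros(N)
    ys = []
    for t in range(T):
        x = A @ x + B @ u[t]
        ys.append(C @ x)
    return np.stack(ys)

rng = np.random.default_rng(0)
T, D, N = 50, 40, 16                 # frames, feature dim (e.g. MFCCs), state dim
u = rng.normal(size=(T, D))
A = 0.9 * np.eye(N)                  # stable state transition
B = rng.normal(size=(N, D)) * 0.1
C = rng.normal(size=(D, N)) * 0.1
print(ssm_scan(u, A, B, C).shape)    # (50, 40)
```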
Citations: 0