
Computer Speech and Language: latest publications

Comparative study on noise-augmented training and its effect on adversarial robustness in ASR systems
IF 3.4 CAS Region 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-08-26 DOI: 10.1016/j.csl.2025.101869
Karla Pizzi, Matías Pizarro, Asja Fischer
In this study, we investigate whether noise-augmented training can concurrently improve adversarial robustness in automatic speech recognition (ASR) systems. We conduct a comparative analysis of the adversarial robustness of four different ASR architectures, each trained under three different augmentation conditions: (1) background noise, speed variations, and reverberations; (2) speed variations only; (3) no data augmentation. We then evaluate the robustness of all resulting models against attacks with white-box or black-box adversarial examples. Our results demonstrate that noise augmentation not only enhances model performance on noisy speech but also improves the model’s robustness to adversarial attacks.
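The core augmentation step described above is additive noise mixing at a controlled signal-to-noise ratio. The sketch below illustrates that step only (speed perturbation and reverberation are omitted); it is not the authors' implementation, and the waveforms, sample rate, and SNR range are placeholder assumptions.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Additively mix a background-noise clip into a speech signal at a target SNR (dB)."""
    # Loop or trim the noise so it covers the whole utterance.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Choose a gain so that 10 * log10(speech_power / (gain**2 * noise_power)) == snr_db.
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise

# Hypothetical usage: one noise-augmented copy per utterance, with the SNR drawn from a range.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000).astype(np.float32)   # stand-in for a 1-second utterance at 16 kHz
babble = rng.standard_normal(48000).astype(np.float32)  # stand-in for a background-noise clip
augmented = mix_at_snr(clean, babble, snr_db=rng.uniform(0.0, 20.0))
```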
Citations: 0
A novel approach to cross-linguistic transfer learning for hope speech detection in Tamil and Malayalam
IF 3.4 CAS Region 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-08-18 DOI: 10.1016/j.csl.2025.101870
Jothi Prakash V., Arul Antran Vijay S.
In the field of Natural Language Processing (NLP), accurately identifying hope speech in low-resource languages such as Tamil and Malayalam poses significant challenges. This research introduces the Sentimix Transformer (SentT), a novel transformer-based model designed for detecting hope speech in YouTube comments composed in Tamil and Malayalam, two linguistically rich but computationally low-resource languages. The SentT model innovatively combines multilingual BERT (mBERT) embeddings with specialized cultural and code-mixing adaptations to effectively process the linguistic diversity and complexities inherent in code-mixed data. This approach allows SentT to capture nuanced expressions of hope by integrating domain-specific knowledge into the transformer framework. Our methodology extends traditional transformer architectures by incorporating a unique ensemble of embeddings that encapsulate linguistic, cultural, and code-mixing attributes, significantly enhancing the model’s sensitivity to context and cultural idioms. We validate our approach using the Hope Speech dataset for Equality, Diversity, and Inclusion (HopeEDI), which includes diverse comments from social media. The SentT model achieves an impressive accuracy of 93.4%, a precision of 92.7%, and a recall of 94.1%, outperforming existing models and demonstrating its efficacy in handling the subtleties of hope speech in multilingual contexts. The model’s architecture and the results of extensive evaluations underscore not only its effectiveness but also its potential as a scalable solution for similar tasks in other low-resource languages. Through this research, we contribute to the broader field of sentiment analysis by demonstrating the potential of tailored, context-aware models in enhancing digital communication’s positivity and inclusiveness.
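As a rough illustration of the mBERT backbone the abstract builds on, the sketch below fine-tunes a stock multilingual BERT classifier on code-mixed comments; the label set, example texts, and hyperparameters are assumptions, and the paper's cultural and code-mixing adaptation layers are not reproduced here.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Three-way labels are an assumption (e.g., hope / not-hope / other).
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=3
)

comments = ["நம்பிக்கை இருக்கட்டும்", "everything will be fine nanba"]  # placeholder code-mixed comments
batch = tokenizer(comments, padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([0, 0])  # 0 = hope speech (placeholder labels)

out = model(**batch, labels=labels)  # forward pass returns logits and the cross-entropy loss
out.loss.backward()                  # an optimizer step would follow in a real training loop
```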
Citations: 0
Real-time audio enhancement framework for vocal performances based on LSTM and time-frequency masking algorithm
IF 3.4 CAS Region 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-08-11 DOI: 10.1016/j.csl.2025.101871
Zan Huang
This study proposes a new framework for real-time enhancement of vocal performances based on a long short-term memory (LSTM) network and a time-frequency masking algorithm. The framework primarily addresses the contradiction between non-stationary noise suppression and audio fidelity in complex acoustic scenes. The key innovations of this study are: 1. A real-time enhancement model combining LSTM and ideal ratio masking (IRM). The study uses an LSTM to model long-term dependencies in time-frequency features, combining it with an IRM algorithm that dynamically adjusts noise weights. This fusion significantly improves the clarity and intelligibility of audio signals in complex backgrounds. Experiments show that, within a signal-to-noise ratio range of -10 to 5 dB, the model's PESQ and STOI indicators improve to 3.75 and 0.893, respectively. 2. An adaptive time-frequency masking algorithm. The study proposes an adaptive masking mechanism based on the dynamic weight of the signal-to-noise ratio, resolving the trade-off between ideal binary masking and IRM, as well as between distortion and noise suppression. 3. Masking coefficient optimization driven by a deep neural network. The study presents a bidirectional LSTM time-frequency processing module (TFPM) that hierarchically models intra-frame and inter-frame features. At the same time, a composite LSTM ratio masking (LSTM-RM) objective function is introduced to enhance both the amplitude and phase spectra simultaneously. Through end-to-end training, the proposed framework solves the real-time problem and demonstrates stable enhancement effects on ten types of noise test sets. The study provides a scalable algorithmic paradigm for real-time audio enhancement.
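For readers unfamiliar with ideal ratio masking, the sketch below shows how IRM training targets are computed from parallel clean/noise magnitude spectrograms and how a plain (unidirectional) LSTM can be trained to predict them; the bidirectional TFPM, the composite LSTM-RM loss, and all dimensions here are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

def ideal_ratio_mask(speech_mag: torch.Tensor, noise_mag: torch.Tensor) -> torch.Tensor:
    # IRM = sqrt(S^2 / (S^2 + N^2)), computed per time-frequency bin.
    s2, n2 = speech_mag ** 2, noise_mag ** 2
    return torch.sqrt(s2 / (s2 + n2 + 1e-8))

class LSTMMasker(nn.Module):
    def __init__(self, n_bins: int = 257, hidden: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(n_bins, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Sequential(nn.Linear(hidden, n_bins), nn.Sigmoid())

    def forward(self, noisy_mag: torch.Tensor) -> torch.Tensor:  # (batch, frames, n_bins)
        h, _ = self.lstm(noisy_mag)
        return self.proj(h)                                      # predicted mask in [0, 1]

# Placeholder magnitude spectrograms standing in for STFT features of paired clean/noise/noisy audio.
noisy = torch.rand(4, 100, 257)
target = ideal_ratio_mask(torch.rand(4, 100, 257), torch.rand(4, 100, 257))
loss = nn.functional.mse_loss(LSTMMasker()(noisy), target)
loss.backward()
```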
Citations: 0
Automatic speech-based alcohol intoxication detection for automotive safety applications
IF 3.4 CAS Region 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-08-09 DOI: 10.1016/j.csl.2025.101872
Brian Stasak, Julien Epps
There is a responsibility to advance automatic alcohol intoxication screening capabilities in modern automobiles to reduce the high rate of alcohol-related accidents and fatalities worldwide. Automatic speech-based alcohol intoxication screening offers a tremendous safety opportunity in the automotive industry due to its non-invasive convenience, comparatively inexpensive cost, and rapid result processing. Using the Alcohol Language Corpus (ALC), this study examines automatic alcohol intoxication classification based on participants’ non-intoxicated/intoxicated omni-microphone speech recordings. Experiments with many different speech features (e.g., glottal, landmarks, linguistic, prosodic, spectral, syllabic, vocal tract coordination) across different blood alcohol concentration (BAC) ranges and specific verbal tasks show significant changes as participants' BAC increases. Intoxicated participants produce a lower average fundamental frequency (F0), with increases in F0 frequency modulation and in breathy and creaky voice qualities, when compared to their non-intoxicated recordings. For the picture description and tongue twister tasks, manual irregularity disfluency and pause linguistic features significantly increase in intoxicated recordings. Further, for all verbal tasks, automatically extracted syllabic pause features show a significant increase in intoxicated recordings. Implementation of a task-dependent support vector machine classifier model with a ≥0.001 BAC 'intoxication' sensitivity threshold improves alcohol classification by up to 8% absolute over a task-agnostic approach. Moreover, intoxication classification results demonstrate that task-dependent modeling with majority-vote decisions improves classification accuracy by up to 20% absolute, depending on the task, when compared to file-by-file task-agnostic results reported previously in ALC baseline studies that used higher-quality headset microphone recordings.
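A minimal sketch of the task-dependent classifier idea follows: one SVM per verbal task plus a majority vote over the per-task decisions. The feature vectors, task names, and label generation are placeholders (real labels would come from measured BAC against the ≥0.001 threshold), so this shows the setup rather than the authors' system.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
tasks = ["picture_description", "tongue_twister", "read_speech"]  # placeholder task names

models = {}
for task in tasks:
    X = rng.standard_normal((200, 64))       # placeholder per-recording acoustic/linguistic features
    bac = rng.uniform(0.0, 0.002, 200)       # placeholder measured BAC values
    y = (bac >= 0.001).astype(int)           # 'intoxicated' label at the BAC sensitivity threshold
    models[task] = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)).fit(X, y)

# Majority vote across the per-task decisions for one speaker's session.
votes = [int(models[t].predict(rng.standard_normal((1, 64)))[0]) for t in tasks]
session_label = int(sum(votes) > len(votes) / 2)
```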
Citations: 0
SCoT2S: Self-correcting Text-to-SQL parsing by leveraging LLMs
IF 3.4 CAS Region 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-07-31 DOI: 10.1016/j.csl.2025.101865
Chunlin Zhu, Yuming Lin, Yaojun Cai, You Li
Text-to-SQL parsing, which converts natural language questions into executable SQL queries, has emerged as a critical technology for enabling non-technical users to interact with databases effectively. Although recent advances in this field have shown promise, existing models still struggle with complex semantic understanding and accurate SQL generation, particularly in handling schema relationships and join operations. To address these challenges, we propose SCoT2S (Self-Correcting Text-to-SQL), a novel framework that leverages large language models to automatically identify and rectify errors in SQL query generation. Through systematic error analysis of existing Text-to-SQL models, we identify that schema linking and join operations account for more than 70% of parsing errors. Our SCoT2S framework addresses these issues through a three-stage approach: initial SQL generation, comprehensive error detection, and targeted correction using large language models. This approach enables real-time error identification and correction during the parsing process. Extensive experiments demonstrate the effectiveness of the proposed SCoT2S in the Spider benchmark data set. Specifically, SCoT2S shows significant improvements, with a 2.8% increase in EM scores and a 4.0% increase in EX scores compared to current state-of-the-art methods.
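The three-stage loop can be pictured as generate, execute to detect errors, and re-prompt with the error message. The skeleton below is only a structural sketch: `ask_llm` is a hypothetical stand-in for any LLM client, and execution against SQLite replaces the paper's richer error analysis (schema linking and join checks).

```python
import sqlite3

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an actual LLM client here")

def self_correcting_text_to_sql(question: str, schema: str, db_path: str, max_rounds: int = 3) -> str:
    # Stage 1: initial SQL generation from the question and schema.
    sql = ask_llm(f"Schema:\n{schema}\nQuestion: {question}\nWrite one SQL query.")
    conn = sqlite3.connect(db_path)
    try:
        for _ in range(max_rounds):
            try:
                conn.execute(sql)          # Stage 2: detect errors by executing the candidate query.
                break
            except sqlite3.Error as err:   # Stage 3: targeted correction using the error message.
                sql = ask_llm(
                    f"Schema:\n{schema}\nQuestion: {question}\n"
                    f"This SQL failed with '{err}':\n{sql}\nReturn a corrected query."
                )
    finally:
        conn.close()
    return sql
```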
Citations: 0
Three-stage modular speaker diarization collaborating with front-end techniques in the CHiME-8 NOTSOFAR-1 challenge
IF 3.4 CAS Region 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-07-28 DOI: 10.1016/j.csl.2025.101863
Ruo-Yu Wang, Jun Du, Shu-Tong Niu, Gao-Bin Yang, Tian Gao, Jia Pan, Qing-Feng Liu
We propose a modular speaker diarization framework that collaborates with front-end techniques in a three-stage process, designed for the challenging CHiME-8 NOTSOFAR-1 acoustic environment. The framework leverages the strengths of deep learning based speech separation systems and traditional speech signal processing techniques to provide more accurate initializations for the Neural Speaker Diarization (NSD) system at each stage, thereby enhancing the performance of a single-channel NSD system. Firstly, speaker overlap detection and Continuous Speech Separation (CSS) are applied to the multichannel speech to obtain clearer single-speaker speech segments for the Clustering-based Speaker Diarization (CSD), followed by the first NSD decoding. Next, the binary speaker masks from the first decoding are used to initialize a complex Angular Center Gaussian Mixture Model (cACGMM) to estimate speaker masks on the multi-channel speech. Using Mask-to-VAD post-processing techniques, we achieve per-speaker speech activity with reduced speaker error (SpkErr), followed by a second NSD decoding. Finally, the second decoding results are used for Guided Source Separation (GSS) to produce per-speaker speech segments. Short utterances containing one word or fewer are filtered, and the remaining speech segments are re-clustered for the final NSD decoding. We present evaluation results progressively explored from the CHiME-8 NOTSOFAR-1 challenge, demonstrating the effectiveness of our modular diarization system and its contribution to improving speech recognition performance. The code will be open-sourced at https://github.com/rywang99/USTC-NERCSLIP_CHiME-8.
Citations: 0
Time–Frequency Causal Hidden Markov Model for speech-based Alzheimer’s disease longitudinal detection
IF 3.1 CAS Region 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-07-19 DOI: 10.1016/j.csl.2025.101862
Yilin Pan, Jiabing Li, Yating Zhang, Zhuoran Tian, Yijia Zhang, Mingyu Lu
Speech deterioration is an early indicator in individuals with Alzheimer’s disease (AD), with progression influenced by various factors, leading to unique trajectories for each individual. To facilitate automated longitudinal detection of AD using speech, we propose an enhanced Hidden Markov Model (HMM), termed the Time-Frequency Causal HMM (TF-CHMM), which models disease-causative acoustic features over time under the Markov property. The TF-CHMM integrates a parallel convolutional neural network as an encoder for spectrograms, extracting both time-domain and frequency-domain features from audio recordings linked to AD. Additionally, it incorporates personal attributes (e.g., age) and clinical diagnosis data (e.g., MMSE scores) as supplementary inputs, disentangling disease-related features from unrelated components through a sequential variational auto-encoder with causal inference. The TF-CHMM is evaluated using the Pitt Corpus, which includes annual visits for each subject with a variable number of longitudinal samples, comprising audio recordings, manual transcriptions, MMSE scores, and age information. Experimental results demonstrated the effectiveness of our designed system, achieving a competitive accuracy of 90.24% and an F1 score of 90.00%. An ablation study further highlighted the efficiency of the parallel convolutional kernels in extracting time–frequency information and emphasized the effectiveness of our longitudinal experimental setup in the AD detection system.
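The Markov backbone of such a model can be illustrated with a plain forward recursion over a subject's yearly visits. The sketch below uses two latent states and hand-picked numbers purely for illustration; in the paper the per-visit emission likelihoods would come from the spectrogram encoder and causal disentanglement, none of which is reproduced here.

```python
import numpy as np

pi = np.array([0.7, 0.3])                 # initial state distribution over [healthy, AD] (illustrative)
A = np.array([[0.9, 0.1],                 # healthy -> healthy / AD
              [0.0, 1.0]])                # AD treated as absorbing (progression only)
emission_lik = np.array([[0.8, 0.2],      # p(observation_t | state), one row per annual visit;
                         [0.6, 0.4],      # placeholder likelihoods standing in for a speech encoder
                         [0.3, 0.7]])

alpha = pi * emission_lik[0]              # forward recursion: alpha_t(j) = sum_i alpha_{t-1}(i) * A[i, j] * b_j(o_t)
for t in range(1, len(emission_lik)):
    alpha = (alpha @ A) * emission_lik[t]

posterior_ad = alpha[1] / alpha.sum()     # filtered probability that the latest visit is in the AD state
print(round(float(posterior_ad), 3))
```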
Citations: 0
Linguistically informed automatic speech recognition in Sanskrit
IF 3.1 CAS Region 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-07-12 DOI: 10.1016/j.csl.2025.101861
Rishabh Kumar, Devaraja Adiga, Rishav Ranjan, Amrith Krishna, Ganesh Ramakrishnan, Pawan Goyal, Preethi Jyothi
The field of Automatic Speech Recognition (ASR) for Sanskrit is marked by distinctive challenges, primarily due to the language’s intricate linguistic and morphological characteristics. Recognizing the burgeoning interest in this domain, we present the ‘Vāksañcayah’ speech corpus, a comprehensive collection that captures the linguistic depth and complexities of Sanskrit. Building upon our prior work, which focused on various acoustic model (AM) and language model (LM) units, we present an enhanced ASR system. This system integrates innovative subword tokenization methods and enriches the search space with linguistic insights. Addressing the issue of high out-of-vocabulary (OOV) rates and the prevalence of infrequently used words in Sanskrit, we employed a subword-based language model. Our approach mitigates these challenges and facilitates the generation of a subword-based search space. While effective in numerous scenarios, this model encounters limitations regarding long-range dependencies and semantic context comprehension. To counter these limitations, we leveraged Sanskrit’s rich morphological framework, thus achieving a more holistic understanding. The subword-based search space is subsequently transformed into a word-based format and augmented with morphological and lexical data, derived from a lexically driven shallow parser. Enhancing this further, we rescore transitions within this enriched space using a supervised morphological parser specifically designed for Sanskrit. Our proposed methodology is currently acclaimed as the most advanced in the realm of Sanskrit ASR, achieving a Word Error Rate (WER) of 12.54 and an improvement of 3.77 absolute points over the previous best. Additionally, we annotated 500 utterances with detailed morphological data and their corresponding lemmas, providing a basis for extensive linguistic analysis.
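To make the subword idea concrete, the sketch below trains a small SentencePiece model and tokenizes a Sanskrit sentence into subword units, which is one common way to keep OOV rates low; the corpus path, vocabulary size, and choice of BPE are assumptions and do not necessarily reflect the units or tooling used in the paper.

```python
import sentencepiece as spm

# Assumes a plain-text Sanskrit corpus, one sentence per line, at this (hypothetical) path.
spm.SentencePieceTrainer.train(
    input="sanskrit_corpus.txt",
    model_prefix="sanskrit_subword",
    vocab_size=2000,
    model_type="bpe",
    character_coverage=1.0,
)

sp = spm.SentencePieceProcessor(model_file="sanskrit_subword.model")
pieces = sp.encode("धर्मक्षेत्रे कुरुक्षेत्रे समवेता युयुत्सवः", out_type=str)
print(pieces)  # subword units that a subword-based LM and ASR search space would operate on
```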
Citations: 0
Toward fast meeting transcription: NAIST system for CHiME-8 NOTSOFAR-1 task and its analysis
IF 3.1 CAS Region 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-07-08 DOI: 10.1016/j.csl.2025.101836
Yuta Hirano, Mau Nguyen, Kakeru Azuma, Jan Meyer Saragih, Sakriani Sakti
This paper reports on the NAIST system submitted to the CHiME-8 challenge’s NOTSOFAR-1 (Natural Office Talkers in Settings of Far-field Audio Recordings) task, including results and analyses from several additional experiments. While fast processing is crucial for real-world applications, the CHiME-7 challenge focused solely on reducing error rate, neglecting the practical aspects of system performance such as inference speed. Therefore, this research aims to develop a practical system by improving recognition accuracy while simultaneously reducing inference time. To address this challenge, we propose enhancing the baseline module architecture by modifying both the CSS and ASR modules. Specifically, the ASR module was built based on a WavLM large feature extractor and a Zipformer transducer. Furthermore, we employed reverberation removal using block-wise weighted prediction error (WPE) as preprocessing for the speech separation module. The proposed system achieved a relative reduction in tcpWER of 11.6% for single-channel tracks and 18.7% for multi-channel tracks compared to the baseline system. Moreover, the proposed system operates up to six times faster than the baseline system while achieving superior tcpWER results. We also report on the observed changes in system performance due to variations in the amount of training data for the ASR model, as well as the impact of the maximum word-length setting in the transducer-based ASR module on the subsequent diarization system, based on findings from our system development.
Citations: 0
Gnowsis: Multimodal multitask learning for oral proficiency assessments
IF 3.1 CAS Region 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-07-05 DOI: 10.1016/j.csl.2025.101860
Hiroaki Takatsu, Shungo Suzuki, Masaki Eguchi, Ryuki Matsuura, Mao Saeki, Yoichi Matsuyama
Although oral proficiency assessments are crucial for understanding second language (L2) learners’ progress, they are resource-intensive. Herein we propose a multimodal multitask learning model to assess L2 proficiency levels from multiple aspects on the basis of multimodal dialogue data. To construct the model, we first created a dataset of speech samples collected through oral proficiency interviews between Japanese learners of English and a conversational virtual agent. Expert human raters subsequently categorized the samples into six levels based on the rating scales defined in the Common European Framework of Reference for Languages with respect to proficiency in one holistic and five analytic assessment criteria (vocabulary richness, grammatical accuracy, fluency, goodness of pronunciation, and coherence). The model was trained using this dataset via the multitask learning approach to simultaneously predict the proficiency levels of these language competences from various linguistic features. These features were extracted via multiple encoder modules, which were composed of feature extractors pretrained through various natural language processing tasks such as grammatical error correction, coreference resolution, discourse marker prediction, and pronunciation scoring. In experiments comparing the proposed model to baseline models with a feature extractor pretrained with single-modality (textual or acoustic) features, the proposed model outperformed the baseline models. In particular, the proposed model was robust even with limited training data or short dialogues with a smaller number of topics because it considered rich features.
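A minimal sketch of the multitask scoring head is shown below: a shared layer feeding one classification head per assessment criterion, trained with a summed cross-entropy loss. The fused feature dimension, the six-level label space, and the criterion names are assumptions; the paper's pretrained task-specific encoder modules are not reproduced.

```python
import torch
import torch.nn as nn

CRITERIA = ["holistic", "vocabulary", "grammar", "fluency", "pronunciation", "coherence"]

class MultitaskScorer(nn.Module):
    def __init__(self, feat_dim: int = 512, n_levels: int = 6):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.heads = nn.ModuleDict({c: nn.Linear(256, n_levels) for c in CRITERIA})

    def forward(self, fused_features: torch.Tensor) -> dict:
        h = self.shared(fused_features)              # shared representation of the fused multimodal features
        return {c: head(h) for c, head in self.heads.items()}

model = MultitaskScorer()
logits = model(torch.randn(8, 512))                  # placeholder fused text+audio features for 8 responses
labels = {c: torch.randint(0, 6, (8,)) for c in CRITERIA}
loss = sum(nn.functional.cross_entropy(logits[c], labels[c]) for c in CRITERIA)
loss.backward()
```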
Citations: 0