Children’s keyword spotting (KWS) systems often experience a significant decline in performance when acoustic mismatches occur between training and testing conditions. Though multiple factors contribute to such mismatches, pitch and speaking rate are the two predominant sources. This work proposes a pitch-robust acoustic feature based on the temporal envelope of sub-band signals to develop a children’s KWS system in the zero-resource scenario. To accomplish this, the speech signal is first passed through M non-overlapping band-pass filters arranged on a linear scale to break it down into sub-bands. Then, the temporal envelope of each sub-band signal is estimated by applying the Hilbert transform. The mean values of the estimated envelopes are computed over an analysis frame and logarithmically compressed to yield an M-dimensional feature vector per analysis frame, here termed the logarithmically compressed averaged temporal envelope of sub-band signals (LC-ATESS). The efficacy of the proposed LC-ATESS feature is tested with a deep neural network-hidden Markov model (DNN-HMM) acoustic model. The observed KWS results are superior to those obtained with conventional Mel-frequency cepstral coefficients (MFCC), MFCC computed after spectral smoothing, and features calculated from single-frequency spectra, both with and without data augmentation, across clean and noisy test scenarios.
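As a rough illustration of the front-end described above, the following minimal sketch computes an LC-ATESS-style feature with NumPy and SciPy. The number of bands (M = 30), the 4th-order Butterworth design, the band edges, and the 25 ms / 10 ms framing are illustrative assumptions, not values taken from the paper.

import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def lc_atess(signal, fs, n_bands=30, frame_len=0.025, frame_shift=0.010):
    """Sketch of an LC-ATESS-style feature: log-compressed mean Hilbert
    envelope of M linearly spaced, non-overlapping sub-bands per frame."""
    # Linearly spaced, non-overlapping band edges (assumed; small margins
    # keep the filter design away from 0 Hz and the Nyquist frequency).
    edges = np.linspace(50.0, 0.99 * fs / 2, n_bands + 1)
    flen, fshift = int(frame_len * fs), int(frame_shift * fs)
    n_frames = 1 + (len(signal) - flen) // fshift

    feats = np.zeros((n_frames, n_bands))
    for b in range(n_bands):
        # 4th-order Butterworth band-pass filter for sub-band b (assumed order).
        sos = butter(4, [edges[b], edges[b + 1]], btype='bandpass',
                     fs=fs, output='sos')
        sub_band = sosfiltfilt(sos, signal)
        # Temporal envelope via the magnitude of the analytic signal.
        envelope = np.abs(hilbert(sub_band))
        for t in range(n_frames):
            seg = envelope[t * fshift: t * fshift + flen]
            # Mean envelope per frame, then logarithmic compression.
            feats[t, b] = np.log(np.mean(seg) + 1e-10)
    return feats  # shape: (n_frames, n_bands)

# Example usage on a synthetic one-second signal at 16 kHz.
if __name__ == "__main__":
    fs = 16000
    x = np.random.randn(fs)
    print(lc_atess(x, fs).shape)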
{"title":"Modeling the temporal envelope of sub-band signals for improving the performance of children’s speech recognition system in zero-resource scenario","authors":"Kaustav Das, Biswaranjan Pattanayak, Gayadhar Pradhan","doi":"10.1016/j.csl.2026.101954","DOIUrl":"10.1016/j.csl.2026.101954","url":null,"abstract":"<div><div>Children’s KWS (keyword spotting) systems often experience a significant decline in performance when acoustic mismatches occur between training and testing conditions. Though multiple factors are liable for creating such mismatches, pitch and speaking rate are the two predominant sources of acoustic mismatch. This work proposes a pitch-robust acoustic feature by computing the temporal envelope of sub-band signals to develop a children’s KWS system in the zero-resource scenario. To accomplish this, the speech signal is first passed through <span><math><mi>M</mi></math></span> non-overlapping band-pass filters arranged in a linear scale to break it down into sub-bands. Then, the temporal envelope of each sub-band signal is estimated with the application of the Hilbert transform. The mean values of the estimated envelopes are computed over an analysis frame and logarithmically compressed to yield an <span><math><mi>M</mi></math></span>-dimensional feature vector per analysis frame, here termed the logarithmically compressed averaged temporal envelope of sub-band signals (LC-ATESS). The efficacy of the proposed LC-ATESS feature is tested on the deep neural network-hidden Markov model-based acoustic model. The observed KWS results are superior to conventional Mel-frequency cepstral coefficients (MFCC), MFCC computed after spectral smoothing, and features calculated from single-frequency spectra, both with and without data augmentation, across clean and noisy test scenarios.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"100 ","pages":"Article 101954"},"PeriodicalIF":3.4,"publicationDate":"2026-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175373","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-06. DOI: 10.1016/j.csl.2026.101948
Juan Ignacio Alvarez-Trejos, Sara Barahona, Laura Herrera-Alarcon, Jérémie Touati, Alicia Lozano-Diez
Speaker diarization in broadcast media presents significant challenges due to long-duration recordings, numerous speakers, and complex acoustic conditions. End-to-end neural diarization models like DiaPer (Diarization with Perceiver), which directly predict speaker activity from audio features without intermediate clustering steps, have shown promising results. However, their application to extended recordings remains computationally prohibitive due to quadratic complexity with respect to input length. This paper addresses these limitations by proposing a framework that applies DiaPer to short audio chunks and subsequently reconciles speaker identities across segments using a matching algorithm. We systematically analyze optimal chunk durations for DiaPer processing and introduce an enhanced chunk-matching algorithm leveraging state-of-the-art speaker embeddings, comparing Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network (ECAPA-TDNN), Residual Networks (ResNet), and Reshape Dimensions Network (ReDimNet) architectures. Our experimental evaluation on the challenging Radio Televisión Española (RTVE) datasets shows that ReDimNet embeddings consistently outperform alternatives, achieving substantial improvements in speaker identity consistency across segments. The proposed approach yields a Diarization Error Rate (DER) of 17.34% on the RTVE 2024 test set, which is competitive with state-of-the-art systems while achieving a 63.6% relative improvement over the baseline DiaPer model applied directly to complete audio recordings. This demonstrates that end-to-end neural approaches can be successfully extended to hour-long recordings while maintaining computational efficiency.
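The abstract does not spell out the matching algorithm; the sketch below shows one common way to reconcile per-chunk speaker labels with a global speaker inventory, using cosine similarity between speaker embeddings and the Hungarian assignment from SciPy. The threshold for spawning a new global speaker and the embedding dimension are assumed parameters, not the authors' settings.

import numpy as np
from scipy.optimize import linear_sum_assignment

def match_chunk_speakers(global_embs, chunk_embs, new_speaker_threshold=0.5):
    """Map local speaker embeddings of one chunk onto a global inventory.

    global_embs: list of (d,) arrays, one centroid per global speaker.
    chunk_embs:  (k, d) array, one embedding per local speaker in the chunk.
    Returns global speaker indices, growing the inventory when a local
    speaker matches nothing well enough (threshold is an assumption).
    """
    def unit(v):
        return v / (np.linalg.norm(v) + 1e-8)

    chunk_embs = np.array([unit(e) for e in chunk_embs])
    if not global_embs:
        global_embs.extend(chunk_embs)
        return list(range(len(chunk_embs)))

    glob = np.array([unit(e) for e in global_embs])
    sim = chunk_embs @ glob.T                      # cosine similarities
    rows, cols = linear_sum_assignment(-sim)       # maximize total similarity
    assignment = [-1] * len(chunk_embs)
    for r, c in zip(rows, cols):
        if sim[r, c] >= new_speaker_threshold:
            assignment[r] = c
    for r in range(len(chunk_embs)):
        if assignment[r] == -1:                    # unmatched: new global speaker
            global_embs.append(chunk_embs[r])
            assignment[r] = len(global_embs) - 1
    return assignment

# Toy usage: a growing inventory across two chunks with random 192-dim embeddings.
inventory = []
print(match_chunk_speakers(inventory, np.random.randn(2, 192)))
print(match_chunk_speakers(inventory, np.random.randn(3, 192)))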
{"title":"On the use of DiaPer models and matching algorithm for RTVE speaker diarization 2024 dataset","authors":"Juan Ignacio Alvarez-Trejos , Sara Barahona , Laura Herrera-Alarcon , Jérémie Touati , Alicia Lozano-Diez","doi":"10.1016/j.csl.2026.101948","DOIUrl":"10.1016/j.csl.2026.101948","url":null,"abstract":"<div><div>Speaker diarization in broadcast media presents significant challenges due to long-duration recordings, numerous speakers, and complex acoustic conditions. End-to-end neural diarization models like DiaPer (Diarization with Perceiver), which directly predict speaker activity from audio features without intermediate clustering steps, have shown promising results. However, their application to extended recordings remains computationally prohibitive due to quadratic complexity with respect to input length. This paper addresses these limitations by proposing a framework that applies DiaPer to short audio chunks and subsequently reconciles speaker identities across segments using a matching algorithm. We systematically analyze optimal chunk durations for DiaPer processing and introduce an enhanced chunk-matching algorithm leveraging state-of-the-art speaker embeddings, comparing Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network (ECAPA-TDNN), Residual Networks (ResNet), and Reshape Dimensions Network (ReDimNet) architectures. Our experimental evaluation on the challenging <em>Radio Televisión Española</em> (RTVE) datasets shows that ReDimNet embeddings consistently outperform alternatives, achieving substantial improvements in speaker identity consistency across segments. The proposed approach yields a Diarization Error Rate (DER) of 17.34% on the RTVE 2024 test set, which is competitive with state-of-the-art systems while achieving a 63.6% relative improvement over the baseline DiaPer model applied directly to complete audio recordings. This demonstrates that end-to-end neural approaches can be successfully extended to hour-long recordings while maintaining computational efficiency.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"99 ","pages":"Article 101948"},"PeriodicalIF":3.4,"publicationDate":"2026-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146173545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In common categorical speech emotion recognition (SER) tasks, the emotion corpora used often provide ground-truth labels at the utterance level rather than the segment level. However, such coarse-grained labeling relies on the assumption that emotion expression is uniformly distributed within an utterance, which is inadequate for characterizing human emotional ambiguity in real scenarios. To alleviate this issue, this work proposes two-stage multiple instance learning (MIL) networks equipped with attention-based hybrid aggregation for SER. From the viewpoint of MIL, an utterance is considered a bag and divided into segments, each of which is taken as an instance. Each utterance is then processed in two stages: a segment-level acoustic feature encoder in stage-1 and a MIL-based hybrid aggregator in stage-2. In particular, in stage-1, multi-level acoustic features are encoded for each segment, and a cross-attention mechanism is employed to perform feature enhancement and fusion. In stage-2, a MIL-based hybrid aggregator, consisting of adaptive aggregation, instance selection, and attention-based aggregation, is designed to obtain the final utterance-level results. The proposed method is evaluated on the public IEMOCAP and MELD datasets, and experimental results demonstrate its effectiveness.
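The exact aggregator is not specified beyond the abstract; as a point of reference, the sketch below shows a generic attention-based MIL pooling layer in PyTorch, in which segment (instance) embeddings are combined into a single utterance (bag) representation with learned attention weights. The dimensions and the linear classifier head are illustrative assumptions, not the authors' architecture.

import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    """Generic attention-based MIL aggregator: a small MLP scores each
    instance, softmax turns scores into weights, and the bag embedding is
    the weighted sum of instance embeddings."""

    def __init__(self, dim=256, hidden=128, n_classes=4):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, instances):           # (batch, n_segments, dim)
        scores = self.attention(instances)  # (batch, n_segments, 1)
        weights = torch.softmax(scores, dim=1)
        bag = (weights * instances).sum(dim=1)   # (batch, dim)
        return self.classifier(bag), weights.squeeze(-1)

# Example: 8 utterances, each split into 12 segments of 256-dim features.
segments = torch.randn(8, 12, 256)
logits, attn = AttentionMILPooling()(segments)
print(logits.shape, attn.shape)   # torch.Size([8, 4]) torch.Size([8, 12])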
{"title":"Two-stage multiple instance learning networks with attention-based hybrid aggregation for speech emotion recognition","authors":"Shiqing Zhang, Chen Chen, Dandan Wang, Xin Tao, Xiaoming Zhao","doi":"10.1016/j.csl.2026.101946","DOIUrl":"10.1016/j.csl.2026.101946","url":null,"abstract":"<div><div>In common categorical speech emotion recognition (SER) tasks, the used emotion corpora often provide ground truth labels at utterance-level rather than segment-level. However, such coarse-grained labeling approaches rely on an assumption that emotion expression in an utterance is uniformly distributed, which is inappropriate to characterize human emotional ambiguity in real scenarios. To alleviate this issue, this work proposes two-stage multiple instance learning (MIL) networks equipped with attention-based hybrid aggregation for SER. From the viewpoint of MIL, an utterance is considered as a bag, and divided into certain segments, each of which is taken as an instance. Each instance is then processed with two stages: segment-level acoustic feature encoder in stage-1, and MIL-based hybrid aggregator in stage-2. In particular, in stage-1 multiple-level acoustic features are encoded for each divided segment, and then a cross-attention mechanism is employed to perform feature enhancement and fusion. In stage-2, a MIL-based hybrid aggregator, consisting of adaptive aggregation, instance selection and attention-based aggregation, is designed to obtain final utterance-level results. The proposed method is evaluated on the public IEMOCAP and MELD datasets, and experimental results demonstrate the effectiveness of the proposed method.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"99 ","pages":"Article 101946"},"PeriodicalIF":3.4,"publicationDate":"2026-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146172980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-02. DOI: 10.1016/j.csl.2026.101950
Nikolaos Malamas, Andreas L. Symeonidis, John B. Theocharis
Question Answering (QA) and Content Retrieval (CR) systems have experienced a boost in performance in recent years by leveraging state-of-the-art Transformer models to process user expressions and to retrieve and extract the requested information. Despite the constant improvements in language understanding, very little effort has been put into the design of such systems for personal desktop use, where data are kept locally rather than sent to cloud services, and where decisions and outputs are transparent and explainable to the user. To that end, we present QuAVA, a conversational desktop content retrieval assistant designed on four pillars: privacy and security, explainability, low-resource requirements, and multi-source data fusion. QuAVA is a data- and privacy-preserving assistant that enables users to access their private data, such as files, emails, and message exchanges, conversationally and transparently. The proposed architecture automatically extracts and preprocesses content from various sources and organizes it in a three-layered hierarchical structure, namely a topic, a subtopic, and a content layer, by employing ML algorithms for clustering and labeling. This way, users can navigate and access information via a set of conversation rules embedded in the assistant. We conduct a qualitative comparison of the QuAVA architecture with other well-established QA and CR architectures against the four pillars defined, as well as privacy tests, and conclude that QuAVA is the only virtual assistant, to our knowledge, that successfully satisfies them.
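The abstract mentions ML-based clustering and labeling into topic, subtopic, and content layers without naming specific algorithms; the minimal two-level sketch below, using scikit-learn TF-IDF features with nested K-Means and arbitrarily chosen cluster counts, only illustrates the general idea and is not the QuAVA implementation.

from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def build_hierarchy(documents, n_topics=3, n_subtopics=2):
    """Two-level topic/subtopic hierarchy over raw text documents.
    Algorithm choice (TF-IDF + nested K-Means) and cluster counts are
    assumptions for illustration only."""
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(documents)

    topics = KMeans(n_clusters=n_topics, n_init=10, random_state=0).fit_predict(X)
    hierarchy = defaultdict(lambda: defaultdict(list))
    for t in range(n_topics):
        idx = [i for i, lab in enumerate(topics) if lab == t]
        if len(idx) < n_subtopics:          # too few documents: single subtopic
            hierarchy[t][0].extend(idx)
            continue
        sub = KMeans(n_clusters=n_subtopics, n_init=10,
                     random_state=0).fit_predict(X[idx])
        for doc_i, s in zip(idx, sub):
            hierarchy[t][s].append(doc_i)   # content layer: document indices
    return {t: dict(subs) for t, subs in hierarchy.items()}

docs = ["meeting notes budget", "budget spreadsheet email", "trip photos",
        "photo album vacation", "project report draft", "draft email reply"]
print(build_hierarchy(docs))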
{"title":"QuAVA: A privacy-aware architecture for conversational desktop Content Retrieval systems","authors":"Nikolaos Malamas , Andreas L. Symeonidis , John B. Theocharis","doi":"10.1016/j.csl.2026.101950","DOIUrl":"10.1016/j.csl.2026.101950","url":null,"abstract":"<div><div>Question Answering (QA) and Content Retrieval (CR) systems have experienced a boost in performance in recent years leveraging state-of-the-art Transformer models to process user expressions and retrieve and extract information requested. Despite the constant language understanding improvements, very little effort has been put into the design of such systems for personal desktop use, where data are kept locally and are not sent to cloud services and decisions and outputs are transparent and explainable to the user. To that end, we present QuAVA, a conversational desktop content retrieval assistant, designed on four pillars: privacy and security, explainability, low-resource requirements, and multi-source data fusion. QuAVA is a data and privacy-preserving assistant that enables users to access their private data such as files, emails, and message exchanges, conversationally and transparently. The proposed architecture automatically extracts and preprocesses content from various sources and organizes it in a 3-layered hierarchical structure, namely a topic, a subtopic, and a content layer by employing ML algorithms for clustering and labeling. This way, users can navigate and access information via a set of conversation rules embedded in the assistant. We conduct a qualitative comparison analysis of the QuAVA architecture with other well-established QA and CR architectures against the four pillars defined, as well as privacy tests, and conclude that QuAVA is the only – to our knowledge – virtual assistant that successfully satisfies them.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"99 ","pages":"Article 101950"},"PeriodicalIF":3.4,"publicationDate":"2026-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146173544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-29. DOI: 10.1016/j.csl.2026.101951
Wenzhe Jia, Yuhang Wang, Yahui Kang
Depression detection from multimodal data is crucial for early intervention and mental health monitoring. Existing systems, however, face three challenges: (i) capturing subtle affective cues that distinguish depressive states from normal emotional variations, (ii) establishing reliable correspondence between heterogeneous speech and text modalities, and (iii) handling severe class imbalance in real-world corpora. To address these challenges, we propose a framework that integrates explicit emotion supervision, cross-modal alignment, and metric-oriented optimization for robust multimodal depression detection. Acoustic and lexical features are augmented with emotion-category embeddings derived from supervision signals to provide affective context, while semantic correspondence is reinforced through a contrastive alignment objective. To mitigate imbalance, we directly optimize macro-F1 with the Lovász loss. On the Emotional Audio-Textual Depression Corpus (EATD-Corpus), our framework achieves 87.40% ± 0.46% macro-F1 with dataset-provided emotions and 83.15% with predicted emotions, compared to 71.82% without emotion information. Cross-dataset evaluation on the Distress Analysis Interview Corpus – Wizard of Oz (DAIC-WOZ) shows consistent gains, including a 12.34% F1 improvement with emotion augmentation. This integrated approach, combining emotion supervision, cross-modal alignment, and metric-oriented optimization, represents a novel contribution to depression detection. Our framework provides a practical and robust solution for real-world multimodal depression detection.
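The paper's exact alignment objective is not given in the abstract; the sketch below shows a standard symmetric contrastive (InfoNCE-style) loss between paired speech and text embeddings, which is one common way to implement cross-modal alignment. The temperature value and embedding dimension are assumptions.

import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(speech_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style alignment between paired modalities.
    speech_emb, text_emb: (batch, dim) embeddings of the same utterances.
    Matching pairs sit on the diagonal of the similarity matrix; the loss
    pulls them together and pushes mismatched pairs apart."""
    speech = F.normalize(speech_emb, dim=-1)
    text = F.normalize(text_emb, dim=-1)
    logits = speech @ text.t() / temperature         # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_s2t = F.cross_entropy(logits, targets)      # speech -> text
    loss_t2s = F.cross_entropy(logits.t(), targets)  # text -> speech
    return 0.5 * (loss_s2t + loss_t2s)

# Example with random 128-dimensional embeddings for a batch of 16 pairs.
print(cross_modal_contrastive_loss(torch.randn(16, 128), torch.randn(16, 128)))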
{"title":"Emotion-guided cross-modal alignment for multimodal depression detection","authors":"Wenzhe Jia , Yuhang Wang , Yahui Kang","doi":"10.1016/j.csl.2026.101951","DOIUrl":"10.1016/j.csl.2026.101951","url":null,"abstract":"<div><div>Depression detection from multimodal data is crucial for early intervention and mental health monitoring. Existing systems, however, face three challenges: (i) capturing subtle affective cues that distinguish depressive states from normal emotional variations, (ii) establishing reliable correspondence between heterogeneous speech and text modalities, and (iii) handling severe class imbalance in real-world corpora. To address these challenges, we propose a framework that integrates explicit emotion supervision, cross-modal alignment, and metric-oriented optimization for robust multimodal depression detection. Acoustic and lexical features are augmented with emotion-category embeddings derived from supervision signals to provide affective context, while semantic correspondence is reinforced through a contrastive alignment objective. To mitigate imbalance, we directly optimize macro-F1 with the Lovász loss. On the Emotional Audio-Textual Depression Corpus (EATD-Corpus), our framework achieves 87.40% <span><math><mo>±</mo></math></span> 0.46% macro-F1 with dataset-provided emotions and 83.15% with predicted emotions, compared to 71.82% without emotion information. Cross-dataset evaluation on the Distress Analysis Interview Corpus – Wizard of Oz (DAIC-WOZ) shows consistent gains, including a 12.34% F1 improvement with emotion augmentation. This integrated approach—combining emotion supervision, cross-modal alignment, and metric-oriented optimization—represents a novel contribution to depression detection. Our framework provides a practical and robust solution for real-world multimodal depression detection.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"99 ","pages":"Article 101951"},"PeriodicalIF":3.4,"publicationDate":"2026-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146077583","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-29. DOI: 10.1016/j.csl.2026.101949
Yujiang Liu, Lijun Fu, Xiaojun Xia
With the development of AI technology, and especially since the emergence of large language models, the responses of medical chatbots have become more accurate and reasonable than before. However, due to the high cost of data annotation and of the hardware required for training or fine-tuning on domain-specific data, it is difficult for researchers or physicians to train appropriate models for medical consultation. In this paper, we propose a new framework to address this problem in medical dialogue generation. It is a vector-level optimization scheme in which different strategies are used during the training and testing stages. In the training stage, the original response and medical-related words are supervised by two LLMs, which are treated as a twin network. In the testing stage, we combine their hidden states to obtain the fused output of the response. Extensive experiments show that our framework is effective and achieves performance improvements on five medical chat datasets. Thus, we provide new research ideas for medical chatbots.
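The abstract describes combining the hidden states of two LLMs at test time without further detail; the sketch below is a simplified stand-in that mixes the final next-token logits of two causal language models during greedy decoding rather than their internal hidden states. The "gpt2" checkpoint is only a placeholder for the two fine-tuned models, and the fusion weight alpha is an assumed hyperparameter.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model_a = AutoModelForCausalLM.from_pretrained("gpt2").eval()
model_b = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def fused_generate(prompt, max_new_tokens=20, alpha=0.5):
    """Greedy decoding that mixes the next-token distributions of two models
    at every step (a simplification of the hidden-state fusion in the paper)."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        logits_a = model_a(ids).logits[:, -1, :]
        logits_b = model_b(ids).logits[:, -1, :]
        fused = alpha * logits_a + (1.0 - alpha) * logits_b
        next_id = fused.argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
    return tokenizer.decode(ids[0], skip_special_tokens=True)

print(fused_generate("Patient: I have a persistent cough. Doctor:"))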
{"title":"Medical related word enhancement framework: A new method for large language model in medical dialogue generation","authors":"Yujiang Liu , Lijun Fu , Xiaojun Xia","doi":"10.1016/j.csl.2026.101949","DOIUrl":"10.1016/j.csl.2026.101949","url":null,"abstract":"<div><div>With the development of AI technology, especially after the emergence of large language models, the response of the medical chatbot is more accurate and reasonable than before. However, due to the high cost of data annotation and hardware for training or fine-tuning specific data, it is difficult for researchers or physicians to train appropriate models for medical consultation. In this paper, we propose a new framework to solve this problem for medical dialogue generation. It is a vector level optimization scheme that we use different strategies during the training and testing stages. In the training stage, the original response and medical related words are supervised by two LLMs, which are considered as a twin network. While in the testing stage, we combine the hidden states of them to get the fusion output of the response. A large number of experiments show that our framework is effective and achieves performance improvement on five medical chat datasets. Thus, we provide new research ideas for medical chatbots.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"99 ","pages":"Article 101949"},"PeriodicalIF":3.4,"publicationDate":"2026-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146173546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Current text-to-speech (TTS) systems are capable of learning the phonetics of a language accurately, provided that the speech data used to train such models cover all phonetic phenomena. For languages with different varieties, this includes all their richness and accents. This is the case of Catalan, a mid-resourced language with several dialects or accents. Although there are various publicly available corpora, there is a lack of high-quality open-access data for speech technologies covering its variety of accents. Common Voice includes recordings of Catalan speakers from different regions; however, accent labeling has been shown to be inaccurate, and artificially enhanced samples may be unsuitable for TTS. To address these limitations, we present LaFresCat, the first studio-quality Catalan multi-accent dataset. LaFresCat comprises 3.5 h of professionally recorded speech covering four of the most prominent Catalan accents: Balearic, Central, North-Western, and Valencian. In this work, we provide a detailed description of the dataset design: utterances were selected to be phonetically balanced, detailed speaker instructions were provided, native speakers from the regions corresponding to the Catalan accents were hired, and the recordings were formatted and post-processed. The resulting dataset, LaFresCat, is publicly available. To preliminarily evaluate the dataset, we trained and assessed a lightweight flow-based TTS system, which is also provided as a by-product. We also analyzed LaFresCat samples and the corresponding TTS-generated samples at the phonetic level, employing expert annotations and Pillai scores to quantify acoustic vowel overlap. Preliminary results suggest a significant improvement in predicted mean opinion score (UTMOS), with an increase of 0.42 points when the TTS system is fine-tuned on LaFresCat rather than trained from scratch, starting from a pre-trained version based on Central Catalan data from Common Voice. Subsequent human expert annotations achieved nearly 90% accuracy in accent classification for LaFresCat recordings. However, although the TTS tends to homogenize pronunciation, it still learns distinct dialectal patterns. This assessment offers key insights for establishing a baseline to guide future evaluations of Catalan multi-accent TTS systems and further studies of LaFresCat.
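Pillai scores are mentioned as the measure of acoustic vowel overlap; as background, the sketch below computes the Pillai trace for a one-way MANOVA directly from the between-group (H) and within-group (E) scatter matrices over two-dimensional (F1, F2) formant measurements, using V = tr(H (H + E)^-1). The synthetic formant values are purely illustrative and not drawn from LaFresCat.

import numpy as np

def pillai_trace(groups):
    """Pillai's trace V = tr(H @ inv(H + E)) for a one-way MANOVA, where H is
    the between-group and E the within-group SSCP matrix. With two groups and
    two variables, V ranges from 0 (full overlap) to 1 (full separation)."""
    data = np.vstack(groups)
    grand_mean = data.mean(axis=0)
    p = data.shape[1]
    H = np.zeros((p, p))
    E = np.zeros((p, p))
    for g in groups:
        diff = g.mean(axis=0) - grand_mean
        H += len(g) * np.outer(diff, diff)
        centered = g - g.mean(axis=0)
        E += centered.T @ centered
    return np.trace(H @ np.linalg.inv(H + E))

# Synthetic (F1, F2) formants in Hz for two vowel categories (illustrative).
rng = np.random.default_rng(0)
vowel_a = rng.normal([700, 1200], [60, 90], size=(50, 2))
vowel_e = rng.normal([500, 1800], [60, 90], size=(50, 2))
print(round(pillai_trace([vowel_a, vowel_e]), 3))   # close to 1: little overlap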
{"title":"LaFresCat: A studio-quality Catalan multi-accent speech dataset for text-to-speech synthesis","authors":"Alex Peiró-Lilja , Carme Armentano-Oller , José Giraldo , Wendy Elvira-García , Ignasi Esquerra , Rodolfo Zevallos , Cristina España-Bonet , Martí Llopart-Font , Baybars Külebi , Mireia Farrús","doi":"10.1016/j.csl.2026.101945","DOIUrl":"10.1016/j.csl.2026.101945","url":null,"abstract":"<div><div>Current text-to-speech (TTS) systems are capable of learning the phonetics of a language accurately given that the speech data used to train such models covers all phonetic phenomena. For languages with different varieties, this includes all their richness and accents. This is the case of Catalan, a mid-resourced language with several dialects or accents. Although there are various publicly available corpora, there is a lack of high-quality open-access data for speech technologies covering its variety of accents. Common Voice includes recordings of Catalan speakers from different regions; however, accent labeling has been shown to be inaccurate, and artificially enhanced samples may be unsuitable for TTS. To address these limitations, we present LaFresCat, the first studio-quality Catalan multi-accent dataset. LaFresCat comprises 3.5 h of professionally recording speech covering four of the most prominent Catalan accents: Balearic, Central, North-Western, and Valencian. In this work, we provide a detailed description of the dataset design: utterances were selected to be phonetically balanced, detailed speaker instructions were provided, native speakers from the regions corresponding to the Catalan accents were hired, and the recordings were formatted and post-processed. The resulting dataset, LaFresCat, is publicly available. To preliminarily evaluate the dataset, we trained and assessed a lightweight flow-based TTS system, which is also provided as a by-product. We also analyzed LaFresCat samples and the corresponding TTS-generated samples at the phonetic level, employing expert annotations and Pillai scores to quantify acoustic vowel overlap. Preliminary results suggest a significant improvement in predicted mean opinion score (UTMOS), with an increase of 0.42 points when the TTS system is fine-tuned on LaFresCat rather than trained from scratch, starting from a pre-trained version based on Central Catalan data from Common Voice. Subsequent human expert annotations achieved nearly 90% accuracy in accent classification for LaFresCat recordings. However, although the TTS tends to homogenize pronunciation, it still learns distinct dialectal patterns. This assessment offers key insights for establishing a baseline to guide future evaluations of Catalan multi-accent TTS systems and further studies of LaFresCat.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"100 ","pages":"Article 101945"},"PeriodicalIF":3.4,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146147489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-16. DOI: 10.1016/j.csl.2026.101944
Ladislav Mošner, Oldřich Plchot, Lukáš Burget, Chunlei Zhang, Jan Černocký, Meng Yu
Multi-channel speaker verification (SV), employing numerous microphones for capturing enrollment and/or test recordings, has gained attention for its benefits in far-field scenarios. While some studies approach the problem by designing multi-channel embedding extractors, we focus on building and thoroughly analyzing a framework that integrates beamforming pre-processing with single-channel embedding extraction. This strategy benefits from accommodating both multi-channel and single-channel inputs. Furthermore, it provides a human-interpretable intermediate output – enhanced speech – that can be independently evaluated and related to SV performance. We first focus on the front-end, taking advantage of deep-learning source separation for the direct or indirect mask estimation required by the beamformer. We alternate single-channel network architectures, subsequently extended to multi-channel ones by reference channel attention (RCA). We also analyze the impact of beamformer and network output fusion. Finally, we show improvements brought by end-to-end fine-tuning of the entire architecture, facilitated by our newly designed multi-channel corpus, MultiSV2, which extends our previous MultiSV dataset.
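As context for a mask-driven beamforming front-end, the sketch below shows a textbook mask-based MVDR beamformer in NumPy: time-frequency masks weight the spatial covariance estimates, the steering vector is taken as the principal eigenvector of the speech covariance, and the MVDR weights are applied per frequency bin. This is a generic formulation, not the exact front-end of the paper, and the tensor shapes and diagonal loading are assumptions.

import numpy as np

def mvdr_from_masks(stft, speech_mask, noise_mask, eps=1e-8):
    """Mask-based MVDR beamforming on a multi-channel STFT.

    stft:        complex array of shape (channels, frames, freqs)
    speech_mask: real array of shape (frames, freqs) in [0, 1]
    noise_mask:  real array of shape (frames, freqs) in [0, 1]
    Returns the beamformed single-channel STFT of shape (frames, freqs).
    """
    C, T, F = stft.shape
    out = np.zeros((T, F), dtype=complex)
    for f in range(F):
        X = stft[:, :, f]                                  # (C, T)
        # Mask-weighted spatial covariance matrices.
        phi_s = (speech_mask[:, f] * X) @ X.conj().T / (speech_mask[:, f].sum() + eps)
        phi_n = (noise_mask[:, f] * X) @ X.conj().T / (noise_mask[:, f].sum() + eps)
        phi_n += eps * np.eye(C)                           # diagonal loading
        # Steering vector: principal eigenvector of the speech covariance.
        eigvals, eigvecs = np.linalg.eigh(phi_s)
        d = eigvecs[:, -1]
        # MVDR weights: w = (phi_n^-1 d) / (d^H phi_n^-1 d).
        num = np.linalg.solve(phi_n, d)
        w = num / (d.conj() @ num + eps)
        out[:, f] = w.conj() @ X
    return out

# Toy example: 4 channels, 100 frames, 257 frequency bins, random masks.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 100, 257)) + 1j * rng.standard_normal((4, 100, 257))
m = rng.uniform(size=(100, 257))
print(mvdr_from_masks(X, m, 1.0 - m).shape)    # (100, 257)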
{"title":"Trainable multi-channel front-ends for joint beamforming and speaker embedding extraction","authors":"Ladislav Mošner , Oldřich Plchot , Lukáš Burget , Chunlei Zhang , Jan Černocký , Meng Yu","doi":"10.1016/j.csl.2026.101944","DOIUrl":"10.1016/j.csl.2026.101944","url":null,"abstract":"<div><div>Multi-channel speaker verification (SV), employing numerous microphones for capturing enrollment and/or test recordings, gained attention for its benefits in far-field scenarios. While some studies approach the problem by designing multi-channel embedding extractors, we focus on building and thoroughly analyzing a framework integrating beamforming pre-processing paired with single-channel embedding extraction. This strategy benefits from accommodating both multi-channel and single-channel inputs. Furthermore, it provides human-interpretable intermediate output – enhanced speech – that can be independently evaluated and related to SV performance. We first focus on the front-end, taking advantage of deep-learning source separation for direct or indirect mask estimation required by the beamformer. We alternate single-channel network architectures, subsequently extended to multi-channel ones by reference channel attention (RCA). We also analyze the impact of beamformer and network output fusion. Finally, we show improvements brought by end-to-end fine-tuning the entire architecture facilitated by our newly designed multi-channel corpus, MultiSV2, extending our previous MultiSV dataset.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"99 ","pages":"Article 101944"},"PeriodicalIF":3.4,"publicationDate":"2026-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146022541","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-13. DOI: 10.1016/j.csl.2026.101938
Myeong-Ha Hwang, Jikang Shin, Junseong Bang
While voice interaction facilitates hands-free control for Robotic Process Automation (RPA), real-world deployment faces significant challenges regarding robustness to ASR errors, reliable context tracking, and safeguards against unsafe execution. To address these challenges, we propose V-APA, a voice-driven agentic spoken-dialogue system that automates administrative workflows through policy-driven orchestration, selecting dialogue actions online rather than following fixed, hand-crafted rule flows. The system incorporates three primary robustness and safety mechanisms: N-best ASR hypothesis fusion to mitigate recognition noise, Dialogue State Tracking (DST) for persistent context preservation across turns, and risk-aware confirmation gates to prevent high-impact mis-executions. V-APA is implemented using a practical, reproducible stack featuring Whisper-family ASR, a transformer-based intent ensemble (BERT, RoBERTa, T5), rule-based slot extraction, and LangGraph for dynamic multi-step orchestration. Out-of-scope requests are handled by an optional open-weight LLM fallback based on the Llama-3-8B architecture. Evaluated on 400 spoken task scenarios using a calibrated per-module latency model, the proposed system significantly improves reliability and safety while maintaining an interactive turn-level latency of approximately 0.5 s.
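Details of the fusion and gating logic are not given in the abstract; the sketch below illustrates one plausible shape for them: intent predictions over the N-best ASR hypotheses are combined by confidence-weighted voting, and intents on an assumed high-risk list require explicit confirmation before execution. All names, intents, thresholds, and the dummy classifier here are hypothetical.

from collections import defaultdict

HIGH_RISK_INTENTS = {"delete_record", "send_payment"}   # assumed risk policy

def fuse_nbest_intents(nbest, classify):
    """Confidence-weighted vote over intent predictions for N-best hypotheses.

    nbest:    list of (transcript, asr_confidence) pairs.
    classify: callable mapping a transcript to (intent, intent_confidence);
              here it stands in for a transformer-based intent ensemble.
    """
    scores = defaultdict(float)
    for transcript, asr_conf in nbest:
        intent, intent_conf = classify(transcript)
        scores[intent] += asr_conf * intent_conf
    return max(scores, key=scores.get)

def execute_with_gate(intent, confirm):
    """Risk-aware confirmation gate: high-impact intents need explicit approval."""
    if intent in HIGH_RISK_INTENTS and not confirm(intent):
        return f"Cancelled: {intent} requires confirmation."
    return f"Executing: {intent}"

# Toy run with a dummy classifier and an auto-decline confirmation callback.
dummy_classify = lambda text: ("send_payment" if "pay" in text else "open_file", 0.9)
nbest = [("pay the invoice", 0.8), ("play the invoice", 0.15), ("pay the in voice", 0.05)]
intent = fuse_nbest_intents(nbest, dummy_classify)
print(intent, "->", execute_with_gate(intent, confirm=lambda i: False))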
{"title":"V-APA: A Voice-driven Agentic Process Automation System","authors":"Myeong-Ha Hwang , Jikang Shin , Junseong Bang","doi":"10.1016/j.csl.2026.101938","DOIUrl":"10.1016/j.csl.2026.101938","url":null,"abstract":"<div><div>While voice interaction facilitates hands-free control for Robotic Process Automation (RPA), real-world deployment faces significant challenges regarding robustness to ASR errors, reliable context tracking, and safeguards against unsafe execution. To address these, we propose V-APA, a voice-driven agentic spoken-dialogue system that automates administrative workflows through policy-driven orchestration, selecting dialogue actions online rather than following fixed, hand-crafted rule flows. The system incorporates three primary robustness and safety mechanisms: N-best ASR hypothesis fusion to mitigate recognition noise, Dialogue State Tracking (DST) for persistent context preservation across turns, and risk-aware confirmation gates to prevent high-impact mis-executions. V-APA is implemented using a practical, reproducible stack featuring Whisper-family ASR, a transformer-based intent ensemble (BERT, RoBERTa, T5), rule-based slot extraction, and LangGraph for dynamic multi-step orchestration. Out-of-scope requests are handled by an optional open-weight LLM fallback based on the Llama-3-8B architecture. Evaluated on 400 spoken task scenarios using a calibrated per-module latency model, results demonstrate that the proposed system significantly improves reliability and safety while maintaining an interactive turn-level latency of approximately 0.5 s.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"99 ","pages":"Article 101938"},"PeriodicalIF":3.4,"publicationDate":"2026-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146022542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-11. DOI: 10.1016/j.csl.2026.101940
Da-Hee Yang, Joon-Hyuk Chang
This study presents a hybrid speech restoration framework that integrates predictive-guided conditioning into a diffusion-based generative model to address complex distortions, including noise, reverberation, and bandwidth reduction. The proposed method employs the outputs of a predictive model to guide the diffusion process, enabling more accurate reconstruction under challenging acoustic conditions. Furthermore, during the final sampling stage, the outputs of the predictive and generative models are fused with a tunable ratio, balancing signal fidelity and perceptual naturalness. Experimental results demonstrate that the proposed approach significantly improves objective restoration metrics compared to conventional diffusion baselines. However, the perceptual quality varies with the fusion ratio, revealing a trade-off between objective gains and subjective preference. These findings highlight the potential of predictive-guided conditioning for robust speech restoration and provide insights into optimizing the balance between predictive and generative contributions.
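The abstract describes fusing the predictive and generative outputs at the final sampling stage with a tunable ratio; a minimal sketch of that fusion step, operating on waveforms (or spectrograms) as NumPy arrays with the ratio treated as an assumed hyperparameter, is given below.

import numpy as np

def fuse_outputs(predictive_out, generative_out, ratio=0.5):
    """Blend the predictive and diffusion (generative) estimates with a tunable ratio.

    ratio = 1.0 keeps only the predictive output (favoring signal fidelity),
    ratio = 0.0 keeps only the diffusion output (favoring naturalness);
    intermediate values trade the two off, as discussed in the abstract.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    return ratio * predictive_out + (1.0 - ratio) * generative_out

# Toy example on two random one-second "waveforms" at 16 kHz.
x_pred = np.random.randn(16000)
x_gen = np.random.randn(16000)
print(fuse_outputs(x_pred, x_gen, ratio=0.7).shape)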
{"title":"An experimental study of diffusion-based general speech restoration with predictive-guided conditioning","authors":"Da-Hee Yang, Joon-Hyuk Chang","doi":"10.1016/j.csl.2026.101940","DOIUrl":"10.1016/j.csl.2026.101940","url":null,"abstract":"<div><div>This study presents a hybrid speech restoration framework that integrates predictive-guided conditioning into a diffusion-based generative model to address complex distortions, including noise, reverberation, and bandwidth reduction. The proposed method employs the outputs of a predictive model to guide the diffusion process, enabling more accurate reconstruction under challenging acoustic conditions. Furthermore, during the final sampling stage, the outputs of the predictive and generative models are fused with a tunable ratio, balancing signal fidelity and perceptual naturalness. Experimental results demonstrate that the proposed approach significantly improves objective restoration metrics compared to conventional diffusion baselines. However, the perceptual quality varies with the fusion ratio, revealing a trade-off between objective gains and subjective preference. These findings highlight the potential of predictive-guided conditioning for robust speech restoration and provide insights into optimizing the balance between predictive and generative contributions.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"99 ","pages":"Article 101940"},"PeriodicalIF":3.4,"publicationDate":"2026-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146022543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}