
Speech Communication — Latest Publications

Do all features matter? Layer-wise feature probing of self-supervised speech models for dysarthria severity classification
IF 3 | CAS Tier 3 (Computer Science) | Q2 ACOUSTICS | Pub Date: 2025-11-01 | DOI: 10.1016/j.specom.2025.103326
Paban Sapkota , Harsh Srivastava , Hemant Kumar Kathania , Shrikanth Narayanan , Sudarsana Reddy Kadiri
Estimating the severity of dysarthria, a speech disorder arising from neurological conditions, is important in medicine. It helps with diagnosis, early detection, and personalized treatment. Significant progress has been made in leveraging self-supervised learning (SSL) models as feature extractors for various classification tasks, demonstrating their effectiveness. Building on this, this paper examines whether using all features extracted from SSL models is necessary for optimal dysarthria severity classification from speech. We focused on layer-wise feature analysis of one base model, Wav2Vec2-base, and four large models, Wav2Vec2-large, HuBERT-large, Data2Vec-large, and WavLM-large, using a Convolutional Neural Network (CNN) as the classifier, with mel-frequency cepstral coefficient (MFCC) features as the baseline. Experiments showed that the later transformer layers of the SSL models were more effective for dysarthria severity classification than the earlier layers, because the later layers better capture articulation and the complex temporal patterns refined from the mid layers. More specifically, analysis revealed that embeddings from transformer encoder layer 23 of HuBERT-large yielded the best performance among all models examined, possibly due to HuBERT's hierarchical learning from unsupervised clustering. To further assess whether all dimensions are important, we examined the impact of varying feature dimensions. Our findings indicated that reducing the dimensionality from 1024 to 32 led to further improvements in accuracy, indicating that not all features are necessary for effective severity classification. Additionally, feature fusion was conducted using the optimal reduced dimensions from the best-performing layer combined with varying dimensions of the MFCC features, resulting in further improvement in performance. The highest accuracy of 70.44% was achieved by combining 32 selected dimensions from the HuBERT-large model with 21 MFCC feature dimensions.
The feature fusion of HuBERT-large (32) and MFCC (21) outperformed the HuBERT-large baseline by 6.36% and the MFCC baseline by 15.28% in absolute terms. Furthermore, combining the fused features with handcrafted features from the articulatory, prosodic, phonatory, and respiratory domains increased the classification accuracy to 73.53%, resulting in a more robust representation for dysarthria severity classification. Probing analyses of articulatory and prosodic features supported the choice of the best-performing HuBERT layer, while the low correlation with handcrafted features highlighted their complementary contribution. Finally, comparative t-SNE visualizations further validated the effectiveness of the proposed feature fusion, demonstrating clearer class separability.
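The pipeline the abstract describes (reduce a 1024-dimensional layer embedding to 32 dimensions, then concatenate 21 MFCC dimensions into a fused feature vector) can be sketched as follows. The paper does not state which reduction method it used, so the SVD-based PCA and the random arrays below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Mock data standing in for layer-23 HuBERT-large embeddings (1024-dim)
# and MFCC vectors (21-dim); shapes follow the abstract, values are random.
ssl_feats = rng.standard_normal((200, 1024))
mfcc_feats = rng.standard_normal((200, 21))

def pca_reduce(X, k):
    """Project X onto its top-k principal components (SVD-based PCA)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

reduced = pca_reduce(ssl_feats, 32)                    # 1024 -> 32 dimensions
fused = np.concatenate([reduced, mfcc_feats], axis=1)  # 32 + 21 = 53 dimensions
print(fused.shape)  # (200, 53)
```

In a real experiment, `ssl_feats` would be frame-averaged hidden states extracted from the chosen transformer layer, and `fused` would feed the CNN classifier.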
Speech Communication, Volume 175, Article 103326 (Journal Article).
Citations: 0
A survey of deep learning for complex speech spectrograms
IF 3 | CAS Tier 3 (Computer Science) | Q2 ACOUSTICS | Pub Date: 2025-11-01 | DOI: 10.1016/j.specom.2025.103319
Yuying Xie, Zheng-Hua Tan
Recent advancements in deep learning have significantly impacted the field of speech signal processing, particularly in the analysis and manipulation of complex spectrograms. This survey provides a comprehensive overview of the state-of-the-art techniques leveraging deep neural networks for processing complex spectrograms, which encapsulate both magnitude and phase information. We begin by introducing complex spectrograms and their associated features for various speech processing tasks. Next, we examine the key components and architectures of complex-valued neural networks, which are specifically designed to handle complex-valued data and have been applied to complex spectrogram processing. As recent studies have primarily focused on applying real-valued neural networks to complex spectrograms, we revisit these approaches and their architectural designs. We then discuss various training strategies and loss functions tailored for training neural networks to process and model complex spectrograms. The survey further examines key applications, including phase retrieval, speech enhancement, and speaker separation, where deep learning has achieved significant progress by leveraging complex spectrograms or their derived feature representations. Additionally, we examine the intersection of complex spectrograms with generative models. This survey aims to serve as a valuable resource for researchers and practitioners in the field of speech signal processing, deep learning and related fields.
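As a minimal illustration of the survey's core object — a complex spectrogram carrying both magnitude and phase — the sketch below computes a naive STFT of a pure tone. The window length, hop size, and test signal are arbitrary choices for demonstration, not parameters from the survey:

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    """Naive complex STFT: Hann-windowed frames -> one-sided rFFT per frame."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)  # shape: (n_frames, n_fft // 2 + 1)

sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 1000 * t)  # 1 kHz tone

S = stft(x)                                # complex-valued spectrogram
magnitude, phase = np.abs(S), np.angle(S)  # the two components most methods model
peak_bin = magnitude.mean(axis=0).argmax()
print(peak_bin * sr / 256)  # 1000.0 -> energy concentrates at the tone frequency
```

Real-valued networks typically consume `magnitude` (or stacked real/imaginary parts), whereas complex-valued networks operate on `S` directly.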
Speech Communication, Volume 175, Article 103319 (Journal Article).
Citations: 0
Categorization of patients affected with neurogenerative dysarthria among Hindi-speaking population and analyzing factors causing reduced speech intelligibility at the human-machine interface
IF 3 | CAS Tier 3 (Computer Science) | Q2 ACOUSTICS | Pub Date: 2025-11-01 | DOI: 10.1016/j.specom.2025.103328
Raj Kumar , Manoj Tripathy , Niraj Kumar , R.S. Anand
Dysarthria, a vocal movement disorder resulting from neurological disease, is recognized by various indicators, such as diminished intensity, uncontrolled pitch variations, varying speech rate, and hypo-/hypernasality, among other symptoms. It presents significant hurdles for dysarthric individuals when interfacing with machines operated by automatic speech recognition (ASR) systems tailored to the speech of neurologically healthy people. This research delves into the voice characteristics contributing to decreased intelligibility in human-machine interaction by investigating the behaviour of ASR systems across varying degrees of dysarthria. The work presents a pilot study of dysarthria in the Hindi-speaking population by compiling a Hindi corpus. Using this corpus, the study scrutinizes the distinct voice attributes present in dysarthric speech, focusing on parameters such as pitch perturbation, amplitude perturbation, articulation rate, and pause and phoneme rate, derived from sustained phonation and continuous speech captured with a conventional close-talk microphone and a throat microphone. The speech dataset includes recordings from sixty participants with neurological conditions, each providing thirty sentences. Participants are categorized into four intelligibility groups for analysis using the Google Cloud Speech-to-Text conversion system. The phonation analysis reveals greater disturbances in pitch and intensity variation as intelligibility decreases.
Additionally, a sentence-level analysis was conducted to explore the influence of inter-word pauses and word complexity across the intelligibility groups. The results show that individuals with severe dysarthria tend to speak more slowly and to misarticulate longer words. The study provides numerical ranges for pitch, amplitude, and time perturbation, which will be helpful for researchers in dysarthric speech recognition (DSR) system development who use data augmentation to generate synthetic dysarthric data to mitigate data scarcity.
The work establishes a relationship between word complexity and intelligibility, which will support speech pathologists in designing customized speech training programs to improve intelligibility for individuals with dysarthria.
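The pitch and amplitude perturbation parameters mentioned above are commonly quantified as local jitter and shimmer. A minimal sketch under that assumption follows; the paper's exact formulas are not given in the abstract, and the period/amplitude tracks here are illustrative values rather than measured data:

```python
import numpy as np

def jitter_local(periods):
    """Local jitter (%): mean absolute difference between consecutive
    pitch periods, normalized by the mean period."""
    periods = np.asarray(periods, dtype=float)
    return 100 * np.mean(np.abs(np.diff(periods))) / periods.mean()

def shimmer_local(amplitudes):
    """Local shimmer (%): the same measure applied to cycle peak amplitudes."""
    amps = np.asarray(amplitudes, dtype=float)
    return 100 * np.mean(np.abs(np.diff(amps))) / amps.mean()

# Illustrative pitch-period tracks in seconds; real values would come from
# a pitch tracker run over sustained phonation recordings.
steady = [0.010, 0.010, 0.010, 0.010]
perturbed = [0.010, 0.012, 0.009, 0.011]
print(jitter_local(steady), jitter_local(perturbed))  # steady phonation -> 0.0
```

Higher jitter/shimmer values would correspond to the greater pitch and intensity disturbances the study reports for lower-intelligibility groups.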
Speech Communication, Volume 175, Article 103328 (Journal Article).
Citations: 0
Noise-robust feature extraction for keyword spotting based on supervised adversarial domain adaptation training strategies
IF 3 | CAS Tier 3 (Computer Science) | Q2 ACOUSTICS | Pub Date: 2025-11-01 | DOI: 10.1016/j.specom.2025.103323
Yongqiang Chen , Qianhua He , Zunxian Liu , Mingru Yang , Wenwu Wang
Keyword spotting (KWS) suffers from the domain shift between training and testing in practical complex situations. To improve the robustness of KWS models in noisy environments, this paper proposes a novel domain-invariant feature extraction strategy called supervised probabilistic multi-domain adversarial training (SPMDAT). Based on supervised adversarial domain adaptation (SADA), SPMDAT makes better use of differently distributed data (multi-condition data) by using a class-wise domain discriminator to estimate the domain index probability distribution. Experimental results on three different deep networks showed that SPMDAT could improve KWS performance in three noisy situations, compared to the multi-condition training (MCT) strategy: seen noise, unseen noise, and seen noise at ultra-low signal-to-noise ratio (SNR) levels. In particular, for KWT-1, the average relative improvements are 9.63%, 10.83%, and 28.16%, respectively. SPMDAT also achieves better results in the three test situations than the other two SADA strategies adapted from unsupervised domain adaptation (UDA) methods. Since the three strategies are used only during training, all the improvements are achieved without increasing the computational complexity of the inference models. In addition, to better understand the practicability of the SADA-based strategies, experiments are conducted to assess the impact of model parameters on performance. The results show that models with approximately 69K parameters already achieve performance improvements over MCT, suggesting the effectiveness of the strategies for small-footprint KWS models.
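Adversarial domain adaptation of the kind SADA builds on typically relies on a gradient reversal layer between the feature extractor and the domain discriminator: the forward pass is the identity, while the backward pass flips (and scales) the gradient so the extractor learns domain-confusing features. The sketch below shows only that generic mechanism; SPMDAT's class-wise discriminator and probabilistic domain-index estimation are not reproduced here:

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; scales gradients by -lambda in the
    backward pass, pushing the feature extractor to confuse the domain
    discriminator (the core trick of adversarial domain adaptation)."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x  # features pass through unchanged

    def backward(self, grad_output):
        return -self.lam * grad_output  # reversed, scaled gradient

grl = GradientReversal(lam=0.5)
x = np.array([1.0, -2.0, 3.0])
grad = np.array([0.1, 0.2, -0.3])
print(grl.forward(x), grl.backward(grad))
```

In a full training loop, the discriminator's domain-classification gradient would flow through `backward`, so minimizing the discriminator loss simultaneously maximizes domain confusion in the shared features.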
Speech Communication, Volume 175, Article 103323 (Journal Article).
Citations: 0
Using spatial sound reproduction for studying speech perception of listeners with different language immersion experiences
IF 3 | CAS Tier 3 (Computer Science) | Q2 ACOUSTICS | Pub Date: 2025-11-01 | DOI: 10.1016/j.specom.2025.103320
Yusuke Hioka , C.T. Justine Hui , Hinako Masuda , Yunqi C. Zhang , Eri Osawa , Takayuki Arai
This study evaluates a research method for studying the speech perception of listeners with different language backgrounds under practical acoustic environments. The proposed method utilises spatial sound reproduction, an emerging technology that enables arbitrary acoustic environments to be reproduced in controlled laboratory settings, for testing participants recruited at multiple locations geographically distant from each other. To validate the method, the study conducted a listening test in a real seminar room and a chapel, as well as under a spherical-harmonics-based spatial sound reproduction that reproduced the acoustics of the two venues up to the third order, and investigated differences between the results collected from the two test types. Three groups of participants with different levels of immersion in New Zealand English were recruited in Auckland, New Zealand, and Tokyo, Japan. The experimental results show that spatial sound reproduction is able to capture the advantage of first-language (L1) listeners in correctly understanding speech in noise and reverberation, but is not sensitive enough to describe the subtle differences among second-language (L2) listeners with different levels of language immersion experience. The research method is also partially able to describe how well listeners can benefit from spatial release from masking, regardless of their language immersion experiences, under room acoustics with higher speech clarity (C50), and may represent the effect of room acoustics in a real room within a certain range of room acoustics characterised by speech clarity.
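The study characterises its venues partly by speech clarity (C50), the early-to-late energy ratio of a room impulse response in decibels. A sketch of that standard definition follows, using synthetic exponentially decaying impulse responses rather than measurements from the actual seminar room and chapel:

```python
import numpy as np

def clarity_c50(ir, sr):
    """Speech clarity C50 (dB): 10*log10 of the energy arriving within the
    first 50 ms of a room impulse response over the energy arriving later."""
    split = int(0.050 * sr)  # 50 ms boundary in samples
    early = np.sum(ir[:split] ** 2)
    late = np.sum(ir[split:] ** 2)
    return 10 * np.log10(early / late)

# Synthetic decaying impulse responses (illustrative, not measured rooms).
sr = 8000
t = np.arange(sr) / sr
fast = np.exp(-30 * t)  # drier room: energy decays quickly -> higher C50
slow = np.exp(-5 * t)   # more reverberant room: slow decay -> lower C50
print(clarity_c50(fast, sr), clarity_c50(slow, sr))
```

Higher C50 values correspond to the clearer-sounding rooms in which the study found spatial release from masking easiest to characterise.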
Speech Communication, Volume 175, Article 103320 (Journal Article).
Citations: 0
Trading accuracy for fluency? An investigation of word retrieval difficulties in connected speech
IF 3 | CAS Tier 3 (Computer Science) | Q2 ACOUSTICS | Pub Date: 2025-11-01 | DOI: 10.1016/j.specom.2025.103325
Amber Römkens , Aurélie Pistono
Many authors view disfluencies as by-products of speech encoding difficulties, but it remains unclear which connected-speech phenomena genuinely reflect word-form retrieval problems. This study examined the relationship between retrieval difficulty, disfluency production, and individual differences in typically aging older adults, in light of both the Transmission Deficit Hypothesis (TDH) and the Inhibition Deficit Hypothesis (IDH). Twenty-five native Dutch-speaking adults aged 60 to 73 completed a connected-speech network task in which lexical frequency (high vs. low) was manipulated. Disfluencies and related responses were annotated and analyzed using generalized linear mixed-effects models. Lexical frequency did not affect the overall likelihood of disfluencies, arguing against IDH and against extending TDH to disfluency production. However, low-frequency words did increase semantically related answers, which could be consistent with the TDH. Vocabulary knowledge provided additional protection, with higher scores predicting fewer semantic alternatives. These findings suggest that disfluencies are not simply symptoms of retrieval failure. Implications are discussed in relation to “good-enough language production” and methodological challenges of capturing language production difficulties in cognitively demanding but ecologically valid tasks and contexts.
Speech Communication, Volume 175, Article 103325 (Journal Article).
引用次数: 0
Arabic dialects speech corpora: A systematic review 阿拉伯语方言语料库:系统回顾
IF 3 3区 计算机科学 Q2 ACOUSTICS Pub Date : 2025-11-01 DOI: 10.1016/j.specom.2025.103322
Ammar Mohammed Ali Alqadasi , Akram M. Zeki , Mohd Shahrizal Sunar , Siti Zaiton Mohd Hashim , Md Sah hj Salam , Rawad Abdulghafor
Speech processing applications are crucial in various domains, necessitating reliable speech recognition systems built upon suitable speech databases. However, the availability of comprehensive resources for the Arabic language remains limited compared to other languages like English. A systematic review was conducted to identify, analyze, and classify existing Arabic dialect speech databases. Initially, online digital databases and search engines were identified to collect a diverse range of manuscripts for thorough examination. The review encompassed 30 publicly accessible databases and an additional 39 self-databases, which were thoroughly studied, classified based on their characteristics, and subjected to a detailed analysis of research trends. This paper offers a comprehensive discussion on the diverse speech databases developed for various speech processing applications, highlighting the purposes and unique characteristics of Arabic speech databases. By providing valuable insights into their availability, characteristics, challenges, and research directions, this review aims to facilitate researchers' access to suitable resources for their specific applications, encourage the creation of new datasets in underrepresented areas, and promote open and easily accessible databases. Furthermore, the findings contribute to bridging the gap in available Arabic speech databases and serve as a valuable resource for researchers in the field.
Speech Communication, vol. 175, Article 103322. Citations: 0
Predicting speech intelligibility in older adults for speech enhancement using the Gammachirp Envelope Similarity Index, GESI 使用Gammachirp包络相似性指数预测老年人语音可理解性以增强语音
IF 3 3区 计算机科学 Q2 ACOUSTICS Pub Date : 2025-11-01 DOI: 10.1016/j.specom.2025.103318
Ayako Yamamoto , Fuki Miyazaki , Toshio Irino
We propose an objective intelligibility measure (OIM), called the Gammachirp Envelope Similarity Index (GESI), that can predict speech intelligibility (SI) in older adults. GESI is a bottom-up model based on psychoacoustic knowledge from the peripheral to the central auditory system. It computes the single SI metric using the gammachirp filterbank (GCFB), the modulation filterbank, and the extended cosine similarity measure. It takes into account not only the hearing level represented in the audiogram, but also the temporal processing characteristics captured by the temporal modulation transfer function (TMTF). To evaluate performance, SI experiments were conducted with older adults of various hearing levels using speech-in-noise with ideal speech enhancement on familiarity-controlled Japanese words. The prediction performance was compared with HASPIw2, which was developed for keyword SI prediction. The results showed that GESI predicted the subjective SI scores more accurately than HASPIw2. GESI was also found to be at least as effective as, if not more effective than, HASPIv2 in predicting English sentence-level SI. The effect of introducing TMTF into the GESI algorithm was insignificant, suggesting that TMTF measurements and models are not yet mature. Therefore, it may be necessary to perform TMTF measurements with bandpass noise and to improve the incorporation of temporal characteristics into the model.
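The similarity stage of a metric like GESI — comparing envelope representations of reference and degraded speech with a cosine measure — can be sketched in a few lines. This is a toy stand-in, not the actual GESI implementation: it operates on arbitrary channel-by-frame envelope matrices and omits the gammachirp and modulation filterbanks entirely, and the function name and shapes are illustrative assumptions.

```python
import numpy as np

def cosine_similarity_metric(ref_env, deg_env, eps=1e-12):
    """Frame-wise cosine similarity between two envelope
    representations (channels x frames), averaged over frames.

    A simplified stand-in for GESI's extended cosine similarity;
    the real model compares gammachirp + modulation filterbank
    outputs, which are not reproduced here."""
    ref = np.asarray(ref_env, dtype=float)
    deg = np.asarray(deg_env, dtype=float)
    # per-frame dot products and norms across channels
    num = np.sum(ref * deg, axis=0)
    den = np.linalg.norm(ref, axis=0) * np.linalg.norm(deg, axis=0) + eps
    # average frame-wise similarity -> single SI-like score in [-1, 1]
    return float(np.mean(num / den))
```

Identical envelopes score 1.0 and orthogonal ones score 0, so a degraded signal's score falls as its envelope structure diverges from the clean reference.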
Speech Communication, vol. 175, Article 103318. Citations: 0
Dynamic graph learning with gated convolutions for single-channel speech separation 基于门控卷积的单通道语音分离动态图学习
IF 3 3区 计算机科学 Q2 ACOUSTICS Pub Date : 2025-11-01 DOI: 10.1016/j.specom.2025.103321
Meng Zhang, Xinyu Jia, Yina Guo
Single-channel speech separation remains challenging due to the need for joint modeling of time-varying spectral patterns and spatial interactions between overlapping sources. While deep learning methods excel at temporal sequence processing, their fixed geometric representations inherently limit dynamic spatial relationship modeling. In this paper, we propose a novel dynamically learned Gated Dense Graph Convolutional Network (GDGCN) that overcomes the limitations of spatiotemporal dynamic modeling in speech separation. Specifically, we employ an adaptive hybrid topology integrating complete and K-partite graph structures to explicitly model multi-scale spatial dependencies between sound sources. Furthermore, we design a novel gating mechanism for speech graph data that maps node features to an information selection space through learnable projection matrices, dynamically regulating inter-node information flow. This architecture enables effective modeling of time-varying couplings without being constrained by static parameters. Experimental evaluations on benchmark datasets demonstrate the superior performance of our method for speech separation under noisy conditions as evidenced by objective metrics.
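The gating idea described here — a learnable projection that maps node features into a multiplicative gate regulating inter-node information flow — can be illustrated with a generic gated graph-convolution step. Everything below (function names, shapes, the sigmoid gate) is a hedged sketch of the general mechanism, not the paper's GDGCN layer.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_graph_conv(H, A, W, W_g):
    """One gated graph-convolution step.

    H   : node features, shape (n_nodes, d_in)
    A   : row-normalized adjacency matrix, shape (n_nodes, n_nodes)
    W   : learnable projection of aggregated messages, (d_in, d_out)
    W_g : learnable projection mapping node features into an
          information-selection (gate) space, (d_in, d_out)

    Generic illustration of gated message passing; the actual
    GDGCN layer and its hybrid topology are not reproduced."""
    agg = A @ H              # aggregate neighbor features
    msg = agg @ W            # project messages
    gate = sigmoid(H @ W_g)  # per-node, per-feature gate in (0, 1)
    return gate * msg        # gate regulates inter-node information flow
```

With `W_g` driven strongly negative the gate shuts off incoming messages, and with it strongly positive messages pass through unchanged — the learnable projection decides, per node and feature, how much neighbor information flows.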
Speech Communication, vol. 175, Article 103321. Citations: 0
FinnAffect: An affective speech corpus for spontaneous Finnish FinnAffect:自发性芬兰语的情感语料库
IF 3 3区 计算机科学 Q2 ACOUSTICS Pub Date : 2025-11-01 DOI: 10.1016/j.specom.2025.103327
Kalle Lahtinen , Liisa Mustanoja , Okko Räsänen
Affective expression plays a major role in everyday spoken and written language. In order to study how affect is expressed by Finnish language users in day-to-day life, data consisting of samples from naturalistic and unscripted contexts is required. The present work describes the first spontaneous speech corpus for Finnish with affect-related annotations, containing 12,000 transcribed samples of unscripted speech paired with continuous-valued scores of valence and arousal marked by five native Finnish speakers. We first describe the creation of the corpus, based on combining speech samples from three large-scale Finnish speech corpora, from which we chose samples for annotation using an active learning-based affect mining approach. We then report characteristics of the resulting corpus and annotation consistency, followed by speech emotion recognition (SER) experiments with several classifiers and regression models to test the feasibility of the corpus for SER system development and evaluation. Annotation analyses reveal mean Pearson correlations between annotator scores and the mean of all annotators to be ρmean=0.856 for valence and ρmean=0.898 for arousal. The SER experiments on discretized labels result in an average unweighted average recall (UAR) of 0.458 for ternary valence classification and 0.719 for binary arousal classification using a fine-tuned ExHuBERT model for valence prediction and a support vector machine (SVM) classifier for arousal prediction, reaching comparable levels to those reported earlier for spontaneous speech. For the regression task, concordance correlation coefficients of 0.270 and 0.689 were obtained for valence and arousal, respectively, when using a WavLM-based model trained on MSP-Podcast corpus and fine-tuned on the target data. Overall, the analyses suggest that the corpus provides a feasible basis for later study on affective expression in spontaneous Finnish.
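The two evaluation metrics quoted above are standard: unweighted average recall (UAR) is the mean of per-class recalls, insensitive to class imbalance, and the concordance correlation coefficient (CCC, Lin 1989) is 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). A minimal reference implementation (not the authors' evaluation code):

```python
import numpy as np

def ccc(x, y):
    """Concordance correlation coefficient (Lin, 1989):
    2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2),
    using population (ddof=0) variance and covariance."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return 2 * cov / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

def uar(y_true, y_pred, labels):
    """Unweighted average recall: mean of per-class recalls."""
    recalls = []
    for c in labels:
        mask = [t == c for t in y_true]
        hits = sum(p == t for p, t, m in zip(y_pred, y_true, mask) if m)
        recalls.append(hits / max(sum(mask), 1))
    return sum(recalls) / len(labels)
```

Unlike Pearson correlation, CCC penalizes systematic offset and scale differences between predictions and annotations, which is why it is the usual choice for continuous valence/arousal regression.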
Speech Communication, vol. 175, Article 103327. Citations: 0