
Latest publications in Speech Communication

Emphasis rendering for conversational text-to-speech with multi-modal multi-scale context modeling
IF 3.0 | Tier 3 (Computer Science) | Q2 (Acoustics) | Pub Date: 2026-01-27 | DOI: 10.1016/j.specom.2026.103353
Rui Liu, Jia Zhenqi, Jie Yang, Yifan Hu, Haizhou Li
Conversational Text-to-Speech (CTTS) aims to express an utterance accurately, with the appropriate style, within a conversational setting, and it has been attracting increasing attention. Despite the recognized significance of the CTTS task, prior studies have not thoroughly investigated speech emphasis expression, which is essential for conveying the underlying intention and attitude in human-machine interaction scenarios, owing to the scarcity of conversational emphasis datasets and the difficulty of context understanding. In this paper, we propose a novel Emphasis Rendering scheme for the CTTS model, termed ER-CTTS, that includes two main components: (1) we simultaneously take textual and acoustic contexts into account, with both global and local semantic modeling, to understand the conversation context comprehensively; (2) we deeply integrate multi-modal and multi-scale context to learn the influence of context on the emphasis expression of the current utterance. Finally, the inferred emphasis feature is fed into the neural speech synthesizer to generate conversational speech. To address data scarcity, we create emphasis intensity annotations on an existing conversational dataset (DailyTalk). Both objective and subjective evaluations suggest that our model outperforms the baseline models in emphasis rendering within a conversational setting. The code and audio samples are available at https://github.com/AI-S2-Lab/ER-CTTS.
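As an illustration of the context-fusion idea described in the abstract, the following is a minimal, hypothetical PyTorch sketch of cross-attention over multi-modal (text and audio) and multi-scale (per-turn and conversation-level) context. The module names, dimensions, and interfaces are illustrative assumptions, not the authors' ER-CTTS implementation.

```python
# Hypothetical sketch of multi-modal, multi-scale context fusion for emphasis
# prediction; not the authors' ER-CTTS implementation.
import torch
import torch.nn as nn

class ContextEmphasisPredictor(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        # The current utterance queries each context stream via cross-attention.
        self.text_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.audio_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse = nn.Sequential(
            nn.Linear(4 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )

    def forward(self, cur_text, text_local, text_global, audio_local, audio_global):
        # cur_text:  (B, T, D) encodings of the current utterance
        # *_local:   (B, N, D) per-turn (local) context representations
        # *_global:  (B, 1, D) conversation-level (global) summaries
        t_loc, _ = self.text_attn(cur_text, text_local, text_local)
        t_glo, _ = self.text_attn(cur_text, text_global, text_global)
        a_loc, _ = self.audio_attn(cur_text, audio_local, audio_local)
        a_glo, _ = self.audio_attn(cur_text, audio_global, audio_global)
        fused = torch.cat([t_loc, t_glo, a_loc, a_glo], dim=-1)
        return self.fuse(fused)  # (B, T, D) emphasis feature for the synthesizer
```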
Citations: 0
Classification of phonation types in singing and speaking voice using self-supervised learning models
IF 3.0 | Tier 3 (Computer Science) | Q2 (Acoustics) | Pub Date: 2026-01-23 | DOI: 10.1016/j.specom.2026.103355
Prathamesh Parasharam Patil, Kiran Reddy Mittapalle, Paavo Alku
Phonation, the process of producing audible sound, involves various laryngeal adjustments that result in distinct phonation types such as breathy, neutral, and pressed. These types are crucial in conveying expression and emotion in both singing and speaking. This study proposes an approach for automatic phonation type classification by leveraging features extracted from three pre-trained speech foundation self-supervised learning (SSL) models (Wav2vec2-Base, Wav2vec2-Large, and HuBERT) and one pre-trained foundation SSL model for non-verbal vocalization (voc2vec). Unlike traditional methods that depend on intricate signal processing and manual feature engineering, our approach automatically derives robust, high-level representations from multiple layers of these SSL models. These learned features are subsequently classified using Support Vector Machines (SVMs) with a Radial Basis Function (RBF) kernel and Feed-Forward Neural Networks (FFNNs). Experiments conducted on established singing and speaking voice datasets, utilizing a 10-fold cross-validation strategy, demonstrate superior performance. The proposed SSL-based method achieved classification accuracy of 97% for singing voice and 88% for speaking voice, significantly outperforming conventional feature extraction techniques. These findings underscore the efficacy of SSL-derived features for classification of phonation types, offering a scalable and powerful methodology with considerable implications for vocal pedagogy, performance analysis, singing synthesis, and clinical voice assessment.
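A minimal sketch of the general recipe described above: hidden states from one layer of a pre-trained SSL model are pooled per utterance and fed to an RBF-kernel SVM. The checkpoint, layer index, and labels are illustrative assumptions, not the paper's exact pipeline or hyperparameters.

```python
# Minimal sketch: pool hidden states from one layer of a pre-trained wav2vec 2.0
# model, then classify phonation type with an RBF-kernel SVM. The checkpoint,
# layer index, and labels are illustrative, not the paper's configuration.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.svm import SVC

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
ssl_model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

def utterance_embedding(waveform_16k, layer=9):
    """Mean-pool one transformer layer over time -> fixed-size vector."""
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        out = ssl_model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer].mean(dim=1).squeeze(0).numpy()  # (768,)

# waveforms: list of 1-D numpy arrays at 16 kHz; labels: "breathy"/"neutral"/"pressed"
# X = [utterance_embedding(w) for w in waveforms]
# clf = SVC(kernel="rbf", C=10.0, gamma="scale").fit(X, labels)
```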
Citations: 0
Cross-attention fusion for audio-visual emotion recognition with shared transformer
IF 3.0 | Tier 3 (Computer Science) | Q2 (Acoustics) | Pub Date: 2026-01-23 | DOI: 10.1016/j.specom.2026.103356
Jianjun Lei, Kai Ye, Ying Wang
Accurately recognizing emotional states facilitates natural human–computer interaction. However, conventional multimodal emotion recognition approaches usually achieve low recognition accuracy because they integrate features from the different modalities inadequately. This paper proposes an innovative multimodal emotion recognition framework with four pivotal elements: Cross-attention, Auxiliary network, Multi-head attention, and Shared Transformer, hence referred to as CAMS. The shared Transformer leverages single-modal features in the auxiliary network to refine the attention mechanism during feature fusion, which prevents the model from ignoring important emotional information and improves feature interaction. In addition, we introduce a cross-attention mechanism to connect multimodal features from the audio and video encoders. Moreover, we incorporate noise injection and feature-dropping techniques during training, simulating potential real-world challenges such as noise interference and missing modalities. Our empirical evaluation on the RAVDESS and CREMA-D datasets demonstrates CAMS’s superior performance, with average accuracies of 89.78% and 80.23%, respectively, surpassing current benchmarks in multimodal emotion recognition.
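The training-time robustness tricks mentioned above (noise injection and feature dropping to simulate interference and missing modalities) can be illustrated with a short, hypothetical PyTorch helper; the dropout probability and noise scale are assumptions, not the paper's settings.

```python
# Hypothetical helper for the robustness tricks described above; probabilities
# and noise scale are illustrative, not the paper's settings.
import torch

def corrupt_modalities(audio_feat, video_feat, p_drop=0.2, noise_std=0.05, training=True):
    """audio_feat, video_feat: (B, T, D) sequences from the audio/video encoders."""
    if not training:
        return audio_feat, video_feat
    # Additive Gaussian noise simulates acoustic and visual interference.
    audio_feat = audio_feat + noise_std * torch.randn_like(audio_feat)
    video_feat = video_feat + noise_std * torch.randn_like(video_feat)
    # Randomly zero out a modality per sample to simulate missing inputs.
    b = audio_feat.size(0)
    drop_a = (torch.rand(b, 1, 1, device=audio_feat.device) < p_drop).float()
    drop_v = (torch.rand(b, 1, 1, device=video_feat.device) < p_drop).float()
    both = ((drop_a + drop_v) == 2).float()
    drop_v = drop_v * (1 - both)  # never drop both modalities for one sample
    return audio_feat * (1 - drop_a), video_feat * (1 - drop_v)
```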
Citations: 0
Influence of speech-in-noise perception, gender, and age on lipreading ability for monosyllabic words
IF 3.0 | Tier 3 (Computer Science) | Q2 (Acoustics) | Pub Date: 2026-01-16 | DOI: 10.1016/j.specom.2026.103354
Brandon T. Paul, Patricia V. Aguiar, Jennifer Hanna Al-Shaikh
Lipreading is an important function that supports speech perception in people with communication difficulties, and there is recent interest in the development of lipreading training programs as a strategy to rehabilitate their speech perception. However, the success of these programs is mixed, potentially owing to individual differences in participant characteristics. Here, we conducted an online cross-sectional study to examine the effects of speech-in-noise perception, gender, and age on lipreading performance. Forty participants aged 41–75 years viewed short, silent video clips of a woman speaking a monosyllabic word and typed the word they perceived into a response box. Additionally, we collected demographic information and speech-in-noise perception scores. Lipreading performance was scored at the word level (lexical level) and at sublexical levels for individual phonemes and visually identical homophemes (visemes) of the target words. Women correctly reported more phonemes and visemes per word than men, but no gender effect was found at the whole-word level. There was an interaction between age and speech-in-noise ability for both words and phonemes: Lipreading performance was best for comparatively younger participants with worse speech-in-noise performance, but this effect was not found in older participants and those with better speech-in-noise perception. This suggests evidence for a compensatory reliance on visual speech when speech-in-noise perception is low, which may decline with age. Overall, our results suggest that gender, age, and speech-in-noise perception shape lipreading ability, although the effects may differ depending on the analysis at lexical and sublexical levels.
Citations: 0
MS-VBRVQ: Multi-scale variable bitrate speech residual vector quantization
IF 3.0 | Tier 3 (Computer Science) | Q2 (Acoustics) | Pub Date: 2025-12-18 | DOI: 10.1016/j.specom.2025.103346
Yukun Qian, Shiyun Xu, Xuyi Zhuang, Zehua Zhang, Mingjiang Wang
Recent speech quantization compression models have adopted residual vector quantization (RVQ) methods. However, these models typically use fixed bitrates, allocating the same number of time frames at a constant scale across all speech segments. This approach may lead to bitrate inefficiency, particularly when the audio contains simpler segments. To address this limitation, we introduce a multi-scale variable bitrate approach by incorporating a relative importance map, adaptive threshold masks, and a gradient estimation function into the RVQ-GAN model. This method allows the allocation of time frames at varying time scales, depending on the complexity of the audio. For more complex audio, a greater number of time frames are allocated, while fewer time frames are assigned to simpler segments. Additionally, we propose both symmetric and asymmetric decoding methods. Asymmetric decoding is easier to implement and integrates seamlessly into the system, while symmetric decoding delivers superior audio quality at lower bitrates. Subjective and objective experiments demonstrate that, compared to EnCodec, both of our decoding methods deliver excellent audio quality at lower bitrates across various speech and singing datasets, with only a slight increase in computational cost. In comparison to the VRVQ method, we achieve comparable audio quality at even lower bitrates, while requiring less computational cost.
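A rough, hypothetical sketch of the underlying idea: residual vector quantization gated by a per-frame relative importance map, so that simpler frames consume fewer quantizer stages. A straight-through estimator stands in for the paper's gradient estimation function, and all sizes and thresholds are illustrative rather than the MS-VBRVQ design.

```python
# Illustrative sketch only: residual vector quantization gated by a per-frame
# importance map, so less important frames use fewer quantizer stages. A
# straight-through estimator stands in for the paper's gradient estimation
# function; sizes and thresholds are not the MS-VBRVQ design.
import torch
import torch.nn as nn

class MaskedRVQ(nn.Module):
    def __init__(self, dim=128, codebook_size=1024, n_quantizers=8):
        super().__init__()
        self.codebooks = nn.ModuleList(
            [nn.Embedding(codebook_size, dim) for _ in range(n_quantizers)]
        )
        self.importance = nn.Linear(dim, 1)  # relative importance per frame

    def forward(self, z):
        # z: (B, T, D) encoder output
        imp = torch.sigmoid(self.importance(z))               # (B, T, 1) in [0, 1]
        residual, quantized = z, torch.zeros_like(z)
        for i, cb in enumerate(self.codebooks):
            active = (imp > i / len(self.codebooks)).float()  # deeper stages need higher importance
            flat = residual.reshape(-1, residual.size(-1))
            idx = torch.cdist(flat, cb.weight).argmin(dim=-1)
            code = cb(idx).view_as(residual)
            code = residual + (code - residual).detach()      # straight-through estimator
            quantized = quantized + active * code
            residual = residual - active * code
        return quantized, imp
```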
Citations: 0
Hand gesture realisation of contrastive focus in real-time whisper-to-speech synthesis: Investigating the transfer from implicit to explicit control of intonation
IF 3.0 | Tier 3 (Computer Science) | Q2 (Acoustics) | Pub Date: 2025-12-15 | DOI: 10.1016/j.specom.2025.103344
Delphine Charuau, Nathalie Henrich Bernardoni, Silvain Gerber, Olivier Perrotin
The ability of speakers to externalise the control of their intonation in the context of voice substitution communication is evaluated in terms of the realisation of a contrastive focus in French. A whisper-to-speech synthesiser is used with gestural interfaces for intonation control, enabling two types of gesture: an isometric finger pressure and an isotonic wrist movement. An original experimental paradigm is designed to elicit a contrastive focus on target syllables of nine-syllable sentences by means of a read-question-answer scenario. For all 16 participants, focus was successfully achieved in speech and in both modality transfer situations by increasing the fundamental frequency and duration of the target syllable. Coordination of the articulation of the whispered syllables and the manual intonational control was acquired quickly and easily. Focus realisation by finger pressure or wrist movement showed very similar dynamics in intonation and duration. Overall, although wrist movement was preferred in terms of ease of control, both interfaces were judged to be equal in terms of learning, performance, emotional experience, and cognitive load.
Citations: 0
Lateral channel dynamics and F3 modulation: Quantifying para-sagittal articulation in Australian English /l/
IF 3.0 | Tier 3 (Computer Science) | Q2 (Acoustics) | Pub Date: 2025-12-13 | DOI: 10.1016/j.specom.2025.103345
Jia Ying
This study investigates articulatory-acoustic relationships in Australian English /l/ using simultaneous 3D electromagnetic articulography (EMA) and acoustic recordings from six speakers producing /l/ in onset and coda positions with /æ/ and /ɪ/ vowels. Linear mixed-effects models revealed significant relationships between tongue lateralization and all three formants, with F3 emerging as the primary acoustic correlate of lateralization (β = 0.081, p < 0.001). Acoustic properties of /l/ were strongly influenced by vowel context, with significant vowel-lateralization interactions for F1 and F2, indicating that the acoustic consequences of lateralization vary by vowel environment. Temporal analysis revealed position-dependent timing relationships: F3 preceded articulatory peaks in coda position but showed near-synchronous timing in onset position, while F1 and F2 consistently lagged behind articulatory peaks across all conditions. These findings suggest distinct articulatory-acoustic coupling mechanisms for onset versus coda /l/, with F3 serving as an anticipatory cue in coda position. The results highlight the complex, context-dependent nature of /l/'s articulatory-acoustic relationships and underscore the importance of considering both spectral and temporal dimensions in understanding liquid consonant production.
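For readers unfamiliar with the statistical setup, the following is a minimal sketch of a comparable linear mixed-effects model in Python (statsmodels), with a hypothetical CSV path and column names and a random intercept per speaker; it is not the study's exact model specification.

```python
# Minimal sketch of a comparable linear mixed-effects model (statsmodels);
# the CSV path and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("ema_acoustics.csv")  # columns: F3, lateralization, vowel, position, speaker

model = smf.mixedlm(
    "F3 ~ lateralization * vowel + position",  # fixed effects with vowel interaction
    data=df,
    groups=df["speaker"],                      # random intercept per speaker
)
result = model.fit()
print(result.summary())  # reports the coefficient for lateralization, etc.
```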
Citations: 0
A review on speech emotion recognition for low-resource and Indigenous languages
IF 3.0 | Tier 3 (Computer Science) | Q2 (Acoustics) | Pub Date: 2025-12-09 | DOI: 10.1016/j.specom.2025.103342
Himashi Rathnayake, Jesin James, Gianna Leoni, Ake Nicholas, Catherine Watson, Peter Keegan
Speech emotion recognition (SER) is an emerging field in human–computer interaction. Although numerous studies have focused on SER for well-resourced languages, the literature reveals a significant gap in research on low-resource and Indigenous (LRI) languages. This paper presents a comprehensive review of the existing literature on SER in the context of LRI languages, analysing critical factors to consider at each stage of designing an SER system. The review indicates that most studies on SER for LRI languages adopt emotion categories established for well-resourced languages, often assuming the universality of emotions. However, the literature suggests that this approach may be limited due to emotional disparities influenced by cultural variations. Additionally, the review underscores that current SER systems typically lack community-oriented methodologies in the development of technology for LRI languages. The importance of feature selection is highlighted, with evidence suggesting that a combination of traditional machine learning methods and carefully selected acoustic features may offer viable options for SER in these languages. Furthermore, the review identifies a need for further exploration of semi-supervised and unsupervised approaches to enhance SER capabilities in LRI contexts. Overall, current SER systems for LRI languages lag behind state-of-the-art standards due to the lack of resources, indicating that there is still much work to be done in this area.
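The "carefully selected acoustic features plus traditional machine learning" option highlighted by the review can be sketched, for illustration only, as MFCC statistics fed to an SVM; the feature set, labels, and file list below are assumptions, not a recommendation from the review itself.

```python
# Illustration only: MFCC statistics plus an RBF-kernel SVM as a low-resource
# SER baseline. Feature set, labels, and file paths are assumptions.
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def mfcc_stats(path, sr=16000, n_mfcc=13):
    """Mean and standard deviation of MFCCs over the utterance."""
    y, _ = librosa.load(path, sr=sr)
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    return np.concatenate([m.mean(axis=1), m.std(axis=1)])

# paths: list of audio files; labels: emotion categories defined with the community
# X = np.stack([mfcc_stats(p) for p in paths])
# clf = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X, labels)
```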
Citations: 0
Bottom-up modeling of phoneme learning: Universal sensitivity and language-specific transformation
IF 3.0 | Tier 3 (Computer Science) | Q2 (Acoustics) | Pub Date: 2025-12-04 | DOI: 10.1016/j.specom.2025.103343
Frank Lihui Tan, Youngah Do
This study investigates the emergence and development of universal phonetic sensitivity during early phonological learning using an unsupervised modeling approach. Autoencoder models were trained on raw acoustic input from English and Mandarin to simulate bottom-up perceptual development, with a focus on phoneme contrast learning. The results show that phoneme-like categories and feature-aligned representational spaces can emerge from context-free acoustic exposure alone. Crucially, the model exhibits universal phonetic sensitivity as a transient developmental stage that varies across contrasts and gradually gives way to language-specific perception—a trajectory that parallels infant perceptual development. Different featural contrasts remain universally discriminable for varying durations over the course of learning. These findings support the view that universal sensitivity is not innately fixed but emerges through learning, and that early phonological development proceeds along a mosaic, feature-dependent trajectory.
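A minimal sketch of the modeling ingredient described above: an autoencoder trained to reconstruct acoustic frames so that its bottleneck can be inspected for phoneme-like structure. The mel-frame input, layer sizes, and training loop are illustrative assumptions, not the study's configuration.

```python
# Minimal sketch of an autoencoder over acoustic frames whose bottleneck can be
# probed for phoneme-like categories; sizes and input features are illustrative.
import torch
import torch.nn as nn

class FrameAutoencoder(nn.Module):
    def __init__(self, n_mels=80, bottleneck=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_mels, 128), nn.ReLU(), nn.Linear(128, bottleneck))
        self.decoder = nn.Sequential(nn.Linear(bottleneck, 128), nn.ReLU(), nn.Linear(128, n_mels))

    def forward(self, x):            # x: (B, n_mels) mel-spectrogram frames
        z = self.encoder(x)          # latent space examined for emergent phoneme structure
        return self.decoder(z), z

model = FrameAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
# for batch in loader:              # batches of acoustic frames from English or Mandarin
#     recon, _ = model(batch)
#     loss = loss_fn(recon, batch)
#     opt.zero_grad(); loss.backward(); opt.step()
```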
Citations: 0
Speaker-conditioned phrase break prediction for text-to-speech with phoneme-level pre-trained language model
IF 3.0 | Tier 3 (Computer Science) | Q2 (Acoustics) | Pub Date: 2025-11-29 | DOI: 10.1016/j.specom.2025.103331
Dong Yang, Yuki Saito, Takaaki Saeki, Tomoki Koriyama, Wataru Nakata, Detai Xin, Hiroshi Saruwatari
This paper advances phrase break prediction (also known as phrasing) in multi-speaker text-to-speech (TTS) systems. We integrate speaker-specific features by leveraging speaker embeddings to enhance the performance of the phrasing model. We further demonstrate that these speaker embeddings can capture speaker-related characteristics solely from the phrasing task. Besides, we explore the potential of pre-trained speaker embeddings for unseen speakers through a few-shot adaptation method. Furthermore, we pioneer the application of phoneme-level pre-trained language models to this TTS front-end task, which significantly boosts the accuracy of the phrasing model. Our methods are rigorously assessed through both objective and subjective evaluations, demonstrating their effectiveness.
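A minimal, hypothetical sketch of the speaker-conditioning idea: phoneme-level contextual features concatenated with a speaker embedding and classified per position as break or no break. Dimensions and module names are assumptions, and the phoneme-level pre-trained language model is taken as given.

```python
# Hypothetical sketch of speaker-conditioned phrase-break prediction; dimensions
# and module names are assumptions, and the phoneme-level pre-trained language
# model providing the input features is taken as given.
import torch
import torch.nn as nn

class PhraseBreakPredictor(nn.Module):
    def __init__(self, phoneme_dim=768, speaker_dim=64, hidden=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(phoneme_dim + speaker_dim, hidden), nn.ReLU())
        self.classifier = nn.Linear(hidden, 2)  # break vs. no break after each token

    def forward(self, phoneme_feats, speaker_emb):
        # phoneme_feats: (B, T, phoneme_dim) from a phoneme-level pre-trained LM
        # speaker_emb:   (B, speaker_dim), broadcast across the token sequence
        spk = speaker_emb.unsqueeze(1).expand(-1, phoneme_feats.size(1), -1)
        h = self.proj(torch.cat([phoneme_feats, spk], dim=-1))
        return self.classifier(h)               # per-position logits
```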
Citations: 0