
Latest publications in Speech Communication

Deep learning based stage-wise two-dimensional speaker localization with large ad-hoc microphone arrays
IF 2.4 · CAS Tier 3 (Computer Science) · Q2 (ACOUSTICS) · Pub Date: 2025-04-29 · DOI: 10.1016/j.specom.2025.103247
Shupei Liu, Linfeng Feng, Yijun Gong, Chengdong Liang, Chen Zhang, Xiao-Lei Zhang, Xuelong Li
While deep-learning-based speaker localization has shown advantages in challenging acoustic environments, it often yields only direction-of-arrival (DOA) cues rather than precise two-dimensional (2D) coordinates. To address this, we propose a novel deep-learning-based 2D speaker localization method leveraging ad-hoc microphone arrays. Specifically, each ad-hoc array comprises randomly distributed microphone nodes, each of which is equipped with a traditional array. Our approach first employs convolutional neural networks at each node to estimate speaker directions. Then, we integrate these DOA estimates using triangulation and clustering techniques to obtain 2D speaker locations. To further boost estimation accuracy, we introduce a node selection algorithm that strategically selects the most reliable nodes. Extensive experiments on both simulated and real-world data demonstrate that our approach significantly outperforms conventional methods, and the proposed node selection further refines performance. The real-world dataset used in the experiments, Libri-adhoc-nodes10, a newly recorded dataset described for the first time in this paper, is available online at https://github.com/Liu-sp/Libri-adhoc-nodes10.
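For intuition, the fusion stage can be illustrated with a least-squares intersection of bearing lines: each node contributes a ray from its known position along its estimated DOA, and the 2D location is the point minimizing the summed squared perpendicular distance to all rays. This is a minimal sketch of one standard triangulation formulation, not the authors' exact pipeline (which additionally applies clustering and node selection):

```python
import numpy as np

def triangulate_doas(node_positions, doa_angles):
    """Least-squares intersection of 2D bearing lines.

    node_positions: (N, 2) array of microphone-node coordinates.
    doa_angles: (N,) array of estimated DOAs in radians (world frame).
    Returns the 2D point minimizing the summed squared perpendicular
    distance to all N bearing lines.
    """
    A = np.zeros((2, 2))
    b = np.zeros(2)
    for p, theta in zip(np.asarray(node_positions, dtype=float), doa_angles):
        d = np.array([np.cos(theta), np.sin(theta)])  # unit bearing vector
        P = np.eye(2) - np.outer(d, d)                # projector orthogonal to the bearing
        A += P
        b += P @ p
    return np.linalg.solve(A, b)

# Example: three nodes observing a speaker near (2.0, 1.5)
nodes = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
target = np.array([2.0, 1.5])
angles = [np.arctan2(*(target - np.array(p))[::-1]) for p in nodes]
print(triangulate_doas(nodes, angles))  # ~ [2.0, 1.5]
```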
Citations: 0
Speech Emotion Recognition via CNN-Transformer and multidimensional attention mechanism
IF 2.4 · CAS Tier 3 (Computer Science) · Q2 (ACOUSTICS) · Pub Date: 2025-04-23 · DOI: 10.1016/j.specom.2025.103242
Xiaoyu Tang, Jiazheng Huang, Yixin Lin, Ting Dang, Jintao Cheng
Speech Emotion Recognition (SER) is crucial in human–machine interactions. Previous approaches have predominantly focused on local spatial or channel information and neglected the temporal information in speech. In this paper, to model local and global information at different levels of granularity in speech and to capture temporal, spatial, and channel dependencies in speech signals, we propose a Speech Emotion Recognition network based on CNN-Transformer and multi-dimensional attention mechanisms. Specifically, a stack of CNN blocks is dedicated to capturing local information in speech from a time–frequency perspective. In addition, a time-channel-space attention mechanism is used to enhance features across three dimensions. Moreover, we model local and global dependencies of feature sequences using large convolutional kernels with depthwise separable convolutions and lightweight Transformer modules. We evaluate the proposed method on the IEMOCAP and Emo-DB datasets and show that our approach significantly outperforms state-of-the-art methods. Code is available at https://github.com/SCNU-RISLAB/CNN-Transforemr-and-Multidimensional-Attention-Mechanism.
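As an illustration of what a time-channel-space attention block might look like, here is a hypothetical PyTorch sketch: each branch pools the feature map over the other axes and rescales the input with a sigmoid gate. The axis ordering, pooling choices, and kernel sizes are assumptions for illustration, not the paper's actual design:

```python
import torch
import torch.nn as nn

class TimeChannelSpaceAttention(nn.Module):
    """Reweights a (batch, channels, time, freq) feature map along three axes.

    One hypothetical reading of 'time-channel-space' attention: each branch
    summarizes the input over the other dimensions, then gates the feature
    map with a sigmoid-activated weight.
    """
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
        self.time_conv = nn.Sequential(nn.Conv1d(1, 1, 7, padding=3), nn.Sigmoid())
        self.space_conv = nn.Sequential(nn.Conv2d(1, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):                                      # x: (B, C, T, F)
        c = self.channel_mlp(x.mean(dim=(2, 3)))               # (B, C) channel gate
        x = x * c[:, :, None, None]
        t = self.time_conv(x.mean(dim=(1, 3)).unsqueeze(1))    # (B, 1, T) temporal gate
        x = x * t.unsqueeze(-1)
        s = self.space_conv(x.mean(dim=1, keepdim=True))       # (B, 1, T, F) spatial gate
        return x * s

feats = torch.randn(8, 64, 100, 40)  # e.g. 64-channel time-frequency features
print(TimeChannelSpaceAttention(64)(feats).shape)  # torch.Size([8, 64, 100, 40])
```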
Citations: 0
Vibravox: A dataset of French speech captured with body-conduction audio sensors
IF 2.4 · CAS Tier 3 (Computer Science) · Q2 (ACOUSTICS) · Pub Date: 2025-04-19 · DOI: 10.1016/j.specom.2025.103238
Julien Hauret, Malo Olivier, Thomas Joubaud, Christophe Langrenne, Sarah Poirée, Véronique Zimpfer, Éric Bavu
Vibravox is a dataset compliant with the General Data Protection Regulation (GDPR) containing audio recordings captured with five different body-conduction audio sensors: two in-ear microphones, two bone-conduction vibration pickups, and a laryngophone. The dataset also includes audio data from an airborne microphone used as a reference. The Vibravox corpus contains 45 hours per sensor of speech samples and physiological sounds recorded by 188 participants under different acoustic conditions imposed by a high-order ambisonics 3D spatializer. Annotations about the recording conditions and linguistic transcriptions are also included in the corpus. We conducted a series of experiments on various speech-related tasks, including speech recognition, speech enhancement, and speaker verification. These experiments were carried out using state-of-the-art models to evaluate and compare their performance on signals captured by the different audio sensors offered by the Vibravox dataset, with the aim of gaining a better grasp of their individual characteristics.
Citations: 0
Lexical, syntactic, semantic and acoustic entrainment in Slovak, Spanish, English, and Hungarian: A cross-linguistic comparison
IF 2.4 · CAS Tier 3 (Computer Science) · Q2 (ACOUSTICS) · Pub Date: 2025-04-19 · DOI: 10.1016/j.specom.2025.103240
Jay Kejriwal, Štefan Beňuš
Entrainment is the tendency of speakers to reuse each other's linguistic material (lexical, syntactic, semantic, or acoustic–prosodic) during a conversation. While entrainment has been studied in English and other Germanic languages, it is less researched in other language groups. In this study, we evaluated lexical, syntactic, semantic, and acoustic–prosodic entrainment in four comparable spoken corpora of four typologically different languages (English, Slovak, Spanish, and Hungarian) using comparable tools and methodologies based on DNN embeddings. Our cross-linguistic comparison revealed that Hungarian speakers are closer to their interlocutors and more consistent with their own linguistic features than English, Slovak, and Spanish speakers. Further, comparison across linguistic levels within each language revealed that speakers are closest to their partners, and most consistent with their own linguistic features, at the acoustic level, followed by the semantic, lexical, and syntactic levels. Examining the four languages separately, we found that people's tendency to be close to each other at each turn (proximity) varies across linguistic levels and languages. Additionally, we found that entrainment in lexical, syntactic, semantic, and acoustic–prosodic features is positively correlated in all four datasets. Our results are relevant for the predictions of Interactive Alignment theory (Pickering and Garrod, 2004) and may facilitate implementing entrainment functionality in human–machine interactions (HMI).
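To make the embedding-based methodology concrete, the following toy sketch computes two common entrainment proxies from per-turn DNN embeddings: proximity (similarity to the partner's preceding turn) and consistency (similarity to the speaker's own preceding turn). The cosine-similarity proxy and the turn-pairing scheme are illustrative assumptions, not the authors' exact measures:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def proximity_and_consistency(turn_embeddings, speakers):
    """Turn-level entrainment proxies from per-turn DNN embeddings.

    turn_embeddings: list of 1-D vectors, one per conversational turn.
    speakers: parallel list of speaker labels (e.g. 'A'/'B').
    Proximity: mean similarity between each turn and the partner's
    preceding turn. Consistency: mean similarity between a speaker's
    turn and that same speaker's preceding turn.
    """
    prox, cons = [], []
    prev_by_speaker = {}
    prev_turn, prev_spk = None, None
    for emb, spk in zip(turn_embeddings, speakers):
        if prev_turn is not None and prev_spk != spk:
            prox.append(cosine(emb, prev_turn))
        if spk in prev_by_speaker:
            cons.append(cosine(emb, prev_by_speaker[spk]))
        prev_by_speaker[spk] = emb
        prev_turn, prev_spk = emb, spk
    return np.mean(prox), np.mean(cons)

rng = np.random.default_rng(0)
embs = [rng.normal(size=32) for _ in range(6)]
print(proximity_and_consistency(embs, ['A', 'B', 'A', 'B', 'A', 'B']))
```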
Citations: 0
Expectation of speech style improves audio-visual perception of English vowels
IF 2.4 · CAS Tier 3 (Computer Science) · Q2 (ACOUSTICS) · Pub Date: 2025-04-17 · DOI: 10.1016/j.specom.2025.103243
Joan A. Sereno, Allard Jongman, Yue Wang, Paul Tupper, Dawn M. Behne, Jetic Gu, Haoyao Ruan
Speech perception is influenced by both signal-internal properties and signal-independent knowledge, including communicative expectations. This study investigates how these two factors interact, focusing on the role of speech style expectations. Specifically, we examine how prior knowledge about speech style (clear versus plain speech) affects word identification and speech style judgment. Native English perceivers were presented with English words containing tense versus lax vowels in either clear or plain speech, with trial conditions manipulating whether style prompts (presented immediately prior to the target word) were congruent or incongruent with the actual speech style. The stimuli were also presented in three input modalities: auditory (speaker voice), visual (speaker face), and audio-visual. Results show that prior knowledge of speech style improved accuracy in identifying style after the session when style information in the prompt and target word was consistent, particularly in auditory and audio-visual modalities. Additionally, as expected, clear speech enhanced word intelligibility compared to plain speech, with benefits more evident for tense vowels and in auditory and audio-visual contexts. These results demonstrate that congruent style prompts improve style identification accuracy by aligning with high-level expectations, while clear speech enhances word identification accuracy due to signal-internal modifications. Overall, the current findings suggest an interplay of processing sources of information which are both signal-driven and signal-independent, and that high-level signal-complementary information such as speech style is not separate from, but is embodied in, the signal.
Citations: 0
Neural Chinese silent speech recognition with facial electromyography
IF 2.4 · CAS Tier 3 (Computer Science) · Q2 (ACOUSTICS) · Pub Date: 2025-04-15 · DOI: 10.1016/j.specom.2025.103230
Liang Xie, Yakun Zhang, Hao Yuan, Meishan Zhang, Xingyu Zhang, Changyan Zheng, Ye Yan, Erwei Yin
The majority of work in speech recognition is based on audible speech and has already achieved great success. However, in several special scenarios, the voice might be unavailable. Recently, Gaddy and Klein (2020) presented an initial study of silent speech analysis, aiming to voice silent speech from facial electromyography (EMG). In this work, we present the first study of neural silent speech recognition in Chinese, which goes one step further by converting silent facial EMG signals directly into text. We build a benchmark dataset and then introduce a neural end-to-end model for the task. The model is further optimized with two auxiliary tasks for better feature learning. In addition, we suggest a systematic data augmentation strategy to improve model performance. Experimental results show that our final best model achieves a character error rate of 38.0% on a sentence-level silent speech recognition task. We also provide in-depth analysis to gain a comprehensive understanding of our task and the various models proposed. Although our model achieves initial results, there is still a gap compared to the ideal level, warranting further attention and research.
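For reference, the reported 38.0% figure is a character error rate (CER): the Levenshtein edit distance between the hypothesis and reference transcripts divided by the reference length, the standard metric for Chinese recognition. A minimal sketch of the metric (the sample strings are made up):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein edit distance / reference length."""
    m, n = len(reference), len(hypothesis)
    dp = list(range(n + 1))  # dp[j] = edit distance for prefixes so far
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,       # deletion
                        dp[j - 1] + 1,   # insertion
                        prev + (reference[i - 1] != hypothesis[j - 1]))  # substitution
            prev = cur
    return dp[n] / m

print(cer("今天天气很好", "今天天汽不好"))  # 2 substitutions / 6 chars ~ 0.333
```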
Citations: 0
Disfluency production in children with attention-deficit/hyperactivity disorder during a narrative task
IF 2.4 · CAS Tier 3 (Computer Science) · Q2 (ACOUSTICS) · Pub Date: 2025-04-11 · DOI: 10.1016/j.specom.2025.103244
Annemarie Bijnens, Aurélie Pistono
Limited evidence exists on ADHD-related disfluency and lexical diversity in connected speech, although a significant number of individuals with ADHD experience language difficulties at different linguistic levels. Using a retrospective cross-sectional design with data from the Asymmetries TalkBank database, this study aims to capture differences in disfluency production and lexical diversity between children with ADHD and typically developing (TD) children. The measures include the frequencies of different disfluency subtypes and two lexical diversity measures, and are correlated with performance on a working memory task and a response inhibition task. Results indicate that the ADHD group produced a higher mean frequency of each disfluency type, but none of the differences was significant. Correlation analysis revealed that filled pauses and revisions were negatively correlated with working memory and response inhibition in the ADHD group, whereas they were positively correlated with working memory performance in the TD group. This suggests that the underlying causes of disfluency differ between groups and that further research on speech monitoring ability in children with ADHD is required.
Citations: 0
Understanding perception and production in loan adaptation: Cases of English loans in Mandarin
IF 2.4 · CAS Tier 3 (Computer Science) · Q2 (ACOUSTICS) · Pub Date: 2025-04-08 · DOI: 10.1016/j.specom.2025.103207
Mingchang Lü
This study investigates the formation of English loans in Mandarin through the lens of both perception and production. Excluding loans that involve semantic or lexical adaptation, I explore how perception and production may separately account for the various adaptation patterns of segmental change in phonological loans, i.e., those whose formation is governed solely by phonological processes. Specifically, perceptual interpretation is composed of auditory (acoustic) correlates. Building upon my previous work, I argue that production involves the adapter's awareness of articulatory economy and their attempt to facilitate the interlocutor's perception, in addition to their prosodic knowledge of the native phonology, as addressed at length in my earlier proposals. Conclusions are drawn primarily from language universals, cross-linguistic trends, and coarticulatory factors. The emergent patterns provide compelling evidence that orthographic influence is only marginal.
Citations: 0
Cutting to the chase: The influence of first and second language use on discourse compression
IF 2.4 · CAS Tier 3 (Computer Science) · Q2 (ACOUSTICS) · Pub Date: 2025-04-07 · DOI: 10.1016/j.specom.2025.103241
Evy Woumans, Robert J. Hartsuiker
The present study looked into speakers' capacity to shorten narratives through retelling, and specifically at differences in such discourse compression between bilinguals' first (L1) and second (L2) language. A group of unbalanced Dutch-English bilinguals recounted the events from two cartoons in both languages four times. For each narration, word count (both including and excluding hesitation markers), duration, and fluency were recorded as dependent measures, all of which showed significant compression, i.e., economy in the oral production of the narrative, in both languages. Compression thus occurred in L1 as well as L2, indicating that it relies on similar psycholinguistic mechanisms in both languages. Remarkably, whereas all L2 measures were less compressed in the initial narration than their L1 counterparts, compression at the first retelling was significantly higher in the L2 condition. Hence, whereas lexical access is expected to be more difficult in L2 initially, ultimately leading to increased disfluency, speaking behaviour did not seem to differ much from that in L1 once vocabulary, grammar, and syntax structures were primed.
Citations: 0
The role of informativity and frequency in shaping word durations in English and in Polish
IF 2.4 · CAS Tier 3 (Computer Science) · Q2 (ACOUSTICS) · Pub Date: 2025-04-05 · DOI: 10.1016/j.specom.2025.103239
Kamil Kaźmierski
Overall lexical frequency has long been known to play a role in sound change. Specifically, lexical frequency is negatively correlated with phonetic duration, and as such can be seen as a driver of diachronic reduction processes. However, recent findings suggest that phonetic forms are shaped not by overall lexical frequency but by a word's frequency of occurrence in the phonetic environments that favor a particular type of sound change. For temporal reduction, Seyfarth (2014) shows that words that occur frequently in predictable contexts (low-informativity words) are more temporally reduced than words that occur less frequently in predictable contexts (high-informativity words). In this paper, I replicate Seyfarth's (2014) finding using another corpus of unscripted English, the Nationwide Speech Project corpus (Clopper and Pisoni, 2006), as well as a corpus of another language, Polish: the Greater Poland Spoken Corpus (Kaźmierski et al., 2019; Kul et al., 2019). In both cases, informativity is included as a predictor of theoretical interest in mixed-effects linear regression models of word durations. Informativity, i.e., the frequency of occurrence in low-predictability contexts, is shown to have a statistically significant effect on word durations in both English and Polish. Extending the analysis beyond a replication of Seyfarth (2014), a comparison of the effects of informativity and overall lexical frequency shows that the effect of informativity is somewhat weaker in Polish than in English, lending some support to the notion that morphologically rich languages are less sensitive to contextual predictability.
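For concreteness, informativity in this line of work (following Seyfarth, 2014) is standardly computed as a word's expected surprisal across the contexts it occurs in; assuming bigram contexts for illustration, the quantity is

$$
I(w) \;=\; -\sum_{c} P(c \mid w)\,\log_2 P(w \mid c)
$$

where c ranges over contexts (e.g., the preceding word) and the probabilities are corpus estimates. Low-informativity words are those that mostly occur in contexts where they are highly predictable, and these are the words found to be more temporally reduced.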
Citations: 0