Pub Date : 2025-04-29 DOI: 10.1016/j.specom.2025.103247
Shupei Liu, Linfeng Feng, Yijun Gong, Chengdong Liang, Chen Zhang, Xiao-Lei Zhang, Xuelong Li
While deep-learning-based speaker localization has shown advantages in challenging acoustic environments, it often yields only direction-of-arrival (DOA) cues rather than precise two-dimensional (2D) coordinates. To address this, we propose a novel deep-learning-based 2D speaker localization method leveraging ad-hoc microphone arrays. Specifically, each ad-hoc array comprises randomly distributed microphone nodes, each of which is equipped with a traditional array. Our approach first employs convolutional neural networks at each node to estimate speaker directions. Then, we integrate these DOA estimates using triangulation and clustering techniques to obtain 2D speaker locations. To further boost estimation accuracy, we introduce a node selection algorithm that strategically selects the most reliable nodes. Extensive experiments on both simulated and real-world data demonstrate that our approach significantly outperforms conventional methods, and the proposed node selection further refines performance. The real-world dataset used in the experiments, Libri-adhoc-nodes10, is newly recorded and described for the first time in this paper; it is available online at https://github.com/Liu-sp/Libri-adhoc-nodes10.
{"title":"Deep learning based stage-wise two-dimensional speaker localization with large ad-hoc microphone arrays","authors":"Shupei Liu , Linfeng Feng , Yijun Gong , Chengdong Liang , Chen Zhang , Xiao-Lei Zhang , Xuelong Li","doi":"10.1016/j.specom.2025.103247","DOIUrl":"10.1016/j.specom.2025.103247","url":null,"abstract":"<div><div>While deep-learning-based speaker localization has shown advantages in challenging acoustic environments, it often yields only direction-of-arrival (DOA) cues rather than precise two-dimensional (2D) coordinates. To address this, we propose a novel deep-learning-based 2D speaker localization method leveraging ad-hoc microphone arrays. Specifically, each ad-hoc array comprises randomly distributed microphone nodes, each of which is equipped with a traditional array. Our approach first employs convolutional neural networks at each node to estimate speaker directions.Then, we integrate these DOA estimates using triangulation and clustering techniques to get 2D speaker locations. To further boost the estimation accuracy, we introduce a node selection algorithm that strategically filters the most reliable nodes. Extensive experiments on both simulated and real-world data demonstrate that our approach significantly outperforms conventional methods. The proposed node selection further refines performance. The real-world dataset in the experiment, named Libri-adhoc-node10 which is a newly recorded data described for the first time in this paper, is online available at <span><span>https://github.com/Liu-sp/Libri-adhoc-nodes10</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"172 ","pages":"Article 103247"},"PeriodicalIF":2.4,"publicationDate":"2025-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143892370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-04-23 DOI: 10.1016/j.specom.2025.103242
Xiaoyu Tang, Jiazheng Huang, Yixin Lin, Ting Dang, Jintao Cheng
Speech Emotion Recognition (SER) is crucial in human–machine interactions. Previous approaches have predominantly focused on local spatial or channel information and neglected the temporal information in speech. In this paper, to model local and global information at different levels of granularity in speech and to capture temporal, spatial, and channel dependencies in speech signals, we propose a Speech Emotion Recognition network based on a CNN-Transformer and multi-dimensional attention mechanisms. Specifically, a stack of CNN blocks captures local information in speech from a time–frequency perspective. In addition, a time-channel-space attention mechanism enhances features across three dimensions. Moreover, we model local and global dependencies of feature sequences using large convolutional kernels with depthwise separable convolutions and lightweight Transformer modules. We evaluate the proposed method on the IEMOCAP and Emo-DB datasets and show that our approach significantly improves performance over state-of-the-art methods. Code is available at https://github.com/SCNU-RISLAB/CNN-Transforemr-and-Multidimensional-Attention-Mechanism.
Title: Speech Emotion Recognition via CNN-Transformer and multidimensional attention mechanism (Speech Communication, Vol. 171, Article 103242)
Pub Date : 2025-04-19 DOI: 10.1016/j.specom.2025.103238
Julien Hauret, Malo Olivier, Thomas Joubaud, Christophe Langrenne, Sarah Poirée, Véronique Zimpfer, Éric Bavu
Vibravox is a dataset compliant with the General Data Protection Regulation (GDPR) containing audio recordings made with five different body-conduction audio sensors: two in-ear microphones, two bone-conduction vibration pickups, and a laryngophone. The dataset also includes audio from an airborne microphone used as a reference. The Vibravox corpus contains 45 h per sensor of speech samples and physiological sounds recorded by 188 participants under different acoustic conditions imposed by a high-order ambisonics 3D spatializer. Annotations about the recording conditions and linguistic transcriptions are also included in the corpus. We conducted a series of experiments on various speech-related tasks, including speech recognition, speech enhancement, and speaker verification. These experiments were carried out using state-of-the-art models to evaluate and compare their performance on signals captured by the different audio sensors offered by the Vibravox dataset, with the aim of gaining a better grasp of their individual characteristics.
{"title":"Vibravox: A dataset of french speech captured with body-conduction audio sensors","authors":"Julien Hauret , Malo Olivier , Thomas Joubaud , Christophe Langrenne , Sarah Poirée , Véronique Zimpfer , Éric Bavu","doi":"10.1016/j.specom.2025.103238","DOIUrl":"10.1016/j.specom.2025.103238","url":null,"abstract":"<div><div>Vibravox is a dataset compliant with the General Data Protection Regulation (GDPR) containing audio recordings using five different body-conduction audio sensors: two in-ear microphones, two bone conduction vibration pickups, and a laryngophone. The dataset also includes audio data from an airborne microphone used as a reference. The Vibravox corpus contains 45 h per sensor of speech samples and physiological sounds recorded by 188 participants under different acoustic conditions imposed by a high order ambisonics 3D spatializer. Annotations about the recording conditions and linguistic transcriptions are also included in the corpus. We conducted a series of experiments on various speech-related tasks, including speech recognition, speech enhancement, and speaker verification. These experiments were carried out using state-of-the-art models to evaluate and compare their performances on signals captured by the different audio sensors offered by the Vibravox dataset, with the aim of gaining a better grasp of their individual characteristics.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"172 ","pages":"Article 103238"},"PeriodicalIF":2.4,"publicationDate":"2025-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143892371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-04-19 DOI: 10.1016/j.specom.2025.103240
Jay Kejriwal, Štefan Beňuš
Entrainment is the tendency of speakers to reuse each other’s linguistic material, whether lexical, syntactic, semantic, or acoustic–prosodic, during a conversation. While entrainment has been studied in English and other Germanic languages, it is less researched in other language groups. In this study, we evaluated lexical, syntactic, semantic, and acoustic–prosodic entrainment in four comparable spoken corpora of four typologically different languages (English, Slovak, Spanish, and Hungarian) using comparable tools and methodologies based on DNN embeddings. Our cross-linguistic comparison revealed that Hungarian speakers are closer to their interlocutors and more consistent with their own linguistic features when compared to English, Slovak, and Spanish speakers. Further, comparison across different linguistic levels within each language revealed that speakers are closest to their partners and most consistent with their own linguistic features at the acoustic level, followed by the semantic, lexical, and syntactic levels. Examining the four languages separately, we found that people’s tendency to be close to each other at each turn (proximity) varies at different linguistic levels in different languages. Additionally, we found that entrainment in lexical, syntactic, semantic, and acoustic–prosodic features is positively correlated in all four datasets. Our results are relevant for the predictions of Interactive Alignment theory (Pickering and Garrod, 2004) and may facilitate implementing entrainment functionality in human–machine interactions (HMI).
{"title":"Lexical, syntactic, semantic and acoustic entrainment in Slovak, Spanish, English, and Hungarian: A cross-linguistic comparison","authors":"Jay Kejriwal , Štefan Beňuš","doi":"10.1016/j.specom.2025.103240","DOIUrl":"10.1016/j.specom.2025.103240","url":null,"abstract":"<div><div>Entrainment is the tendency of speakers to reuse each other’s linguistic material, including lexical, syntactic, semantic, or acoustic–prosodic, during a conversation. While entrainment has been studied in English and other Germanic languages, it is less researched in other language groups. In this study, we evaluated lexical, syntactic, semantic, and acoustic–prosodic entrainment in four comparable spoken corpora of four typologically different languages (English, Slovak, Spanish, and Hungarian) using comparable tools and methodologies based on DNN embeddings. Our cross-linguistic comparison revealed that Hungarian speakers are closer to their interlocutors and more consistent with their own linguistic features when compared to English, Slovak, and Spanish speakers. Further, comparison across different linguistic levels within each language revealed that speakers are closest to their partners and most consistent with their own linguistic features at the acoustic level, followed by semantic, lexical, and syntactic levels. Examining the four languages separately, we found that people’s tendency to be close to each other at each turn (proximity) varies at different linguistic levels in different languages. Additionally, we found that entrainment in lexical, syntactic, semantic, and acoustic–prosodic features are positively correlated in all four datasets. Our results are relevant for the predictions of Interactive Alignment theory (Pickering and Garrod, 2004) and may facilitate implementing entrainment functionality in human–machine interactions (HMI).</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"171 ","pages":"Article 103240"},"PeriodicalIF":2.4,"publicationDate":"2025-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143876675","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-04-17 DOI: 10.1016/j.specom.2025.103243
Joan A. Sereno, Allard Jongman, Yue Wang, Paul Tupper, Dawn M. Behne, Jetic Gu, Haoyao Ruan
Speech perception is influenced by both signal-internal properties and signal-independent knowledge, including communicative expectations. This study investigates how these two factors interact, focusing on the role of speech style expectations. Specifically, we examine how prior knowledge about speech style (clear versus plain speech) affects word identification and speech style judgment. Native English perceivers were presented with English words containing tense versus lax vowels in either clear or plain speech, with trial conditions manipulating whether style prompts (presented immediately prior to the target word) were congruent or incongruent with the actual speech style. The stimuli were also presented in three input modalities: auditory (speaker voice), visual (speaker face), and audio-visual. Results show that prior knowledge of speech style improved accuracy in identifying style after the session when style information in the prompt and target word was consistent, particularly in auditory and audio-visual modalities. Additionally, as expected, clear speech enhanced word intelligibility compared to plain speech, with benefits more evident for tense vowels and in auditory and audio-visual contexts. These results demonstrate that congruent style prompts improve style identification accuracy by aligning with high-level expectations, while clear speech enhances word identification accuracy due to signal-internal modifications. Overall, the current findings suggest an interplay of processing sources of information which are both signal-driven and signal-independent, and that high-level signal-complementary information such as speech style is not separate from, but is embodied in, the signal.
{"title":"Expectation of speech style improves audio-visual perception of English vowels","authors":"Joan A. Sereno , Allard Jongman , Yue Wang , Paul Tupper , Dawn M. Behne , Jetic Gu , Haoyao Ruan","doi":"10.1016/j.specom.2025.103243","DOIUrl":"10.1016/j.specom.2025.103243","url":null,"abstract":"<div><div>Speech perception is influenced by both signal-internal properties and signal-independent knowledge, including communicative expectations. This study investigates how these two factors interact, focusing on the role of speech style expectations. Specifically, we examine how prior knowledge about speech style (clear versus plain speech) affects word identification and speech style judgment. Native English perceivers were presented with English words containing tense versus lax vowels in either clear or plain speech, with trial conditions manipulating whether style prompts (presented immediately prior to the target word) were congruent or incongruent with the actual speech style. The stimuli were also presented in three input modalities: auditory (speaker voice), visual (speaker face), and audio-visual. Results show that prior knowledge of speech style improved accuracy in identifying style after the session when style information in the prompt and target word was consistent, particularly in auditory and audio-visual modalities. Additionally, as expected, clear speech enhanced word intelligibility compared to plain speech, with benefits more evident for tense vowels and in auditory and audio-visual contexts. These results demonstrate that congruent style prompts improve style identification accuracy by aligning with high-level expectations, while clear speech enhances word identification accuracy due to signal-internal modifications. Overall, the current findings suggest an interplay of processing sources of information which are both signal-driven and signal-independent, and that high-level signal-complementary information such as speech style is not separate from, but is embodied in, the signal.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"171 ","pages":"Article 103243"},"PeriodicalIF":2.4,"publicationDate":"2025-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143855649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-04-15 DOI: 10.1016/j.specom.2025.103230
Liang Xie, Yakun Zhang, Hao Yuan, Meishan Zhang, Xingyu Zhang, Changyan Zheng, Ye Yan, Erwei Yin
The majority of work in speech recognition is based on audible speech and has already achieved great success. However, in several special scenarios, the voice might be unavailable. Recently, Gaddy and Klein (2020) presented an initial study of silent speech analysis, aiming to voice silent speech from facial electromyography (EMG). In this work, we present the first study of neural silent speech recognition in Chinese, which goes one step further and converts silent facial EMG signals into text directly. We build a benchmark dataset and then introduce a neural end-to-end model for the task. The model is further optimized with two auxiliary tasks for better feature learning. In addition, we suggest a systematic data augmentation strategy to improve model performance. Experimental results show that our final best model achieves a character error rate of 38.0% on a sentence-level silent speech recognition task. We also provide in-depth analysis to gain a comprehensive understanding of the task and the various models proposed. Although our model achieves initial results, there is still a gap compared to the ideal level, warranting further attention and research.
Title: Neural Chinese silent speech recognition with facial electromyography (Speech Communication, Vol. 171, Article 103230)
Pub Date : 2025-04-11 DOI: 10.1016/j.specom.2025.103244
Annemarie Bijnens, Aurélie Pistono
Limited evidence exists on ADHD-related disfluency and lexical diversity behaviour in connected speech, although a significant number of individuals with ADHD experience language difficulties at different linguistic levels. Using a retrospective cross-sectional design with data from the Asymmetries TalkBank database, this study aims to capture differences in disfluency production and lexical diversity between children with ADHD and Typically Developing (TD) children. The measures of interest are the frequencies of different disfluency subtypes and two lexical diversity measures, which are correlated with performance on a working memory task and a response inhibition task. Results indicate that the ADHD group produced a higher mean frequency of each disfluency type, but no differences were found to be significant. Correlation analysis revealed that filled pauses and revisions were negatively correlated with working memory and response inhibition in the ADHD group, whereas they were positively correlated with working memory performance in the TD group. This suggests that the underlying causes of disfluency differ in each group and that further research is required of speech monitoring ability in children with ADHD.
{"title":"Disfluency production in children with attention-deficit/hyperactivity disorder during a narrative task","authors":"Annemarie Bijnens , Aurélie Pistono","doi":"10.1016/j.specom.2025.103244","DOIUrl":"10.1016/j.specom.2025.103244","url":null,"abstract":"<div><div>Limited evidence exists on ADHD-related disfluency and lexical diversity behaviour in connected speech, although a significant number of individuals with ADHD experience language difficulties at different linguistic levels. Using a retrospective cross-sectional design with data from the Asymmetries TalkBank database, this study aims to capture differences in disfluency production and lexical diversity between children with ADHD and Typically Developing (TD) children. These measures include the frequencies of different disfluency subtypes and two lexical diversity measures, and are correlated with performance on a working memory task and a response inhibition task. Results indicate that the ADHD group produced a higher mean frequency of each disfluency type, but no differences were found to be significant. Correlation analysis revealed that filled pauses and revisions were negatively correlated with working memory and response inhibition in the ADHD group, whereas they were positively correlated with working memory performance in the TD group. This suggests that the underlying causes of disfluency differ in each group and that further research is required of speech monitoring ability in children with ADHD.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"171 ","pages":"Article 103244"},"PeriodicalIF":2.4,"publicationDate":"2025-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143848672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-04-08 DOI: 10.1016/j.specom.2025.103207
Mingchang Lü
This study investigates the formation of English loans in Mandarin through the lens of both perception and production. Excluding loans that involve semantic or lexical adaptation, I explore how the two aspects of perception and production may separately account for various adaptation patterns of segmental change in phonological loans—those whose formation is governed solely by phonological processes. Specifically, perceptual interpretation is composed of auditory (acoustic) correlates. Building upon my previous work, I argue that production involves the adapter's awareness of articulatory economy and an attempt to facilitate the interlocutor's perception, in addition to their prosodic knowledge of the native phonology, as addressed at length in my earlier proposals. Conclusions are drawn primarily from language universals, cross-linguistic trends, and coarticulatory factors. The emergent patterns provide compelling evidence that orthographic influence is only marginal.
{"title":"Understanding perception and production in loan adaptation: Cases of English loans in Mandarin","authors":"Mingchang Lü","doi":"10.1016/j.specom.2025.103207","DOIUrl":"10.1016/j.specom.2025.103207","url":null,"abstract":"<div><div>This study investigates the formation of English loans in Mandarin from the lens of both perception and production. Excluding loans that involve semantic or lexical adaptation, I explore how the two aspects of perception and production may separately account for various adaptation patterns of segmental change in phonological loans—those whose formation is governed solely by phonological processes. Specifically, perceptual interpretation is composed of auditory (acoustic) correlates. Building upon my previous work, I argue that production involves the adapter's awareness of articulatory economy and attempt to facilitate the interlocutor's perception, in addition to their prosodic knowledge of the native phonology, as addressed at length in my earlier proposals. Conclusions are drawn primarily upon language universals, cross-linguistic trends, and coarticulatory factors. The emergent patterns provide compelling evidence that orthographic influence is only marginal.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"171 ","pages":"Article 103207"},"PeriodicalIF":2.4,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143874568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-04-07 DOI: 10.1016/j.specom.2025.103241
Evy Woumans, Robert J. Hartsuiker
The present study looked into speakers’ capacity to shorten narratives through retelling, and specifically at differences in such discourse compression between bilinguals’ first (L1) and second (L2) language. A group of unbalanced Dutch-English bilinguals recounted the events from two cartoons in both languages four times. For each narration, word count (both including and excluding hesitation markers), duration, and fluency were recorded as dependent measures, all of which showed significant compression, i.e. economy in the oral production of the narrative, in both languages. Compression thus occurred in L1 as well as L2, indicating it relies on similar psycholinguistic mechanisms in both languages. Remarkably, whereas all L2 measures were less compressed in the initial narration than their L1 counterpart, compression with the first retelling was significantly higher in the L2 condition. Hence, whereas lexical access is expected to be more difficult in L2 initially, ultimately leading to increased disfluency, speaking behaviour did not seem to differ much from that in L1 once vocabulary, grammar, and syntax structures were primed.
{"title":"Cutting to the chase: The influence of first and second language use on discourse compression","authors":"Evy Woumans , Robert J. Hartsuiker","doi":"10.1016/j.specom.2025.103241","DOIUrl":"10.1016/j.specom.2025.103241","url":null,"abstract":"<div><div>The present study looked into speakers’ capacity to shorten narratives through retelling, and specifically at differences in such discourse compression between bilinguals’ first (L1) and second (L2) language. A group of unbalanced Dutch-English bilinguals recounted the events from two cartoons in both languages four times. For each narration, word count (both including and excluding hesitation markers), duration, and fluency were recorded as dependent measures, all of which showed significant compression, i.e. economy in the oral production of the narrative, in both languages. Compression thus occurred in L1 as well as L2, indicating it relies on similar psycholinguistic mechanisms in both languages. Remarkably, whereas all L2 measures were less compressed in the initial narration than their L1 counterpart, compression with the first retelling was significantly higher in the L2 condition. Hence, whereas lexical access is expected to be more difficult in L2 initially, ultimately leading to increased disfluency, speaking behaviour did not seem to differ much from that in L1 once vocabulary, grammar, and syntax structures were primed.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"171 ","pages":"Article 103241"},"PeriodicalIF":2.4,"publicationDate":"2025-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143833788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-04-05 DOI: 10.1016/j.specom.2025.103239
Kamil Kaźmierski
Overall lexical frequency has long been known to play a role in sound change. Specifically, lexical frequency is negatively correlated with phonetic duration, and as such can be seen as a driver of diachronic reduction processes. However, recent findings suggest that it is the frequency of occurrence in a phonetic environment that favors a particular type of sound change, rather than overall lexical frequency, that shapes phonetic forms. For temporal reduction, Seyfarth (2014) shows that words that have a high frequency of occurrence in predictable contexts (low informativity words) are more temporally reduced than words that have a lower frequency of occurrence in predictable contexts (high informativity words). In this paper, I replicate Seyfarth's (2014) finding using another corpus of unscripted English — the Nationwide Speech Project corpus (Clopper and Pisoni, 2006), as well as using a corpus of another language, Polish — the Greater Poland Spoken Corpus (Kaźmierski et al., 2019; Kul et al., 2019). In both cases, informativity is included as a predictor of theoretical interest in mixed-effects linear regression models of word durations. Informativity, i.e. the frequency of occurrence in low-predictability contexts, is shown to have a statistically significant effect on word durations in both English and Polish. Extending the analysis beyond a replication of Seyfarth (2014), a comparison of the effect of informativity and overall lexical frequency shows that the effect of informativity is somewhat weaker in Polish than in English, lending some support to the notion that morphologically rich languages are less sensitive to contextual predictability.
{"title":"The role of informativity and frequency in shaping word durations in English and in Polish","authors":"Kamil Kaźmierski","doi":"10.1016/j.specom.2025.103239","DOIUrl":"10.1016/j.specom.2025.103239","url":null,"abstract":"<div><div>Overall lexical frequency has long been known to play a role in sound change. Specifically, lexical frequency is negatively correlated with phonetic duration, and as such can be seen as a driver of diachronic reduction processes. However, recent findings suggest that it is the frequency of occurrence in a phonetic environment that favors a particular type of sound change, rather than overall lexical frequency, that shapes phonetic forms. For temporal reduction, Seyfarth (2014) shows that words that have a high frequency of occurrence in predictable contexts (<em>low informativity</em> words) are more temporally reduced than words that have a lower frequency of occurrence in predictable contexts (<em>high informativity</em> words). In this paper, I replicate Seyfarth's (2014) finding using another corpus of unscripted English — the Nationwide Speech Project corpus (Clopper and Pisoni, 2006), as well as using a corpus of another language, Polish — the Greater Poland Spoken Corpus (Kaźmierski et al., 2019; Kul et al., 2019). In both cases, informativity is included as a predictor of theoretical interest in mixed-effects linear regression models of word durations. Informativity, i. e. the frequency of occurrence in low-predictability contexts is shown to have a statistically significant effect on word durations in both English and Polish. Extending the analysis beyond a replication of Seyfarth (2014), a comparison of the effect of informativity and overall lexical frequency shows that the effect of informativity is somewhat weaker in Polish than in English, lending some support to the notion that morphologically rich languages are less sensitive to contextual predictability.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"171 ","pages":"Article 103239"},"PeriodicalIF":2.4,"publicationDate":"2025-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143807974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}