Robustness of emotion recognition in dialogue systems: A study on third-party API integrations and black-box attacks (Speech Communication, vol. 175, Article 103316)
Pub Date: 2025-10-13 | DOI: 10.1016/j.specom.2025.103316
Fatma Gumus, M. Fatih Amasyali
This study examines the intricate interplay between third-party AI application programming interfaces (APIs) and adversarial machine learning. The investigation centers on vulnerabilities inherent in AI models that rely on multiple black-box APIs, with particular emphasis on their susceptibility to attacks in the domains of speech and text recognition. Our exploration spans a spectrum of attack strategies, encompassing targeted, indiscriminate, and adaptive targeting approaches, each designed to exploit distinct facets of multi-modal inputs. The results underscore the balance between attack success, average target class confidence, and the number of word swaps and queries required. Targeted attacks exhibit an average success rate of 76%, while adaptive targeting achieves an even higher rate of 88%. Indiscriminate attacks attain a success rate of 73%, highlighting their potency even in the absence of strategic tailoring. We also evaluate the efficiency of each strategy in terms of resource utilization: adaptive targeting is the most efficient approach, requiring an average of 2 word swaps and 140 queries per attack instance, whereas indiscriminate targeting requires an average of 2 word swaps and 150 queries per instance.
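As a rough picture of how a query-based word-swap attack on a black-box classifier operates, the sketch below runs a generic greedy loop that tries candidate substitutions and keeps those that raise the target class confidence. It is not the authors' algorithm; query_api and the candidates dictionary are hypothetical stand-ins supplied by the caller.

    # Generic greedy word-swap attack against a black-box classifier.
    # Hypothetical pieces: `query_api` returns a dict of class -> confidence,
    # and `candidates` maps a word to possible substitutions.
    from typing import Callable, Dict, List, Tuple

    def word_swap_attack(text: str,
                         target_label: str,
                         query_api: Callable[[str], Dict[str, float]],
                         candidates: Dict[str, List[str]],
                         max_queries: int = 150) -> Tuple[str, int, int, float]:
        words = text.split()
        scores = query_api(text)
        queries, swaps, best_conf = 1, 0, scores[target_label]
        for i, w in enumerate(words):
            for sub in candidates.get(w.lower(), []):
                if queries >= max_queries:
                    return " ".join(words), swaps, queries, best_conf
                trial = words[:i] + [sub] + words[i + 1:]
                scores = query_api(" ".join(trial))
                queries += 1
                if scores[target_label] > best_conf:
                    best_conf = scores[target_label]
                    words[i] = sub            # keep the swap that raises target confidence
                    swaps += 1
                    if max(scores, key=scores.get) == target_label:
                        return " ".join(words), swaps, queries, best_conf  # target class reached
        return " ".join(words), swaps, queries, best_conf

The swap and query counters returned here correspond to the per-instance budget figures reported in the abstract.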
{"title":"Robustness of emotion recognition in dialogue systems: A study on third-party API integrations and black-box attacks","authors":"Fatma Gumus , M. Fatih Amasyali","doi":"10.1016/j.specom.2025.103316","DOIUrl":"10.1016/j.specom.2025.103316","url":null,"abstract":"<div><div>There is an intricate interplay between third-party AI application programming interfaces and adversarial machine learning. The investigation centers on vulnerabilities inherent in AI models utilizing multiple black-box APIs, with a particular emphasis on their susceptibility to attacks in the domains of speech and text recognition. Our exploration spans a spectrum of attack strategies, encompassing targeted, indiscriminate, and adaptive targeting approaches, each carefully designed to exploit unique facets of multi-modal inputs. The results underscore the intricate balance between attack success, average target class confidence, and the density of swaps and queries. Remarkably, targeted attacks exhibit an average success rate of 76%, while adaptive targeting achieves an even higher rate of 88%. Conversely, indiscriminate attacks attain an intermediate success rate of 73%, highlighting their potency even in the absence of strategic tailoring. Moreover, our strategies’ efficiency is evaluated through a resource utilization lens. Our findings reveal adaptive targeting as the most efficient approach, with an average of 2 word swaps and 140 queries per attack instance. In contrast, indiscriminate targeting requires an average of 2 word swaps and 150 queries per instance.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"175 ","pages":"Article 103316"},"PeriodicalIF":3.0,"publicationDate":"2025-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145325872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An acoustic analysis of the nasal electrolarynx in healthy participants (Speech Communication, vol. 175, Article 103315)
Pub Date: 2025-10-01 | DOI: 10.1016/j.specom.2025.103315
Ching-Hung Lai, Shu-Wei Tsai, Chenhao Chiu, Yung-An Tsou, Ting-Shou Chang, David Shang-Yu Hung, Miyuki Hsing-Chun Hsieh, I-Pei Lee, Tammy Tsai
The nasal electrolarynx (NEL) is an innovative device that assists patients without vocal folds or under endotracheal intubation in producing speech sounds. The NEL transmits the acoustic wave along a different path from that of the traditional electrolarynx: the wave starts at the nostril, passes through the nasal cavity, velopharyngeal port, and oral cavity, and exits at the lips. There are several advantages to the NEL, including being non-handheld and not requiring a specific “sweet spot.” However, little is known about the acoustic characteristics of the NEL. This study investigated the acoustic characteristics of the NEL compared to normal speech in ten participants who completed two vowel production sessions. Compared to normal speech, NEL speech had low-frequency deficits in the linear predictive coding spectrum, higher first and second formants, decreased amplitude of the first formant, and increased amplitude of the nasal pole. The results identify the general acoustic features of the NEL, which are discussed using a tube model of the vocal tract and perturbation theory. Understanding the acoustic properties of the NEL will help refine the acoustic source and speech recognition in future studies.
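As background for the tube-model discussion mentioned in the abstract, the sketch below computes the resonances of a uniform tube closed at one end and open at the other, using the standard quarter-wavelength approximation F_n = (2n - 1)c / (4L). It is only a first-order illustration: the tube length is an assumed value, and the real nostril-to-lips path involves coupled nasal and oral cavities that a single uniform tube does not capture.

    # Quarter-wavelength resonances of a uniform tube closed at one end,
    # open at the other: F_n = (2n - 1) * c / (4 * L).
    # L = 0.17 m approximates an adult vocal tract; the value is illustrative only.
    def tube_resonances(length_m: float, n_resonances: int = 3, c: float = 343.0):
        return [(2 * n - 1) * c / (4.0 * length_m) for n in range(1, n_resonances + 1)]

    print(tube_resonances(0.17))  # ~[504, 1513, 2522] Hz, close to a neutral vowel's formants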
{"title":"An acoustic analysis of the nasal electrolarynx in healthy participants","authors":"Ching-Hung Lai , Shu-Wei Tsai , Chenhao Chiu , Yung-An Tsou , Ting-Shou Chang , David Shang-Yu Hung , Miyuki Hsing-Chun Hsieh , I-Pei Lee , Tammy Tsai","doi":"10.1016/j.specom.2025.103315","DOIUrl":"10.1016/j.specom.2025.103315","url":null,"abstract":"<div><div>The nasal electrolarynx (NEL) is an innovative device that assists patients without vocal folds or under endotracheal intubation in producing speech sounds. The NEL has a different path for acoustic wave transmission to the traditional electrolarynx that starts from the nostril, passes through the nasal cavity, velopharyngeal port, and oral cavity, and exits the lips. There are several advantages to the NEL, including being non-handheld and not requiring a specific “sweet spot.” However, little is known about the acoustic characteristics of the NEL. This study investigated the acoustic characteristics of the NEL compared to normal speech using ten participants involved in two vowel production sessions. Compared to normal speech, NEL speech had low-frequency deficits in the linear predictive coding spectrum, higher first and second formants, decreased amplitude of the first formant, and increased amplitude of the nasal pole. The results identify the general acoustic features of the NEL, which are discussed using a tube model of the vocal tract and perturbation theory. Understanding the acoustic properties of NEL will help refine the acoustic source and speech recognition in future studies.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"175 ","pages":"Article 103315"},"PeriodicalIF":3.0,"publicationDate":"2025-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145269869","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LORT: Locally refined convolution and Taylor transformer for monaural speech enhancement (Speech Communication, vol. 175, Article 103314)
Pub Date: 2025-09-30 | DOI: 10.1016/j.specom.2025.103314
Junyu Wang, Zizhen Lin, Tianrui Wang, Meng Ge, Longbiao Wang, Jianwu Dang
Achieving superior enhancement performance while maintaining a low parameter count and computational complexity remains a challenge in the field of speech enhancement. In this paper, we introduce LORT, a novel architecture that integrates spatial-channel enhanced Taylor Transformer and locally refined convolution for efficient and robust speech enhancement. We propose a Taylor multi-head self-attention (T-MSA) module enhanced with spatial-channel enhancement attention (SCEA), designed to facilitate inter-channel information exchange and alleviate the spatial attention limitations inherent in Taylor-based Transformers. To complement global modeling, we further present a locally refined convolution (LRC) block that integrates convolutional feed-forward layers, time–frequency dense local convolutions, and gated units to capture fine-grained local details. Built upon a U-Net-like encoder–decoder structure with only 16 output channels in the encoder, LORT processes noisy inputs through multi-resolution T-MSA modules using alternating downsampling and upsampling operations. The enhanced magnitude and phase spectra are decoded independently and optimized through a composite loss function that jointly considers magnitude, complex, phase, discriminator, and consistency objectives. Experimental results on the VCTK+DEMAND and DNS Challenge datasets demonstrate that LORT achieves competitive or superior performance to state-of-the-art (SOTA) models with only 0.96M parameters, highlighting its effectiveness for real-world speech enhancement applications with limited computational resources.
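To make the "Taylor" idea concrete: one common way to build a Taylor Transformer is to replace exp(q·k) in softmax attention with its first-order expansion 1 + q·k, which lets attention be computed in linear rather than quadratic time. The sketch below shows that generic linearization only, with l2-normalized queries and keys so the weights stay non-negative; the paper's T-MSA module and its spatial-channel enhancement attention are not reproduced here.

    import numpy as np

    def taylor_linear_attention(Q, K, V):
        """First-order Taylor approximation of softmax attention:
        exp(q.k) ~ 1 + q.k, computed in O(N * d^2) instead of O(N^2 * d).
        Q, K, V: arrays of shape (N, d). Illustrative sketch only."""
        # l2-normalize so q.k lies in [-1, 1] and 1 + q.k stays non-negative
        Q = Q / (np.linalg.norm(Q, axis=-1, keepdims=True) + 1e-8)
        K = K / (np.linalg.norm(K, axis=-1, keepdims=True) + 1e-8)
        N = K.shape[0]
        numer = V.sum(axis=0, keepdims=True) + Q @ (K.T @ V)   # sum_j (1 + q_i.k_j) v_j
        denom = N + Q @ K.sum(axis=0)                          # sum_j (1 + q_i.k_j)
        return numer / denom[:, None]

    out = taylor_linear_attention(np.random.randn(100, 16),
                                  np.random.randn(100, 16),
                                  np.random.randn(100, 16))
    print(out.shape)  # (100, 16)

Because K.T @ V is computed once, the cost grows linearly with the number of time-frequency positions, which is the property that makes such attention attractive for low-complexity enhancement models.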
{"title":"LORT: Locally refined convolution and Taylor transformer for monaural speech enhancement","authors":"Junyu Wang , Zizhen Lin , Tianrui Wang , Meng Ge , Longbiao Wang , Jianwu Dang","doi":"10.1016/j.specom.2025.103314","DOIUrl":"10.1016/j.specom.2025.103314","url":null,"abstract":"<div><div>Achieving superior enhancement performance while maintaining a low parameter count and computational complexity remains a challenge in the field of speech enhancement. In this paper, we introduce LORT, a novel architecture that integrates spatial-channel enhanced Taylor Transformer and locally refined convolution for efficient and robust speech enhancement. We propose a Taylor multi-head self-attention (T-MSA) module enhanced with spatial-channel enhancement attention (SCEA), designed to facilitate inter-channel information exchange and alleviate the spatial attention limitations inherent in Taylor-based Transformers. To complement global modeling, we further present a locally refined convolution (LRC) block that integrates convolutional feed-forward layers, time–frequency dense local convolutions, and gated units to capture fine-grained local details. Built upon a U-Net-like encoder–decoder structure with only 16 output channels in the encoder, LORT processes noisy inputs through multi-resolution T-MSA modules using alternating downsampling and upsampling operations. The enhanced magnitude and phase spectra are decoded independently and optimized through a composite loss function that jointly considers magnitude, complex, phase, discriminator, and consistency objectives. Experimental results on the VCTK+DEMAND and DNS Challenge datasets demonstrate that LORT achieves competitive or superior performance to state-of-the-art (SOTA) models with only 0.96M parameters, highlighting its effectiveness for real-world speech enhancement applications with limited computational resources.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"175 ","pages":"Article 103314"},"PeriodicalIF":3.0,"publicationDate":"2025-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145223149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MDCNN: A multimodal dual-CNN recursive model for fake news detection via audio- and text-based speech emotion recognition (Speech Communication, vol. 175, Article 103313)
Pub Date: 2025-09-24 | DOI: 10.1016/j.specom.2025.103313
Hongchen Wu, Hongxuan Li, Xiaochang Fang, Mengqi Tang, Hongzhu Yu, Bing Yu, Meng Li, Zhaorong Jing, Yihong Meng, Wei Chen, Yu Liu, Chenfei Sun, Shuang Gao, Huaxiang Zhang
The increasing complexity and diversity of emotional expression pose challenges when identifying fake news conveyed through text and audio formats. Integrating emotional cues derived from data offers a promising approach for balancing the tradeoff between the volume and quality of data. Leveraging recent advancements in speech emotion recognition (SER), our study proposes a Multimodal Recursive Dual-Convolutional Neural Network Model (MDCNN) for fake news detection, with a focus on sentiment analysis based on audio and text. Our proposed model employs convolutional layers to extract features from both audio and text inputs, facilitating an effective feature fusion process for sentiment classification. Through a deep bidirectional recursive encoder, the model can better understand audio and text features for determining the final emotional category. Experiments conducted on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset, which contains 5531 samples across four emotion types (anger, happiness, neutrality, and sadness), demonstrate the superior performance of the MDCNN. Its weighted average precision (WAP) is 78.8%, which is 2.5% higher than that of the best baseline. Compared with the existing sentiment analysis models, our approach exhibits notable enhancements in terms of accurately detecting neutral categories, thereby addressing a common challenge faced by the prior models. These findings underscore the efficacy of the MDCNN in multimodal sentiment analysis tasks and its significant achievements in neutral category classification tasks, offering a robust solution for precisely detecting fake news and conducting nuanced emotional analyses in speech recognition scenarios.
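To illustrate the general shape of a dual-CNN audio/text fusion classifier of this kind, here is a minimal sketch. The layer sizes, the mean pooling, and the use of a bidirectional GRU as the "recursive encoder" are assumptions for the sketch, not the MDCNN's actual configuration.

    import torch
    import torch.nn as nn

    class DualBranchEmotionNet(nn.Module):
        """Illustrative dual-CNN + bidirectional recurrent encoder for
        4-class emotion prediction from audio features and text embeddings.
        Dimensions are placeholders, not the MDCNN's actual configuration."""
        def __init__(self, audio_dim=40, text_dim=300, hidden=128, n_classes=4):
            super().__init__()
            self.audio_cnn = nn.Sequential(nn.Conv1d(audio_dim, hidden, 5, padding=2), nn.ReLU())
            self.text_cnn = nn.Sequential(nn.Conv1d(text_dim, hidden, 3, padding=1), nn.ReLU())
            # bidirectional GRU stands in for the "deep bidirectional recursive encoder"
            self.encoder = nn.GRU(2 * hidden, hidden, num_layers=2,
                                  batch_first=True, bidirectional=True)
            self.classifier = nn.Linear(2 * hidden, n_classes)

        def forward(self, audio, text):
            # audio: (B, T, audio_dim); text: (B, T, text_dim), assumed time-aligned here
            a = self.audio_cnn(audio.transpose(1, 2)).transpose(1, 2)   # (B, T, hidden)
            t = self.text_cnn(text.transpose(1, 2)).transpose(1, 2)     # (B, T, hidden)
            fused, _ = self.encoder(torch.cat([a, t], dim=-1))          # (B, T, 2*hidden)
            return self.classifier(fused.mean(dim=1))                   # pooled logits (B, n_classes)

    logits = DualBranchEmotionNet()(torch.randn(2, 50, 40), torch.randn(2, 50, 300))
    print(logits.shape)  # torch.Size([2, 4])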
{"title":"MDCNN: A multimodal dual-CNN recursive model for fake news detection via audio- and text-based speech emotion recognition","authors":"Hongchen Wu, Hongxuan Li, Xiaochang Fang, Mengqi Tang, Hongzhu Yu, Bing Yu, Meng Li, Zhaorong Jing, Yihong Meng, Wei Chen, Yu Liu, Chenfei Sun, Shuang Gao, Huaxiang Zhang","doi":"10.1016/j.specom.2025.103313","DOIUrl":"10.1016/j.specom.2025.103313","url":null,"abstract":"<div><div>The increasing complexity and diversity of emotional expression pose challenges when identifying fake news conveyed through text and audio formats. Integrating emotional cues derived from data offers a promising approach for balancing the tradeoff between the volume and quality of data. Leveraging recent advancements in speech emotion recognition (SER), our study proposes a Multimodal Recursive Dual-Convolutional Neural Network Model (MDCNN) for fake news detection, with a focus on sentiment analysis based on audio and text. Our proposed model employs convolutional layers to extract features from both audio and text inputs, facilitating an effective feature fusion process for sentiment classification. Through a deep bidirectional recursive encoder, the model can better understand audio and text features for determining the final emotional category. Experiments conducted on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset, which contains 5531 samples across four emotion types—anger, happiness, neutrality, and sadness—demonstrate the superior performance of the MDCNN. Its weighted average precision (WAP) is 78.8 %, which is 2.5 % higher than that of the best baseline. Compared with the existing sentiment analysis models, our approach exhibits notable enhancements in terms of accurately detecting neutral categories, thereby addressing a common challenge faced by the prior models. These findings underscore the efficacy of the MDCNN in multimodal sentiment analysis tasks and its significant achievements in neutral category classification tasks, offering a robust solution for precisely detecting fake news and conducting nuanced emotional analyses in speech recognition scenarios.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"175 ","pages":"Article 103313"},"PeriodicalIF":3.0,"publicationDate":"2025-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145223148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Phonetic reduction is associated with positive assessment and other pragmatic functions (Speech Communication, vol. 175, Article 103305)
Pub Date: 2025-09-17 | DOI: 10.1016/j.specom.2025.103305
Nigel G. Ward, Raul O. Gomez, Carlos A. Ortega, Georgina Bugarini
A fundamental goal of speech science is to inventory the meaning-conveying elements of human speech. This article provides evidence for including phonetic reduction in this inventory. Based on analysis of dialog data, we find that phonetic reduction is common with several important pragmatic functions, including the expression of positive assessment, in both American English and Mexican Spanish. For American English, we confirm, in a controlled experiment, that people speaking in a positive tone generally do indeed use more reduced forms.
{"title":"Phonetic reduction is associated with positive assessment and other pragmatic functions","authors":"Nigel G. Ward, Raul O. Gomez, Carlos A. Ortega, Georgina Bugarini","doi":"10.1016/j.specom.2025.103305","DOIUrl":"10.1016/j.specom.2025.103305","url":null,"abstract":"<div><div>A fundamental goal of speech science is to inventory the meaning-conveying elements of human speech. This article provides evidence for including phonetic reduction in this inventory. Based on analysis of dialog data, we find that phonetic reduction is common with several important pragmatic functions, including the expression of positive assessment, in both American English and Mexican Spanish. For American English, we confirm, in a controlled experiment, that people speaking in a positive tone generally do indeed use more reduced forms.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"175 ","pages":"Article 103305"},"PeriodicalIF":3.0,"publicationDate":"2025-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145196002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MC-Mamba: Cross-modal target speaker extraction model based on multiple consistency (Speech Communication, vol. 174, Article 103304)
Pub Date: 2025-09-16 | DOI: 10.1016/j.specom.2025.103304
Ke Lv, Yuanjie Deng, Ying Wei
Target speaker extraction technology aims to extract the target speaker’s speech from mixed speech based on related cues. When visual information is used as a cue, a heterogeneity problem arises because the audio and visual inputs belong to different modalities. Therefore, some works have extracted visual features consistent with the target speech to mitigate the heterogeneity problem. However, most methods only consider a single type of consistency, which is insufficient to mitigate the modality gap. Furthermore, time-domain speaker extraction models still face modeling challenges when processing speech with numerous time steps. In this work, we propose MC-Mamba, a cross-modal target speaker extraction model based on multiple consistency. We design a consistent visual feature extractor to extract visual features that are consistent with the target speaker’s identity and content. Content-consistent visual features are used for audio–visual feature fusion, while identity-consistent visual features constrain the identity of the separated speech. Notably, when extracting content-consistent visual features, our method does not rely on additional text datasets as labels, as is common in other works, enhancing its practical applicability. The Mamba blocks within the model efficiently process long speech signals by capturing both local and global information. Comparative experimental results show that our proposed speaker extraction model outperforms other state-of-the-art models in terms of speech quality and clarity.
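One simple way to picture the "identity-consistent" constraint described above is as an auxiliary loss term that pulls the speaker embedding of the separated speech toward an identity embedding predicted from the visual stream, added to the usual reconstruction objective. The sketch below is only a hedged illustration of that idea: the SI-SDR reconstruction term, the cosine-similarity consistency term, and the weighting are assumptions, not the paper's actual training objective.

    import torch
    import torch.nn.functional as F

    def si_sdr_loss(est, ref, eps=1e-8):
        """Negative scale-invariant SDR, a common time-domain reconstruction loss."""
        ref = ref - ref.mean(dim=-1, keepdim=True)
        est = est - est.mean(dim=-1, keepdim=True)
        proj = (torch.sum(est * ref, dim=-1, keepdim=True) /
                (torch.sum(ref ** 2, dim=-1, keepdim=True) + eps)) * ref
        noise = est - proj
        ratio = torch.sum(proj ** 2, dim=-1) / (torch.sum(noise ** 2, dim=-1) + eps)
        return -10 * torch.log10(ratio + eps).mean()

    def identity_consistency_loss(speech_emb, visual_id_emb):
        """1 - cosine similarity between the separated speech's speaker embedding
        and the identity embedding predicted from the visual features."""
        return (1 - F.cosine_similarity(speech_emb, visual_id_emb, dim=-1)).mean()

    def total_loss(est_wav, ref_wav, speech_emb, visual_id_emb, lam=0.1):
        # lam is an illustrative weight, not a value from the paper
        return si_sdr_loss(est_wav, ref_wav) + lam * identity_consistency_loss(speech_emb, visual_id_emb)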
{"title":"MC-Mamba: Cross-modal target speaker extraction model based on multiple consistency","authors":"Ke Lv , Yuanjie Deng , Ying Wei","doi":"10.1016/j.specom.2025.103304","DOIUrl":"10.1016/j.specom.2025.103304","url":null,"abstract":"<div><div>Target speaker extraction technology aims to extract the target speaker’s speech from mixed speech based on related cues. When using visual information as a cue, there exists a heterogeneity problem between audio and visual modalities, as they are different modalities. Therefore, some works have extracted visual features consistent with the target speech to mitigate the heterogeneity problem. However, most methods only consider a single type of consistency, which is insufficient to mitigate the modality gap. Furthermore, time-domain speaker extraction models still face modeling challenges when processing speech with numerous time steps. In this work, we propose MC-Mamba, a cross-modal target speaker extraction model based on multiple consistency. We design a consistent visual feature extractor to extract visual features that are consistent with the target speaker’s identity and content. Content-consistent visual features are used for audio–visual feature fusion, while identity-consistent visual features constrain the identity of separated speech. Notably, when extracting content-consistent visual features, our method does not rely on additional text datasets as labels, as is common in other works, enhancing its practical applicability. The Mamba blocks within the model efficiently process long speech signals by capturing both local and global information. Comparative experimental results show that our proposed speaker extraction model outperforms other state-of-the-art models in terms of speech quality and clarity.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"174 ","pages":"Article 103304"},"PeriodicalIF":3.0,"publicationDate":"2025-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145105376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Domain adaptation using non-parallel target domain corpus for self-supervised learning-based automatic speech recognition (Speech Communication, vol. 174, Article 103303)
Pub Date: 2025-09-15 | DOI: 10.1016/j.specom.2025.103303
Takahiro Kinouchi, Atsunori Ogawa, Yukoh Wakabayashi, Kengo Ohta, Norihide Kitaoka
The recognition accuracy of conventional automatic speech recognition (ASR) systems depends heavily on the amount of speech and associated transcription data available in the target domain for model training. However, preparing parallel speech and text data each time a model is trained for a new domain is costly and time-consuming. To solve this problem, we propose a method of domain adaptation that does not require the use of a large amount of parallel target domain training data, as most of the data used for model training is not from the target domain. Instead, only target domain speech is used for model training, along with non-target domain speech and its parallel text data, i.e., the domains and contents of the two types of training data do not correspond to one another. Collecting this type of training data is relatively inexpensive. Domain adaptation is performed in two steps: (1) A pre-trained wav2vec 2.0 model is further pre-trained using a large amount of target domain speech data and is then fine-tuned using a large amount of non-target domain speech and its transcriptions. (2) The density ratio approach (DRA) is applied during inference with a language model (LM) trained on target domain text that is unrelated to, and collected independently of, the wav2vec 2.0 training data. Experimental evaluation showed that, when no parallel target domain data was available, the proposed domain adaptation lowered the character error rate (CER) on the target domain test set by 10.4 points with wav2vec 2.0 and by 3.9 points with XLS-R compared to the baselines, corresponding to relative CER reductions of 34.4% and 16.2%.
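For readers unfamiliar with the density ratio approach, its inference-time scoring can be summarized as shallow fusion in which an in-domain LM score is added and a source-domain LM score is subtracted, approximating the ratio of target to source text priors. A minimal sketch of that rescoring rule follows; the function names, the weights, and the use of whole-hypothesis log-probabilities are assumptions for illustration, not the paper's exact formulation.

    def density_ratio_score(asr_logprob: float,
                            target_lm_logprob: float,
                            source_lm_logprob: float,
                            lam_target: float = 0.5,
                            lam_source: float = 0.5,
                            length: int = 1,
                            length_penalty: float = 0.0) -> float:
        """Density-ratio-style hypothesis score: the source-domain LM implicitly
        learned by the end-to-end ASR model is discounted and a target-domain LM
        is added, approximating log p_target(y) / p_source(y). Weights are illustrative."""
        return (asr_logprob
                + lam_target * target_lm_logprob
                - lam_source * source_lm_logprob
                + length_penalty * length)

    # Rescoring an n-best list: pick the hypothesis with the highest fused score.
    # hyps = [(text, asr_lp, tgt_lp, src_lp, n_tokens), ...]  (hypothetical structure)
    def rescore(hyps):
        return max(hyps, key=lambda h: density_ratio_score(h[1], h[2], h[3], length=h[4]))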
{"title":"Domain adaptation using non-parallel target domain corpus for self-supervised learning-based automatic speech recognition","authors":"Takahiro Kinouchi , Atsunori Ogawa , Yukoh Wakabayashi , Kengo Ohta , Norihide Kitaoka","doi":"10.1016/j.specom.2025.103303","DOIUrl":"10.1016/j.specom.2025.103303","url":null,"abstract":"<div><div>The recognition accuracy of conventional automatic speech recognition (ASR) systems depends heavily on the amount of speech and associated transcription data available in the target domain for model training. However, preparing parallel speech and text data each time a model is trained for a new domain is costly and time-consuming. To solve this problem, we propose a method of domain adaptation that does not require the use of a large amount of parallel target domain training data, as most of the data used for model training is not from the target domain. Instead, only target domain speech is used for model training, along with non-target domain speech and its parallel text data, i.e., the domains and contents of the two types of training data do not correspond to one another. Collecting this type of training data is relatively inexpensive. Domain adaptation is performed in two steps: (1) A pre-trained wav2vec<!--> <!-->2.0 model is further pre-trained using a large amount of target domain speech data and is then fine-tuned using a large amount of non-target domain speech and its transcriptions. (2) The density ratio approach (DRA) is applied during inference to a language model (LM) trained using target domain text unrelated to, and independently from, the wav2vec<!--> <!-->2.0 training. Experimental evaluation illustrated that the proposed domain adaptation obtained character error rate (CER) 10.4<!--> <!-->pts lower than baseline with wav2vec<!--> <!-->2.0 and 3.9<!--> <!-->pts with XLS-R under the situation that the parallel target domain data is unavailable against the target domain test set, achieving 34.4% and 16.2% reductions in relative CER.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"174 ","pages":"Article 103303"},"PeriodicalIF":3.0,"publicationDate":"2025-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145105375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DVSA: A focused and efficient sparse attention via explicit selection for speech recognition (Speech Communication, vol. 174, Article 103300)
Pub Date: 2025-09-04 | DOI: 10.1016/j.specom.2025.103300
Minghan Zhang, Jing Song, Fei Xie, Ke Shi, Zhiyuan Guo, Fuliang Weng
Self-attention (SA) first demonstrated its power in handling text sequences in machine translation tasks, and some studies have successfully applied it to automatic speech recognition (ASR) models. However, speech sequences exhibit significantly lower information density than text sequences, containing abundant silent, repetitive, and overlapping segments. Recent studies have also pointed out that the full attention mechanism used to extract global dependency relationships is not indispensable for state-of-the-art ASR models. Conventional full attention incurs quadratic computational complexity, and may extract redundant or even negative information. To address this, we propose Diagonal and Vertical Self-Attention (DVSA), a sparse attention mechanism for ASR. To extract more focused dependencies from the speech sequence with higher efficiency, we optimize the traditional SA calculation process by explicitly selecting and calculating only a subset of important dot products. This eliminates the misleading effect of dot products with common query degrees on the model and greatly alleviates the quadratic computational complexity. Experiments on LibriSpeech and Aishell-1 show that DVSA improves the performance of a Conformer-based model (a dominant architecture in ASR) by 6.5% and 5.7%, respectively, over traditional full attention, while significantly reducing computational complexity. Notably, DVSA enables reducing encoder layers by 33% without performance degradation, yielding additional savings in parameters and computation. As a result, this new approach achieves improvements in all three major metrics: accuracy, model size, and training and testing time efficiency.
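The "diagonal and vertical" sparsity pattern can be pictured as an attention mask that keeps a band around the diagonal (local context) plus a few selected columns whose keys every query may attend to. The sketch below builds such a mask and applies it to ordinary scaled dot-product attention; the band width and the column choice (here fixed indices passed by the caller) are placeholders, since the paper's explicit selection of important dot products is not reproduced, and a real implementation would compute only the selected products rather than masking a full score matrix.

    import numpy as np

    def diag_vertical_mask(n: int, band: int, columns):
        """Boolean (n, n) mask: True where a query may attend to a key.
        `band` keeps |i - j| <= band (diagonal); `columns` are globally visible keys."""
        idx = np.arange(n)
        mask = np.abs(idx[:, None] - idx[None, :]) <= band
        mask[:, list(columns)] = True
        return mask

    def sparse_attention(Q, K, V, mask):
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        scores = np.where(mask, scores, -1e9)          # drop unselected dot products
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V

    n, d = 100, 16
    Q, K, V = (np.random.randn(n, d) for _ in range(3))
    out = sparse_attention(Q, K, V, diag_vertical_mask(n, band=4, columns=[0, 25, 50, 75]))
    print(out.shape)  # (100, 16)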
{"title":"DVSA: A focused and efficient sparse attention via explicit selection for speech recognition","authors":"Minghan Zhang , Jing Song , Fei Xie , Ke Shi , Zhiyuan Guo , Fuliang Weng","doi":"10.1016/j.specom.2025.103300","DOIUrl":"10.1016/j.specom.2025.103300","url":null,"abstract":"<div><div>Self-attention (SA) originally demonstrated its powerful ability in handling text sequences in machine translation tasks, and some studies have successfully applied it to automatic speech recognition (ASR) models. However, speech sequences exhibit significantly lower information density than text sequences, containing abundant silent, repetitive, and overlapping segments. Recent studies have also pointed out that the full attention mechanism used to extract global dependency relationships is not indispensable for state-of-the-art ASR models. Conventional full attention consumes quadratic computational complexity, and may extract redundant or even negative information. To address this, we propose Diagonal and Vertical Self-Attention (DVSA), a sparse attention mechanism for ASR. To extract more focused dependencies from the speech sequence with higher efficiency, we optimize the traditional SA calculation process by explicitly selecting and calculating only a subset of important dot products. This eliminates the misleading effect of dot products with common query degrees on the model and greatly alleviates the quadratic computational complexity. Experiments on LibriSpeech and Aishell-1 show that DVSA improves the performance of a Conformer-based model (a dominant architecture in ASR) by 6.5 % and 5.7 % respectively over traditional full attention, while significantly reducing computational complexity. Notably, DVSA enables reducing encoder layers by 33 % without performance degradation, yielding additional savings in parameters and computation. As a result, this new approach achieves the improvements in all three major metrics: accuracy, model size, and training and testing time efficiency.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"174 ","pages":"Article 103300"},"PeriodicalIF":3.0,"publicationDate":"2025-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145049422","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Benefits of musical experience on whistled consonant categorization: analyzing the cognitive transfer processes (Speech Communication, vol. 174, Article 103302)
Pub Date: 2025-09-02 | DOI: 10.1016/j.specom.2025.103302
Anaïs Tran Ngoc, Julien Meyer, Fanny Meunier
In this study, we investigated the transfer of musical skills to speech perception by analyzing the perception and categorization of consonants produced in whistled speech, a naturally modified speech form. The study had two main objectives: (i) to explore the effects of different levels of musical skill on speech perception, and (ii) to better understand the type of skills transferred by focusing on a group of high-level musicians, playing various instruments. Within this high-level group, we aimed to disentangle general cognitive transfers from sound-specific transfers by considering instrument specialization, contrasting general musical knowledge (shared by all instruments) with instrument-specific ones. We focused on four instruments: voice, violin, piano and flute. Our results confirm a general musical advantage and suggest that only a small amount of musical experience is sufficient for musical skills to benefit whistled speech perception. However, higher-level musicians reached better performances, with differences for specific consonants. Moreover, musical expertise appears to enhance rapid adaptation to the whistled signal throughout the experiment and our results highlight the specificity of instrument expertise. Consistent with previous research showing the impact of the instrument played, the differences observed in whistled speech processing among high-level musicians seem to be primarily due to instrument-specific expertise.
{"title":"Benefits of musical experience on whistled consonant categorization: analyzing the cognitive transfer processes","authors":"Anaïs Tran Ngoc , Julien Meyer , Fanny Meunier","doi":"10.1016/j.specom.2025.103302","DOIUrl":"10.1016/j.specom.2025.103302","url":null,"abstract":"<div><div>In this study, we investigated the transfer of musical skills to speech perception by analyzing the perception and categorization of consonants produced in whistled speech, a naturally modified speech form. The study had two main objectives: (i) to explore the effects of different levels of musical skill on speech perception, and (ii) to better understand the type of skills transferred by focusing on a group of high-level musicians, playing various instruments. Within this high-level group, we aimed to disentangle general cognitive transfers from sound-specific transfers by considering instrument specialization, contrasting general musical knowledge (shared by all instruments) with instrument-specific ones. We focused on four instruments: voice, violin, piano and flute. Our results confirm a general musical advantage and suggest that only a small amount of musical experience is sufficient for musical skills to benefit whistled speech perception. However, higher-level musicians reached better performances, with differences for specific consonants. Moreover, musical expertise appears to enhance rapid adaptation to the whistled signal throughout the experiment and our results highlight the specificity of instrument expertise. Consistent with previous research showing the impact of the instrument played, the differences observed in whistled speech processing among high-level musicians seem to be primarily due to instrument-specific expertise.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"174 ","pages":"Article 103302"},"PeriodicalIF":3.0,"publicationDate":"2025-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145049420","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Individual differences in language acquisition: The impact of study abroad on native English speakers learning Spanish (Speech Communication, vol. 174, Article 103301)
Pub Date: 2025-09-02 | DOI: 10.1016/j.specom.2025.103301
Ratree Wayland, Rachel Meyer, Sophia Vellozzi, Kevin Tang
This study investigated the acquisition of lenition in Spanish voiced stops (/b, d, ɡ/) by native English speakers during a study-abroad program, focusing on individual differences and influencing factors. Lenition, characterized by the weakening of stops into fricative-like ([β], [ð], [ɣ]) or approximant-like ([β̞], [ð̞], [ɣ̞]) forms, poses challenges for L2 learners due to its gradient nature and the absence of analogous approximant forms in English. Results indicated that learners aligned with native speakers in recognizing voicing as the primary cue for lenition, yet their productions diverged, favoring fricative-like over approximant-like realizations. This preference reflects the combined influence of articulatory ease, acoustic salience, and cognitive demands.
Individual variability in learners’ trajectories highlights the role of exposure to native input and sociolinguistic engagement. Learners benefitting from richer, informal interactions with native speakers showed greater alignment with native patterns, while others demonstrated more limited progress. However, native input alone was insufficient for learners to internalize subtler distinctions such as place of articulation and stress. These findings emphasize the need for combining immersive experiences with targeted instructional strategies to address articulatory and cognitive challenges. This study contributes to the understanding of L2 phonological acquisition and offers insights for designing more effective language learning programs to support lenition acquisition in Spanish.
{"title":"Individual differences in language acquisition: The impact of study abroad on native English speakers learning Spanish","authors":"Ratree Wayland , Rachel Meyer , Sophia Vellozzi , Kevin Tang","doi":"10.1016/j.specom.2025.103301","DOIUrl":"10.1016/j.specom.2025.103301","url":null,"abstract":"<div><div>This study investigated the acquisition of lenition in Spanish voiced stops (/b, d, ɡ/) by native English speakers during a study-abroad program, focusing on individual differences and influencing factors. Lenition, characterized by the weakening of stops into fricative-like ([β], [ð], [ɣ]) or approximant-like ([β̞], [ð̞], [ɣ̞]) forms, poses challenges for L2 learners due to its gradient nature and the absence of analogous approximant forms in English. Results indicated that learners aligned with native speakers in recognizing voicing as the primary cue for lenition, yet their productions diverged, favoring fricative-like over approximant-like realizations. This preference reflects the combined influence of articulatory ease, acoustic salience, and cognitive demands.</div><div>Individual variability in learners’ trajectories highlights the role of exposure to native input and sociolinguistic engagement. Learners benefitting from richer, informal interactions with native speakers showed greater alignment with native patterns, while others demonstrated more limited progress. However, native input alone was insufficient for learners to internalize subtler distinctions such as place of articulation and stress. These findings emphasize the need for combining immersive experiences with targeted instructional strategies to address articulatory and cognitive challenges. This study contributes to the understanding of L2 phonological acquisition and offers insights for designing more effective language learning programs to support lenition acquisition in Spanish.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"174 ","pages":"Article 103301"},"PeriodicalIF":3.0,"publicationDate":"2025-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145049421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}