A model of early word acquisition based on realistic-scale audiovisual naming events
Khazar Khorrami, Okko Räsänen
Pub Date: 2025-02-01. DOI: 10.1016/j.specom.2024.103169
Infants gradually learn to parse continuous speech into words and to connect names with objects, yet the mechanisms behind the development of early word perception skills remain unknown. We studied the extent to which early words can be acquired through statistical learning from regularities in audiovisual sensory input. We simulated word learning in infants up to 12 months of age in a realistic setting, using a model that learns solely from statistical regularities in unannotated raw speech and pixel-level visual input. Crucially, the quantity of object naming events was carefully designed to match that available to infants of comparable age. The results show that the model effectively learns to recognize words and associate them with the corresponding visual objects, with a vocabulary growth rate comparable to that observed in infants. The findings support the viability of general statistical learning for early word perception and demonstrate how learning can operate without assuming any prior linguistic capabilities.
{"title":"A model of early word acquisition based on realistic-scale audiovisual naming events","authors":"Khazar Khorrami, Okko Räsänen","doi":"10.1016/j.specom.2024.103169","DOIUrl":"10.1016/j.specom.2024.103169","url":null,"abstract":"<div><div>Infants gradually learn to parse continuous speech into words and connect names with objects, yet the mechanisms behind development of early word perception skills remain unknown. We studied the extent to which early words can be acquired through statistical learning from regularities in audiovisual sensory input. We simulated word learning in infants up to 12 months of age in a realistic setting, using a model that solely learns from statistical regularities in unannotated raw speech and pixel-level visual input. Crucially, the quantity of object naming events was carefully designed to match that accessible to infants of comparable ages. Results show that the model effectively learns to recognize words and associate them with corresponding visual objects, with a vocabulary growth rate comparable to that observed in infants. The findings support the viability of general statistical learning for early word perception, demonstrating how learning can operate without assuming any prior linguistic capabilities.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"167 ","pages":"Article 103169"},"PeriodicalIF":2.4,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143128396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HC-APNet: Harmonic Compensation Auditory Perception Network for low-complexity speech enhancement
Nan Li, Meng Ge, Longbiao Wang, Yang-Hao Zhou, Jianwu Dang
Pub Date: 2025-02-01. DOI: 10.1016/j.specom.2024.103161
Speech enhancement is critical for improving speech quality and intelligibility in a variety of noisy environments. While neural-network-based methods have shown promising results in speech enhancement, they often suffer performance degradation in scenarios with limited computational resources. This paper presents HC-APNet (Harmonic Compensation Auditory Perception Network), a novel lightweight approach tailored to exploit the perceptual capabilities of the human auditory system for efficient and effective speech enhancement, with a focus on harmonic compensation. Inspired by human auditory reception mechanisms, we first segment audio into subbands using an auditory filterbank. The use of subbands reduces the number of parameters and the computational load, while the auditory filterbank preserves enhancement quality. In addition, inspired by human perception of auditory context, we develop an auditory perception network to capture gain information for the different subbands. Furthermore, because subband processing applies gains only to the spectral envelope, which may introduce harmonic distortion, we design a learnable multi-subband comb filter inspired by human pitch perception to mitigate this distortion. Finally, the proposed HC-APNet model achieves competitive performance on speech quality evaluation metrics with significantly fewer computational and parameter resources than existing methods on the VCTK + DEMAND and DNS Challenge datasets.
{"title":"HC-APNet: Harmonic Compensation Auditory Perception Network for low-complexity speech enhancement","authors":"Nan Li , Meng Ge , Longbiao Wang , Yang-Hao Zhou , Jianwu Dang","doi":"10.1016/j.specom.2024.103161","DOIUrl":"10.1016/j.specom.2024.103161","url":null,"abstract":"<div><div>Speech enhancement is critical for improving speech quality and intelligibility in a variety of noisy environments. While neural network-based methods have shown promising results in speech enhancement, they often suffer from performance degradation in scenarios with limited computational resources. This paper presents HC-APNet (Harmonic Compensation Auditory Perception Network), a novel lightweight approach tailored to exploit the perceptual capabilities of the human auditory system for efficient and effective speech enhancement, with a focus on harmonic compensation. Inspired by human auditory reception mechanisms, we first segment audio into subbands using an auditory filterbank for speech enhancement. The use of subbands helps to reduce the number of parameters and the computational load, while the use of an auditory filterbank effectively preserves high-quality speech enhancement. In addition, inspired by the perception of human auditory context, we have developed an auditory perception network to capture gain information for different subbands. Furthermore, considering that subband processing only applies gain to the spectral envelope, which may introduce harmonic distortion, we design a learnable multi-subband comb-filter inspired by human pitch frequency perception to mitigate harmonic distortion. Finally, our proposed HC-APNet model achieves competitive performance on the speech quality evaluation metric with significantly less computational and parameter resources compared to existing methods on the VCTK + DEMAND and DNS Challenge datasets.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"167 ","pages":"Article 103161"},"PeriodicalIF":2.4,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143128498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Effects of voice onset time and place of articulation on perception of dichotic Turkish syllables
Emre Eskicioglu, Serhat Taslica, Cagdas Guducu, Adile Oniz, Murat Ozgoren
Pub Date: 2025-02-01. DOI: 10.1016/j.specom.2024.103170
Dichotic listening has been widely used in research investigating the hemispheric specialization of language. A common finding is the Right-ear Advantage (REA), reflecting the left hemisphere's specialization for speech sound perception. However, acoustic/phonetic features of the stimuli, such as voice onset time (VOT) and place of articulation (POA), are known to affect the REA. This study investigates the effects of these features on the REA in Turkish, a language whose family differs from the languages typically used in previous VOT and POA studies. Data from 95 right-handed participants with an REA, defined as reporting at least one more correct right-ear than left-ear response, were analyzed. Prevoiced consonants were dominant over consonants with long VOT and resulted in an increased REA. Velar consonants were dominant over other consonants, and velar and alveolar consonants produced a higher REA than bilabial consonants. Lateralization and error rates were lower when the POA, but not the VOT, of the consonants differed. Error responses were mostly determined by the VOT of the consonant presented to the right ear. In conclusion, the effects of VOT and POA on hemispheric asymmetry in Turkish were demonstrated with a behavioral approach; further neuroimaging or electrophysiological investigations are needed to validate these effects and shed light on the mechanisms underlying VOT and POA effects during the dichotic listening test.
{"title":"Effects of voice onset time and place of articulation on perception of dichotic Turkish syllables","authors":"Emre Eskicioglu , Serhat Taslica , Cagdas Guducu , Adile Oniz , Murat Ozgoren","doi":"10.1016/j.specom.2024.103170","DOIUrl":"10.1016/j.specom.2024.103170","url":null,"abstract":"<div><div>Dichotic listening has been widely used in research investigating the hemispheric specialization of language. A common finding is the Right-ear Advantage (REA), reflecting left hemisphere speech sound perception specialization. However, acoustic/phonetic features of the stimuli, such as voice onset time (VOT) and place of articulation (POA), are known to affect the REA. This study investigates the effects of these features on the REA in the Turkish language, whose language family differs from the languages typically used in previous VOT and POA studies. Data of 95 right-handed participants with REA, which was defined as reporting at least one more correct right than left ear response, were analyzed. Prevoiced consonants were dominant compared with consonants with long VOT and resulted in increased REA. Velar consonants were dominant compared with other consonants. Velar and alveolar consonants resulted in higher REA than bilabial consonants. Lateralization and error rates were lower when POA, but not VOT, of the consonants differed. Error responses were mostly determined by the VOT feature of the consonant presented to the right ear. To conclude, the effects of VOT and PoA on the hemispheric asymmetry in Turkish have been spotted by a behavioral approach. Further neuroimaging or electrophysiologic investigations are needed to validate and shed light into the underlying mechanisms of VOT and PoA effects during the DL test.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"167 ","pages":"Article 103170"},"PeriodicalIF":2.4,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143128502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Spoken language identification: An overview of past and present research trends
Douglas O'Shaughnessy
Pub Date: 2025-02-01. DOI: 10.1016/j.specom.2024.103167
Identification of the language used in spoken utterances is useful for many applications, e.g., helping to direct or automate telephone calls, or selecting which language-specific speech recognizer to use. This paper reviews modern methods of automatic language identification. It examines what information in speech helps to distinguish among languages, and extends these ideas to dialect estimation as well. As approaches to recognizing languages often share much with both automatic speech recognition and speaker verification, these three processes are compared. Many methods are drawn from pattern recognition research in other areas, such as image and text recognition. The paper notes how speech differs from most other signals to be recognized, and how language identification differs from other speech applications. While it is mainly addressed to readers who are not experts in speech processing (detailed algorithms, readily found in the cited literature, are omitted here), the presentation offers a broad discussion that should also be useful to experts.
{"title":"Spoken language identification: An overview of past and present research trends","authors":"Douglas O'Shaughnessy","doi":"10.1016/j.specom.2024.103167","DOIUrl":"10.1016/j.specom.2024.103167","url":null,"abstract":"<div><div>Identification of the language used in spoken utterances is useful for multiple applications, e.g., assist in directing or automating telephone calls, or selecting which language-specific speech recognizer to use. This paper reviews modern methods of automatic language identification. It examines what information in speech helps to distinguish among languages, and extends these ideas to dialect estimation as well. As approaches to recognize languages often share much in common with both automatic speech recognition and speaker verification, these three processes are compared. Many methods are drawn from pattern recognition research in other areas, such as image and text recognition. This paper notes how speech is different from most other signals to recognize, and how language identification differs from other speech applications. While it is mainly addressed to readers who are not experts in speech processing (as detailed algorithms, readily found in the cited literature, are omitted here), the presentation covers a wide discussion useful to experts too.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"167 ","pages":"Article 103167"},"PeriodicalIF":2.4,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143128499","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Systematic review: The identification of segmental Mandarin-accented English features
Hongzhi Wang, Rachael-Anne Knight, Lucy Dipper, Roy Alderton, Reem S. W. Alyahya
Pub Date: 2025-02-01. DOI: 10.1016/j.specom.2024.103168
Background
The pronunciation of L2 English by L1 Mandarin speakers is influenced by transfer effects from the phonology of Mandarin. However, there is a research gap in systematically synthesizing and reviewing segmental Mandarin-accented English features (SMAEFs) from the existing literature. An accurate and comprehensive description of SMAEFs is necessary for applied science in relevant fields.
Aim
To identify the segmental features that are most consistently described as characteristic of Mandarin-accented English in previous literature.
Methods
A systematic review was conducted. Studies were identified by searching nine databases and applying eight screening criteria.
Results
The systematic review includes nineteen studies with a total of 1,873 L1-Mandarin speakers of English. The included studies yield 45 SMAEFs, classified into Vowel and Consonant categories, each with multiple sub-categories. The results are supported by evidence of varying strength. The four most frequently reported findings, namely 1) variations in vowel height and frontness, 2) schwa epenthesis, 3) variations in closure duration in plosives, and 4) illegal consonant deletion, were identified and analyzed in terms of their potential consequences for intelligibility.
Conclusion
The number of SMAEFs is large. These features occur in numerous traditional phonetic categories and in two categories (i.e., schwa epenthesis and illegal consonant deletion) that are typically used to describe connected speech. The outcomes may provide valuable insights for researchers and practitioners in English Language Teaching, phonetics, and speech recognition system development, both for selecting which pronunciation features to focus on in teaching and research and for supporting the successful identification of accented features.
{"title":"Systematic review: The identification of segmental Mandarin-accented English features","authors":"Hongzhi Wang, Rachael-Anne Knight, Lucy Dipper, Roy Alderton, Reem S․ W․ Alyahya","doi":"10.1016/j.specom.2024.103168","DOIUrl":"10.1016/j.specom.2024.103168","url":null,"abstract":"<div><h3>Background</h3><div>The pronunciation of L2 English by L1 Mandarin speakers is influenced by transfer effects from the phonology of Mandarin. However, there is a research gap in systematically synthesizing and reviewing segmental Mandarin-accented English features (SMAEFs) from the existing literature. An accurate and comprehensive description of SMAEFs is necessary for applied science in relevant fields.</div></div><div><h3>Aim</h3><div>To identify the segmental features that are most consistently described as characteristic of Mandarin-accented English in previous literature.</div></div><div><h3>Methods</h3><div>A systematic review was conducted. The studies were identified through searching in nine databases with eight screening criteria.</div></div><div><h3>Results</h3><div>The systematic review includes nineteen studies with a total of 1,873 Mandarin English speakers. The included studies yield 45 SMAEFs, classified into Vowel and Consonant categories, under which there are multiple sub-categories. The results are supported by evidence of different levels of strength. The four frequently reported findings, which are 1) variations in vowel height and frontness, 2) schwa epenthesis, 3) variations in closure duration in plosives and 4) illegal consonant deletion, were identified and analyzed in terms of their potential intelligibility outcomes.</div></div><div><h3>Conclusion</h3><div>The number of SMAEFs is large. These features occur in numerous traditional phonetic categories and two categories (i.e. schwa epenthesis and illegal consonant deletion) that are typically used to describe features in connected speech. The study outcomes may provide valuable insights for researchers and practitioners in the fields of English Language Teaching, phonetics, and speech recognition system development in terms of selecting the pronunciation features to focus on in teaching and research or supporting the successful identification of accented features.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"167 ","pages":"Article 103168"},"PeriodicalIF":2.4,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143101902","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Self-distillation-based domain exploration for source speaker verification under spoofed speech from unknown voice conversion
Xinlei Ma, Ruiteng Zhang, Jianguo Wei, Xugang Lu, Junhai Xu, Lin Zhang, Wenhuan Lu
Pub Date: 2025-02-01. DOI: 10.1016/j.specom.2024.103153
Advances in voice conversion (VC) technology have made it easier to generate spoofed speech that closely resembles the identity of a target speaker. Meanwhile, speaker verification systems are widely used to identify speakers, and the misuse of VC algorithms poses significant privacy and security risks by potentially deceiving these systems. To address this issue, source speaker verification (SSV) has been proposed to verify the identity of the source speaker behind spoofed speech generated by VC. Nevertheless, SSV often suffers severe performance degradation when confronted with unknown VC algorithms, a problem that is usually neglected by researchers. To handle this cross-voice-conversion scenario and enhance the model's performance on unknown VC methods, we redefine it as a novel domain adaptation task by treating each VC method as a distinct domain. In this context, we propose an unsupervised domain adaptation (UDA) algorithm termed self-distillation-based domain exploration (SDDE). The algorithm adopts a Siamese framework with two branches: one trained on the source (known) domain and the other trained on the target domains (unknown VC methods). The branch trained on the source domain uses supervised learning to capture the source speaker's intrinsic features, while the branch trained on the target domains employs self-distillation to explore target-domain information from multi-scale segments. Additionally, we have constructed a large-scale dataset comprising over 7945 h of spoofed speech to evaluate the proposed SDDE. Experimental results on this dataset demonstrate that SDDE outperforms traditional UDA methods and substantially improves SSV performance under unknown VC scenarios. The code for data generation and the trial lists are available at https://github.com/zrtlemontree/cross-domain-source-speaker-verification.
{"title":"Self-distillation-based domain exploration for source speaker verification under spoofed speech from unknown voice conversion","authors":"Xinlei Ma , Ruiteng Zhang , Jianguo Wei , Xugang Lu , Junhai Xu , Lin Zhang , Wenhuan Lu","doi":"10.1016/j.specom.2024.103153","DOIUrl":"10.1016/j.specom.2024.103153","url":null,"abstract":"<div><div>Advancements in voice conversion (VC) technology have made it easier to generate spoofed speech that closely resembles the identity of a target speaker. Meanwhile, verification systems within the realm of speech processing are widely used to identify speakers. However, the misuse of VC algorithms poses significant privacy and security risks by potentially deceiving these systems. To address this issue, source speaker verification (SSV) has been proposed to verify the source speaker’s identity of the spoofed speech generated by VCs. Nevertheless, SSV often suffers severe performance degradation when confronted with unknown VC algorithms, which is usually neglected by researchers. To deal with this cross-voice-conversion scenario and enhance the model’s performance when facing unknown VC methods, we redefine it as a novel domain adaptation task by treating each VC method as a distinct domain. In this context, we propose an unsupervised domain adaptation (UDA) algorithm termed self-distillation-based domain exploration (SDDE). This algorithm adopts a siamese framework with two branches: one trained on the source (known) domain and the other trained on the target domains (unknown VC methods). The branch trained on the source domain leverages supervised learning to capture the source speaker’s intrinsic features. Meanwhile, the branch trained on the target domain employs self-distillation to explore target domain information from multi-scale segments. Additionally, we have constructed a large-scale data set comprising over 7945 h of spoofed speech to evaluate the proposed SDDE. Experimental results on this data set demonstrate that SDDE outperforms traditional UDAs and substantially enhances the performance of the SSV model under unknown VC scenarios. The code for data generation and the trial lists are available at <span><span>https://github.com/zrtlemontree/cross-domain-source-speaker-verification</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"167 ","pages":"Article 103153"},"PeriodicalIF":2.4,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143128492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improved AED with multi-stage feature extraction and fusion based on RFAConv and PSA
Bingbing Wang, Yangjie Wei, Zhuangzhuang Wang, Zekang Qi
Pub Date: 2025-02-01. DOI: 10.1016/j.specom.2024.103166
End-to-end speech recognition systems based on the Attention-based Encoder-Decoder (AED) model normally achieve high accuracy because they jointly consider the previously generated tokens and the contextual features of the speech signal. However, spatial, positional, and multiscale information is largely neglected during shallow feature extraction, and shallow and deep features are rarely fused effectively. These problems seriously limit the accuracy and speed of speech recognition in real applications. This study proposes a multi-stage feature extraction and fusion method tailored to end-to-end speech recognition systems based on the AED model. First, a receptive-field attention convolution (RFAConv) module is introduced into the front-end feature extraction stage of the AED. This module employs a receptive-field attention mechanism that focuses on the positional and spatial information of the speech signal to strengthen feature extraction. Moreover, a pyramid squeeze attention (PSA) mechanism is incorporated into the encoder to merge shallow and deep features effectively, and feature maps are recalibrated through weight learning to improve the accuracy of the encoder's output features. Finally, the effectiveness and robustness of the method are validated across various end-to-end speech recognition models. The experimental results show that the improved AED models with multi-stage feature extraction and fusion achieve a lower word error rate without a language model, and their transcriptions are more accurate and grammatically precise.
{"title":"Improved AED with multi-stage feature extraction and fusion based on RFAConv and PSA","authors":"Bingbing Wang, Yangjie Wei, Zhuangzhuang Wang, Zekang Qi","doi":"10.1016/j.specom.2024.103166","DOIUrl":"10.1016/j.specom.2024.103166","url":null,"abstract":"<div><div>End-to-end speech recognition systems based on the Attention-based Encoder-Decoder (AED) model normally achieve high accuracy because they concurrently consider the previously generated tokens and contextual features of speech signals. However, the spatial, positional, and multiscale information during shallow feature extraction is mostly neglected, and the shallow and deep features are rarely effectively fused. These problems seriously limit the accuracy and speed of speech recognition in real applications. This study proposes a multi-stage feature extraction and fusion method tailored for end-to-end speech recognition systems based on the AED model. Initially, the receptive-field attention convolutional module is introduced into the front-end feature extraction stage of AED. This module employs a receptive field attention mechanism to enhance the model's feature extraction capability by focusing on the positional and spatial information of speech signals. Moreover, a pyramid squeeze attention mechanism is incorporated into the encoder module to effectively merge the shallow and deep features, and feature maps are recalibrated through weight learning to enhance the accuracy of the encoder's output features. Finally, the effectiveness and robustness of our method are validated across various end-to-end speech recognition models. The experimental results prove that our improved AED speech recognition models with multi-stage feature extraction and fusion achieve a lower word error rate without a language model, and their transcriptions are more accurate and grammatically precise.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"167 ","pages":"Article 103166"},"PeriodicalIF":2.4,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143128501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
One-class network leveraging spectro-temporal features for generalized synthetic speech detection
Jiahong Ye, Diqun Yan, Songyin Fu, Bin Ma, Zhihua Xia
Pub Date: 2025-01-25. DOI: 10.1016/j.specom.2025.103200
Synthetic speech attacks pose significant threats to Automatic Speaker Verification (ASV) systems. To counter these, various detection systems have been developed. However, these models often struggle with reduced accuracy when encountering novel spoofing attacks during testing. To address this issue, this paper proposes a One-Class Network architecture that leverages features extracted from the log power spectrum of the F0 subband. We have developed an advanced spectro-temporal enhancement module, comprising the Temporal Correlation Integrate Module (TCIM) and the Frequency-Adaptive Dependency Module (FADM), to accurately capture F0 subband details. TCIM captures crucial temporal dynamics and models the long-term dependencies characteristic of the F0 signals within the F0 subband. Meanwhile, FADM employs a frequency-adaptive mechanism to identify critical frequency bands, allowing the detection system to conduct a thorough and detailed analysis. Additionally, we introduce a KLOC-Softmax loss function that incorporates the KoLeo regularizer. This function promotes a uniform distribution of features within batches, effectively addressing intra-class imbalance and aiding balanced optimization. Experimental results on the ASVspoof 2019 LA dataset show that our approach achieves an equal error rate (EER) of 0.38% and a minimum tandem detection cost function (min t-DCF) of 0.0127. Our method outperforms most state-of-the-art speech anti-spoofing techniques and demonstrates robust generalizability to previously unseen types of synthetic speech attacks.
{"title":"One-class network leveraging spectro-temporal features for generalized synthetic speech detection","authors":"Jiahong Ye , Diqun Yan , Songyin Fu , Bin Ma , Zhihua Xia","doi":"10.1016/j.specom.2025.103200","DOIUrl":"10.1016/j.specom.2025.103200","url":null,"abstract":"<div><div>Synthetic speech attacks pose significant threats to Automatic Speaker Verification (ASV) systems. To counter these, various detection systems have been developed. However, these models often struggle with reduced accuracy when encountering novel spoofing attacks during testing. To address this issue, this paper proposes a One-Class Network architecture that leverages features extracted from the log power spectrum of the F0 subband. We have developed an advanced spectro-temporal enhancement module, comprising the Temporal Correlation Integrate Module (TCIM) and the Frequency-Adaptive Dependency Module (FADM), to accurately capture F0 subband details. TCIM captures crucial temporal dynamics and models the long-term dependencies characteristic of the F0 signals within the F0 subband. Meanwhile, FADM employs a frequency-adaptive mechanism to identify critical frequency bands, allowing the detection system to conduct a thorough and detailed analysis. Additionally, we introduce a KLOC-Softmax loss function that incorporates the KoLeo regularizer. This function promotes a uniform distribution of features within batches, effectively addressing intra-class imbalance and aiding balanced optimization. Experimental results on the ASVspoof 2019 LA dataset show that our approach achieves an equal error rate (EER) of 0.38% and a minimum tandem detection cost function (min t-DCF) of 0.0127. Our method outperforms most state-of-the-art speech anti-spoofing techniques and demonstrates robust generalizability to previously unseen types of synthetic speech attacks.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"169 ","pages":"Article 103200"},"PeriodicalIF":2.4,"publicationDate":"2025-01-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143103333","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Effects of harmonicity on Mandarin speech perception in cochlear implant users
Mingyue Shi, Qinglin Meng, Huali Zhou, Jiawen Li, Yefei Mo, Nengheng Zheng
Pub Date: 2025-01-24. DOI: 10.1016/j.specom.2025.103199
Previous research has demonstrated the negligible impact of harmonicity on English speech perception for normal-hearing (NH) listeners in quiet environments. This study aims to bridge the gap in understanding the role of harmonicity in Mandarin speech perception for cochlear implant (CI) users. Speech perception in quiet was tested in both a CI-simulation group and an actual CI user group using harmonic and inharmonic Mandarin speech. Furthermore, speech-on-speech perception was tested in NH, CI-simulation, and actual CI user groups. For speech perception in quiet, the results show that, compared to harmonic speech, inharmonic speech decreased the mean recognition rate for both the actual CI user and CI-simulation groups by about 10 percentage points. For speech-on-speech perception, all groups (i.e., NH, CI simulation, and actual CI users) performed worse with inharmonic stimuli than with harmonic stimuli. The findings of this study, together with previous studies of NH listeners, indicate that harmonicity aids target speech recognition for NH listeners in speech-on-speech conditions but not in quiet. In contrast, harmonicity plays an important role in CI users' Mandarin speech recognition in both quiet and speech-on-speech conditions. However, under speech-on-speech conditions, CI users could only understand target speech at positive SNRs (often > 5 dB), suggesting that their performance depends on the intelligibility of the target speech. The contribution of harmonicity to masking release in CI users remains unclear.
{"title":"Effects of harmonicity on Mandarin speech perception in cochlear implant users","authors":"Mingyue Shi , Qinglin Meng , Huali Zhou , Jiawen Li , Yefei Mo , Nengheng Zheng","doi":"10.1016/j.specom.2025.103199","DOIUrl":"10.1016/j.specom.2025.103199","url":null,"abstract":"<div><div>Previous research has demonstrated the negligible impact of harmonicity on English speech perception for normal hearing (NH) listeners in quiet environments. This study aims to bridge the gap in understanding the role of harmonicity in Mandarin speech perception for cochlear implant (CI) users. Speech perception in quiet was tested in both CI simulation group and actual CI user group using harmonic and inharmonic Mandarin speech. Furthermore, speech-on-speech perception was tested in NH, CI simulation, and actual CI user groups. For speech perception in quiet, results show that, compared to harmonic speech, inharmonic speech decreased the mean recognition rate for both actual CI user and CI simulation groups by about 10 percentage points. For speech-on-speech perception, all groups (i.e., NH, CI simulation, and actual CI user) performed worse with inharmonic stimuli compared to harmonic stimuli. The findings of this study, along with previous studies in NH listeners, indicate that harmonicity aids target speech recognition for NH listeners in speech-on-speech conditions but not speech perception in quiet. In contrast, harmonicity plays an important role in CI users’ Mandarin speech recognition in both quiet and speech-on-speech conditions. However, under speech-on-speech conditions, CI users could only understand target speech at positive SNRs (often <span><math><mo>></mo></math></span> 5 dB), suggesting that their performance depends on the intelligibility of the target speech. The contribution of harmonicity to masking release in CI users remains unclear.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"169 ","pages":"Article 103199"},"PeriodicalIF":2.4,"publicationDate":"2025-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143103334","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Coordination Attention based Transformers with bidirectional contrastive loss for multimodal speech emotion recognition
Weiquan Fan, Xiangmin Xu, Guohua Zhou, Xiaofang Deng, Xiaofen Xing
Pub Date: 2025-01-23. DOI: 10.1016/j.specom.2025.103198
Emotion recognition is crucial for improving the human–computer interaction experience. Attention mechanisms have become a mainstream technique because of their excellent ability to capture emotion representations. Existing algorithms often employ self-attention and cross-attention for multimodal interaction, artificially fixing specific attention patterns at specific layers of the model. However, it is unclear which attention mechanism matters more in different layers. In this paper, we propose Coordination Attention based Transformers (CAT). Based on a dual-attention paradigm, CAT dynamically infers the pass rates of self-attention and cross-attention layer by layer, coordinating the importance of intra-modal and inter-modal factors. Furthermore, we propose a bidirectional contrastive loss that clusters matching pairs between modalities and pushes mismatching pairs farther apart. Experiments demonstrate the effectiveness of our method, and state-of-the-art performance is achieved under the same experimental conditions.
{"title":"Coordination Attention based Transformers with bidirectional contrastive loss for multimodal speech emotion recognition","authors":"Weiquan Fan , Xiangmin Xu , Guohua Zhou , Xiaofang Deng , Xiaofen Xing","doi":"10.1016/j.specom.2025.103198","DOIUrl":"10.1016/j.specom.2025.103198","url":null,"abstract":"<div><div>Emotion recognition is crucial to improve the human–computer interaction experience. Attention mechanisms have become a mainstream technique due to their excellent ability to capture emotion representations. Existing algorithms often employ self-attention and cross-attention for multimodal interactions, which artificially set specific attention patterns at specific layers of the model. However, it is uncertain which attention mechanism is more important in different layers of the model. In this paper, we propose a Coordination Attention based Transformers (CAT). Based on the dual attention paradigm, CAT dynamically infers the pass rates of self-attention and cross-attention layer by layer, coordinating the importance of intra-modal and inter-modal factors. Further, we propose a bidirectional contrastive loss to cluster the matching pairs between modalities and push the mismatching pairs farther apart. Experiments demonstrate the effectiveness of our method, and the state-of-the-art performance is achieved under the same experimental conditions.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"169 ","pages":"Article 103198"},"PeriodicalIF":2.4,"publicationDate":"2025-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143164900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}