
Latest Publications in Speech Communication

MFFN: Multi-level Feature Fusion Network for monaural speech separation
IF 2.4 | CAS Tier 3, Computer Science | Q2 (ACOUSTICS) | Pub Date: 2025-04-01 | DOI: 10.1016/j.specom.2025.103229
Jianjun Lei, Yun He, Ying Wang
Monaural speech separation based on dual-path networks has recently been widely developed, owing to these networks' outstanding ability to process long feature sequences. However, such methods often rely on a fixed receptive field during feature learning, which can hardly capture feature information at different scales and thus restricts model performance. This paper proposes a novel Multi-level Feature Fusion Network (MFFN) that strengthens dual-path networks for monaural speech separation by capturing multi-scale information. The MFFN integrates information of different scales from long sequences via a multi-scale sampling strategy and employs Squeeze-and-Excitation blocks in parallel to extract features along the channel and temporal dimensions. Moreover, we introduce a collaborative attention mechanism to fuse feature information across different levels, further improving the model's representation capability. Finally, we conduct extensive experiments on the noise-free datasets WSJ0-2mix and Libri2mix and the noisy datasets WHAM! and WHAMR!. The results demonstrate that our MFFN outperforms some current methods without using data augmentation techniques.
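As a rough illustration of the parallel Squeeze-and-Excitation idea described above, the sketch below gates a (batch, channel, frame) feature tensor along the channel and temporal dimensions in parallel; the layer sizes, the pointwise-convolution temporal gate, and the additive fusion are illustrative assumptions, not the MFFN's actual design.

```python
import torch
import torch.nn as nn

class ParallelSEBlock(nn.Module):
    """Illustrative parallel squeeze-and-excitation over channel and time axes.

    Input shape: (batch, channels, frames). This only sketches the general idea
    of gating both dimensions in parallel; the paper's exact design may differ.
    """
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Channel gate: squeeze over time, excite over channels.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),                  # (B, C, 1)
            nn.Conv1d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # Temporal gate: squeeze over channels, excite over frames.
        self.temporal_gate = nn.Sequential(
            nn.Conv1d(channels, 1, 1),                # (B, 1, T)
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Apply both gates to the same input and fuse the results additively.
        return x * self.channel_gate(x) + x * self.temporal_gate(x)

x = torch.randn(2, 64, 200)        # batch of 2, 64 channels, 200 frames
y = ParallelSEBlock(64)(x)         # same shape as x
```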
{"title":"MFFN: Multi-level Feature Fusion Network for monaural speech separation","authors":"Jianjun Lei,&nbsp;Yun He,&nbsp;Ying Wang","doi":"10.1016/j.specom.2025.103229","DOIUrl":"10.1016/j.specom.2025.103229","url":null,"abstract":"<div><div>Monaural speech separation based on Dual-path networks has recently been widely developed due to their outstanding processing ability for long feature sequences. However, these methods often exploit a fixed receptive field during feature learning, which hardly captures feature information at different scales and thus restricts the model’s performance. This paper proposes a novel Multi-level Feature Fusion Network (<em>MFFN</em>) to facilitate dual-path networks for monaural speech separation by capturing multi-scale information. The <em>MFFN</em> integrates information of different scales from long sequences by using a multi-scale sampling strategy and employs Squeeze-and-Excitation blocks in parallel to extract features along the channel and temporal dimensions. Moreover, we introduce a collaborative attention mechanism to fuse feature information across different levels, further improving the model’s representation capability. Finally, we conduct extensive experiments on noise-free datasets, WSJ0-2mix and Libri2mix, and the noisy datasets, WHAM! and WHAMR!. The results demonstrate that our <em>MFFN</em> outperforms some current methods without using data augmentation technologies.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"171 ","pages":"Article 103229"},"PeriodicalIF":2.4,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143759477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Exploiting Locality Sensitive Hashing - Clustering and gloss feature for sign language production
IF 2.4 | CAS Tier 3, Computer Science | Q2 (ACOUSTICS) | Pub Date: 2025-03-28 | DOI: 10.1016/j.specom.2025.103227
Hu Jin , Shujun Zhang , Zilong Yang , Qi Han , Jianping Cao
Automatic Sign Language Production (SLP), which converts spoken-language sentences into continuous sign pose sequences, is crucial for digital interactive sign language applications. Long text-sequence inputs make current deep learning-based SLP models inefficient and prevent them from fully exploiting the intricate information conveyed by sign language, so the generated skeleton pose sequences may not be well comprehensible or acceptable to individuals with hearing impairments. In this paper, we propose a sign language production method that uses Locality Sensitive Hashing-Clustering to automatically aggregate similar and identical embedded word vectors and capture long-distance dependencies, thereby improving the accuracy of SLP. A multi-scale feature extraction network is also designed to extract local gloss features and combine them with embedded text vectors to enrich the textual information. Extensive experimental results on the challenging RWTH-PHOENIX-Weather 2014T (PHOENIX14T) dataset show that our model outperforms the baseline method.
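The Locality Sensitive Hashing-Clustering step can be pictured with a generic random-hyperplane LSH that buckets similar word embeddings, as in the sketch below; the hashing scheme, number of hyperplanes, and aggregation by bucket are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def lsh_buckets(vectors: np.ndarray, n_planes: int = 12, seed: int = 0):
    """Group similar embedding vectors with random-hyperplane LSH.

    Vectors whose signs agree on every random hyperplane share a hash code and
    fall into the same bucket, so near-duplicate word embeddings can be
    aggregated cheaply. This is a generic sketch, not the paper's exact scheme.
    """
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((vectors.shape[1], n_planes))
    bits = (vectors @ planes) > 0                  # (n_vectors, n_planes) sign bits
    codes = bits.dot(1 << np.arange(n_planes))     # pack bits into integer hash codes
    buckets = {}
    for idx, code in enumerate(codes):
        buckets.setdefault(int(code), []).append(idx)
    return buckets

emb = np.random.randn(1000, 300)                   # e.g. 300-d embedded word vectors
groups = lsh_buckets(emb)                          # bucket id -> list of vector indices
```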
{"title":"Exploiting Locality Sensitive Hashing - Clustering and gloss feature for sign language production","authors":"Hu Jin ,&nbsp;Shujun Zhang ,&nbsp;Zilong Yang ,&nbsp;Qi Han ,&nbsp;Jianping Cao","doi":"10.1016/j.specom.2025.103227","DOIUrl":"10.1016/j.specom.2025.103227","url":null,"abstract":"<div><div>The automatic Sign Language Production (SLP), which converts spoken language sentences into continuous sign pose sequences, is crucial for the digital interactive application of sign language. Long text sequence inputs make current deep learning-based SLP models inefficient and unable to fully take advantage of the intricate information conveyed by sign language, resulting in the fact that the generated skeleton pose sequences may not be well comprehensible or acceptable to individuals with hearing impairments. In this paper, we propose a sign language production method that utilizes Locality Sensitive Hashing-Clustering to automatically aggregate the similar and identical embedded word vectors, capture long-distance dependencies, thereby enhance the accuracy of SLP. And a multi-scale feature extraction network is designed to extract local feature of gloss and combine it with embedded text vectors to enhance text in-formation. Extensive experimental results on the challenging RWTH-PHOENIX-Weather 2014T (PHOENIX14T) dataset show that our model outperforms the baseline method.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"171 ","pages":"Article 103227"},"PeriodicalIF":2.4,"publicationDate":"2025-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143748703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Speech emotion recognition using energy based adaptive mode selection
IF 2.4 | CAS Tier 3, Computer Science | Q2 (ACOUSTICS) | Pub Date: 2025-03-22 | DOI: 10.1016/j.specom.2025.103228
Ravi, Sachin Taran
In this framework, a speech emotion recognition approach is presented that relies on Variational Mode Decomposition (VMD) and adaptive mode selection using energy information. Instead of directly analyzing speech signals, this work focuses on preprocessing the raw speech signals. Initially, a given speech signal is decomposed using VMD, and the energy of each mode is calculated. Based on this energy estimation, the dominant modes are selected for signal reconstruction. VMD combined with energy estimation improves the predictability of the reconstructed speech signal, as demonstrated using root-mean-square and spectral entropy measures. The reconstructed signal is divided into frames, from which prosodic and spectral features are calculated. Following feature extraction, the ReliefF algorithm is used for feature optimization. The resulting feature set is used to train a fine K-nearest neighbor classifier for emotion identification. The proposed framework was tested on publicly available acted and elicited datasets. For the acted datasets, it achieved 93.8 %, 95.8 %, and 93.4 % accuracy on the RAVDESS-speech, Emo-DB, and EMOVO datasets, which are based on different languages. Furthermore, the proposed method has proven robust across three languages (English, German, and Italian), with language sensitivity as low as 2.4 % compared to existing methods. For the elicited IEMOCAP dataset, the framework achieved the highest accuracy of 83.1 % compared to the existing state of the art.
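A minimal sketch of the energy-based mode selection step is given below, assuming the modes have already been produced by some VMD implementation (for example the vmdpy package); the 90 % energy-retention threshold is an illustrative assumption, not a value taken from the paper.

```python
import numpy as np

def select_dominant_modes(modes: np.ndarray, energy_ratio: float = 0.9):
    """Pick the most energetic modes and reconstruct the signal from them.

    `modes` is a (K, n_samples) array of per-mode signals, e.g. from a VMD
    decomposition. Modes are kept, highest energy first, until the retained
    energy reaches `energy_ratio` of the total energy.
    """
    energies = np.sum(modes ** 2, axis=1)             # energy of each mode
    order = np.argsort(energies)[::-1]                # most energetic first
    cumulative = np.cumsum(energies[order]) / energies.sum()
    keep = order[: np.searchsorted(cumulative, energy_ratio) + 1]
    reconstructed = modes[keep].sum(axis=0)
    return reconstructed, keep

# Synthetic "modes" for illustration; in practice these come from VMD of a speech signal.
modes = np.random.randn(6, 16000) * np.array([5, 3, 1, .5, .2, .1])[:, None]
signal_hat, kept_modes = select_dominant_modes(modes)
```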
{"title":"Speech emotion recognition using energy based adaptive mode selection","authors":"Ravi,&nbsp;Sachin Taran","doi":"10.1016/j.specom.2025.103228","DOIUrl":"10.1016/j.specom.2025.103228","url":null,"abstract":"<div><div>In this framework, a speech emotion recognition approach is presented, relying on Variational Mode Decomposition (VMD) and adaptive mode selection utilizing energy information. Instead of directly analyzing speech signals this work is focused on the preprocessing of raw speech signals. Initially, a given speech signal is decomposed using VMD and then the energy of each mode is calculated. Based on energy estimation, the dominant modes are selected for signal reconstruction. VMD combined with energy estimation improves the predictability of the reconstructed speech signal. The improvement in predictability is demonstrated using root mean square and spectral entropy measures. The reconstructed signal is divided into frames, and prosodic and spectral features are then calculated. Following feature extraction, ReliefF algorithm is utilized for the feature optimization. The resultant feature set is utilized to train the fine K- nearest neighbor classifier for emotion identification. The proposed framework was tested on publicly available acted and elicited datasets. For the acted datasets, the proposed framework achieved 93.8 %, 95.8 %, and 93.4 % accuracy on different language-based RAVDESS-speech, Emo-DB, and EMOVO datasets. Furthermore, the proposed method has also proven to be robust across three languages: English, German, and Italian, with language sensitivity as low as 2.4 % compared to existing methods. For the elicited dataset IEMOCAP, the proposed framework achieved the highest accuracy of 83.1 % compared to the existing state of the art.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"171 ","pages":"Article 103228"},"PeriodicalIF":2.4,"publicationDate":"2025-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143734732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Impacts of telecommunications latency on the timing of speaker transitions
IF 2.4 | CAS Tier 3, Computer Science | Q2 (ACOUSTICS) | Pub Date: 2025-03-18 | DOI: 10.1016/j.specom.2025.103226
David W. Edwards
Transitions from speaker to speaker occur very rapidly in conversations. However, common telecommunication systems, from landline telephones to online video conferencing, introduce latency into the turn-taking process. To measure what impacts latency has on conversation, this study examines 61 audio-only conversations in which latency was introduced partway through the call and removed several minutes later. It finds an increase in overlap proportional to latency. The results also indicate that speakers increase transition times in response to latency and that this increase persists even after latency is removed. Participants made these behavioral changes despite not recognizing the presence of latency during the call.
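For readers unfamiliar with turn-taking measurements, the snippet below computes floor-transfer offsets (negative values are overlaps, positive values are gaps) from a list of diarized turns; it is a generic illustration of the quantities discussed above, not the study's analysis pipeline.

```python
def transition_offsets(turns):
    """Floor-transfer offsets between consecutive turns of different speakers.

    `turns` is a list of (speaker, start_s, end_s) tuples sorted by start time.
    A negative offset means the next speaker started before the previous turn
    ended (overlap); a positive offset is a silent gap.
    """
    offsets = []
    for (spk_a, _, end_a), (spk_b, start_b, _) in zip(turns, turns[1:]):
        if spk_a != spk_b:
            offsets.append(start_b - end_a)
    return offsets

turns = [("A", 0.0, 2.1), ("B", 2.4, 5.0), ("A", 4.8, 7.3)]
print(transition_offsets(turns))   # gap of 0.3 s, then 0.2 s of overlap
```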
{"title":"Impacts of telecommunications latency on the timing of speaker transitions","authors":"David W. Edwards","doi":"10.1016/j.specom.2025.103226","DOIUrl":"10.1016/j.specom.2025.103226","url":null,"abstract":"<div><div>Transitions from speaker to speaker occur very rapidly in conversations. However, common telecommunication systems, from landline telephones to online video conferencing, introduce latency into the turn-taking process. To measure what impacts latency has on conversation, this study examines 61 audio-only conversations in which latency was introduced partway through the call and removed several minutes later. It finds an increase in overlap proportional to latency. The results also indicate that speakers increase transition times in response to latency and that this increase persists even after latency is removed. Participants made these behavioral changes despite not recognizing the presence of latency during the call.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"171 ","pages":"Article 103226"},"PeriodicalIF":2.4,"publicationDate":"2025-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143715788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
LLM-based speaker diarization correction: A generalizable approach
IF 2.4 | CAS Tier 3, Computer Science | Q2 (ACOUSTICS) | Pub Date: 2025-03-13 | DOI: 10.1016/j.specom.2025.103224
Georgios Efstathiadis , Vijay Yadav , Anzar Abbas
Speaker diarization is necessary for interpreting conversations transcribed using automated speech recognition (ASR) tools. Despite significant developments in diarization methods, diarization accuracy remains an issue. Here, we investigate the use of large language models (LLMs) for diarization correction as a post-processing step. LLMs were fine-tuned using the Fisher corpus, a large dataset of transcribed conversations. The ability of the models to improve diarization accuracy in a holdout dataset from the Fisher corpus as well as an independent dataset was measured. We report that fine-tuned LLMs can markedly improve diarization accuracy. However, model performance is constrained to transcripts produced using the same ASR tool as the transcripts used for fine-tuning, limiting generalizability. To address this constraint, an ensemble model was developed by combining weights from three separate models, each fine-tuned using transcripts from a different ASR tool. The ensemble model demonstrated better overall performance than each of the ASR-specific models, suggesting that a generalizable and ASR-agnostic approach may be achievable. We have made the weights of these models publicly available on HuggingFace at https://huggingface.co/bklynhlth.
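The ensemble described above combines the weights of three ASR-specific fine-tuned models; one simple way to do this is uniform weight averaging of checkpoints, sketched below with hypothetical file names (the paper's exact combination scheme is not specified here).

```python
import torch

def average_checkpoints(paths, out_path):
    """Uniformly average the weights of several fine-tuned checkpoints.

    A minimal sketch of ensembling by weight averaging. It assumes the
    checkpoints are plain state dicts from an identical architecture with
    floating-point parameters (e.g. three LLMs fine-tuned on transcripts
    from different ASR tools).
    """
    states = [torch.load(p, map_location="cpu") for p in paths]
    averaged = {k: sum(s[k] for s in states) / len(states) for k in states[0]}
    torch.save(averaged, out_path)

# Hypothetical usage:
# average_checkpoints(
#     ["llm_asr_tool_a.pt", "llm_asr_tool_b.pt", "llm_asr_tool_c.pt"],
#     "llm_ensemble.pt",
# )
```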
{"title":"LLM-based speaker diarization correction: A generalizable approach","authors":"Georgios Efstathiadis ,&nbsp;Vijay Yadav ,&nbsp;Anzar Abbas","doi":"10.1016/j.specom.2025.103224","DOIUrl":"10.1016/j.specom.2025.103224","url":null,"abstract":"<div><div>Speaker diarization is necessary for interpreting conversations transcribed using automated speech recognition (ASR) tools. Despite significant developments in diarization methods, diarization accuracy remains an issue. Here, we investigate the use of large language models (LLMs) for diarization correction as a post-processing step. LLMs were fine-tuned using the Fisher corpus, a large dataset of transcribed conversations. The ability of the models to improve diarization accuracy in a holdout dataset from the Fisher corpus as well as an independent dataset was measured. We report that fine-tuned LLMs can markedly improve diarization accuracy. However, model performance is constrained to transcripts produced using the same ASR tool as the transcripts used for fine-tuning, limiting generalizability. To address this constraint, an ensemble model was developed by combining weights from three separate models, each fine-tuned using transcripts from a different ASR tool. The ensemble model demonstrated better overall performance than each of the ASR-specific models, suggesting that a generalizable and ASR-agnostic approach may be achievable. We have made the weights of these models publicly available on HuggingFace at <span><span>https://huggingface.co/bklynhlth</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"170 ","pages":"Article 103224"},"PeriodicalIF":2.4,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143631920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Enhancing bone-conducted speech with spectrum similarity metric in adversarial learning
IF 2.4 | CAS Tier 3, Computer Science | Q2 (ACOUSTICS) | Pub Date: 2025-03-13 | DOI: 10.1016/j.specom.2025.103223
Yan Pan , Jian Zhou , Huabin Wang , Wenming Zheng , Liang Tao , Hon Keung Kwan
Although bone-conducted (BC) speech has the advantage of being insusceptible to background noise, its transmission path through bone tissue entails not only severe attenuation of high-frequency components but also speech distortion and the loss of unvoiced speech, resulting in substantial degradation of both speech quality and intelligibility. Existing BC speech enhancement methods focus mainly on restoring high-frequency components but overlook the restoration of missing unvoiced speech and the mitigation of speech distortion, leaving a noticeable gap in speech quality and intelligibility compared with air-conducted (AC) speech. In this paper, a spectrum-similarity-metric-based adversarial learning method is proposed for bone-conducted speech enhancement. The acoustic features corresponding to the source excitation and the filter response are disentangled using the WORLD vocoder and mapped to their AC speech counterparts with logarithmic Gaussian normalization and a vocal tract converter, respectively. To reconstruct unvoiced speech from BC speech and reduce its nonlinear distortion, the vocal tract converter predicts low-dimensional Mel-cepstral coefficients of AC speech using a generator supervised by a classification discriminator and a spectrum similarity discriminator. While the classification discriminator distinguishes authentic AC speech from enhanced BC speech, the spectrum similarity discriminator evaluates the spectral similarity between enhanced BC speech and its AC counterpart. To do so, the correlation of time-frequency units over long spectral durations is captured by a self-attention layer embedded in the spectrum similarity discriminator. Experimental results on several speech datasets show that the proposed method restores unvoiced speech segments and diminishes speech distortion, predicting an accurate fine-grained AC spectrum and thus yielding significant improvements in speech quality and intelligibility.
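The logarithmic Gaussian normalization mentioned above is commonly implemented as a log-domain mean-variance transformation of F0 between source and target statistics; a minimal sketch of that standard transformation follows (the statistics used in the example are made-up values, not figures from the paper).

```python
import numpy as np

def convert_f0_log_gaussian(f0_src, mean_src, std_src, mean_tgt, std_tgt):
    """Map source F0 to the target speaker's log-F0 statistics.

    Standard log-Gaussian normalization: standardize log F0 under the source
    statistics and rescale with the target statistics. Unvoiced frames
    (F0 = 0) are left untouched. The means and standard deviations are
    computed over log F0 of voiced frames.
    """
    f0_out = np.zeros_like(f0_src)
    voiced = f0_src > 0
    log_f0 = np.log(f0_src[voiced])
    f0_out[voiced] = np.exp((log_f0 - mean_src) / std_src * std_tgt + mean_tgt)
    return f0_out

f0_bc = np.array([0.0, 110.0, 115.0, 0.0, 120.0])          # toy BC F0 contour (Hz)
f0_converted = convert_f0_log_gaussian(f0_bc, np.log(110), 0.15, np.log(180), 0.2)
```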
{"title":"Enhancing bone-conducted speech with spectrum similarity metric in adversarial learning","authors":"Yan Pan ,&nbsp;Jian Zhou ,&nbsp;Huabin Wang ,&nbsp;Wenming Zheng ,&nbsp;Liang Tao ,&nbsp;Hon Keung Kwan","doi":"10.1016/j.specom.2025.103223","DOIUrl":"10.1016/j.specom.2025.103223","url":null,"abstract":"<div><div>Although bone-conducted (BC) speech offers the advantage of being insusceptible to background noise, its transmission path through bone tissue entails not only serious attenuation of high-frequency components but also speech distortion and the loss of unvoiced speech, resulting in a substantial degradation in both speech quality and intelligibility. Existing BC speech enhancement methods focus mainly on approaching high-frequency component restoration but overlook the restoration of missing unvoiced speech and the mitigation of speech distortion, resulting in a noticeable gap in speech quality and intelligibility compared to air-conducted (AC) speech. In this paper, a spectrum-similarity metric based adversarial learning method is proposed for bone-conducted speech enhancement. The acoustic features corresponding to source-excitation and filter-response are disentangled using the WORLD vocoder and mapped to its AC speech counterparts with logarithmic Gaussian normalization and a vocal tract converter, respectively. To reconstruct unvoiced speech from BC speech and decrease the nonlinear speech distortion in BC speech, the vocal tract converter predicts low-dimensional Mel-cepstral coefficients of AC speech using a generator which is supervised by a classification discriminator and a spectrum similarity discriminator. While the classification discriminator is used to distinguish between authentic AC speech and enhanced BC speech, the spectrum similarity discriminator is designed to evaluate the spectrum similarity between enhanced BC speech and its AC counterpart. To evaluate spectrum similarity, the correlation of time–frequency units in spectrum of long duration is captured within the self-attention layer embedded in the spectrum similarity discriminator. Experimental results on various speech datasets show that the proposed method is capable of restoring unvoiced speech segment and diminishing speech distortion, resulting in predicting accurate fine-grained AC spectrum and thus significant improvement in terms of speech quality and speech intelligibility.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"170 ","pages":"Article 103223"},"PeriodicalIF":2.4,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143679396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Prosody recognition in Persian poetry
IF 2.4 | CAS Tier 3, Computer Science | Q2 (ACOUSTICS) | Pub Date: 2025-03-10 | DOI: 10.1016/j.specom.2025.103222
Mohammadreza Shahrestani, Mostafa Haghir Chehreghani
Classical Persian poetry, like traditional poetry from other cultures, follows set metrical patterns known as prosody. Recognizing the prosody of a given poem is very useful for understanding and analyzing Persian language and literature. With advances in artificial intelligence (AI), AI techniques have become popular for recognizing prosody. However, the application of advanced AI methodologies to detecting prosody in Persian poetry remains under-explored. Another challenge is the lack of an extensive collection of traditional Persian poems, each meticulously annotated with its prosodic pattern. In this paper, we first create a large dataset of prosodic meters comprising about 1.3 million couplets with detailed prosodic annotations. We then introduce five models that harness advanced deep learning methodologies to discern the prosody of Persian poetry: (i) a transformer-based classifier, (ii) a grapheme-to-phoneme mapping-based method, (iii) a sequence-to-sequence model, (iv) a sequence-to-sequence model with phonemic sequences, and (v) a hybrid approach that leverages both the textual information of the poetry and its phonemic sequence. Our experimental results reveal that the hybrid model typically outperforms the other models, especially when applied to large samples of the created dataset. Our code is publicly available at https://github.com/m-shahrestani/Prosody-Recognition-in-Persian-Poetry/.
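To make the hybrid approach (v) concrete, the toy classifier below encodes a couplet's text tokens and phoneme tokens separately, concatenates the pooled representations, and predicts a prosodic meter; the vocabulary sizes, GRU encoders, and fusion by concatenation are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class HybridProsodyClassifier(nn.Module):
    """Toy hybrid classifier fusing text and phoneme representations.

    Each modality is encoded separately, mean-pooled, concatenated, and
    classified into one of `n_meters` prosodic meters. All dimensions and
    the fusion scheme are illustrative choices.
    """
    def __init__(self, text_vocab, phon_vocab, n_meters, dim=128):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, dim)
        self.phon_emb = nn.Embedding(phon_vocab, dim)
        self.text_enc = nn.GRU(dim, dim, batch_first=True)
        self.phon_enc = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(2 * dim, n_meters)

    def forward(self, text_ids, phon_ids):
        t, _ = self.text_enc(self.text_emb(text_ids))    # (B, Lt, dim)
        p, _ = self.phon_enc(self.phon_emb(phon_ids))    # (B, Lp, dim)
        fused = torch.cat([t.mean(dim=1), p.mean(dim=1)], dim=-1)
        return self.head(fused)                          # meter logits

model = HybridProsodyClassifier(text_vocab=8000, phon_vocab=60, n_meters=30)
logits = model(torch.randint(0, 8000, (2, 40)), torch.randint(0, 60, (2, 60)))
```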
{"title":"Prosody recognition in Persian poetry","authors":"Mohammadreza Shahrestani,&nbsp;Mostafa Haghir Chehreghani","doi":"10.1016/j.specom.2025.103222","DOIUrl":"10.1016/j.specom.2025.103222","url":null,"abstract":"<div><div>Classical Persian poetry, like traditional poetry from other cultures, follows set metrical patterns, known as prosody. Recognizing prosody of a given poetry is very useful in understanding and analyzing Persian language and literature. With the advances in artificial intelligence (AI) techniques, they became popular to recognize prosody. However, the application of advanced AI methodologies to the task of detecting prosody in Persian poetry is not well-explored. Additionally, The lack of an extensive collection of traditional Persian poems, each meticulously annotated with its prosodic pattern, is another challenge. In this paper, first we create a large dataset of prosodic meters including about 1.3 million couplets, which contains detailed prosodic annotations. Then, we introduce five models that harness advanced deep learning methodologies to discern the prosody of Persian poetry. These models include: (i) a transformer-based classifier, (ii) a grapheme-to-phoneme mapping-based method, (iii) a sequence-to-sequence model, (iv) a sequence-to-sequence model with phonemic sequences, and (v) a hybrid approach that leverages the strengths of both the textual information of poetry and its phonemic sequence. Our experimental results reveal that the hybrid model typically outperforms the other models, especially when applied to large samples of the created dataset. Our code is publicly available in <span><span>https://github.com/m-shahrestani/Prosody-Recognition-in-Persian-Poetry/</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"170 ","pages":"Article 103222"},"PeriodicalIF":2.4,"publicationDate":"2025-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143621360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Combining multilingual resources to enhance end-to-end speech recognition systems for Scandinavian languages
IF 2.4 | CAS Tier 3, Computer Science | Q2 (ACOUSTICS) | Pub Date: 2025-03-08 | DOI: 10.1016/j.specom.2025.103221
Lukas Mateju, Jan Nouza, Petr Cerva, Jindrich Zdansky
Languages with limited training resources, such as Danish, Swedish, and Norwegian, pose a challenge to the development of modern end-to-end (E2E) automatic speech recognition (ASR) systems. We tackle this issue by exploring different ways of exploiting existing multilingual resources. Our approaches combine speech data of closely related languages and/or their already trained models. Of the several options proposed, the most efficient one initializes the E2E encoder parameters with those of other available models, which we call donors. This approach performs well not only for smaller amounts of target-language data but also when thousands of hours are available, and even when the donor comes from a distant language. We study several aspects of these donor-based models, namely the choice of the donor language, the impact of the data size (for both target and donor models), and the option of using different donor-based models simultaneously. This allows us to implement an efficient data collection process in which multiple donor-based models run in parallel and serve as complementary data checkers. This greatly helps to eliminate annotation errors in training sets and during automated data harvesting. The latter is used for efficient processing of diverse public sources (TV, parliament, YouTube, podcasts, and audiobooks) and for training models on thousands of hours of speech. We have also prepared large test sets (link provided) to evaluate all experiments and ultimately compare the performance of our ASR system with that of major ASR service providers for Scandinavian languages.
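A minimal sketch of donor-based initialization as described above: parameters whose names carry an encoder prefix are copied from a trained donor checkpoint into the target model, while everything else (e.g. the decoder and output layer) keeps its fresh initialization. The checkpoint path and the "encoder." prefix are assumptions about the model layout, not details from the paper.

```python
import torch

def init_encoder_from_donor(target_model, donor_ckpt_path, prefix="encoder."):
    """Copy encoder parameters from a trained donor checkpoint into a target model.

    Only parameters whose names start with `prefix`, exist in the target model,
    and match in shape are transferred; the rest keep their current values.
    Assumes the checkpoint is a plain state dict.
    """
    donor_state = torch.load(donor_ckpt_path, map_location="cpu")
    target_state = target_model.state_dict()
    transferred = {
        k: v for k, v in donor_state.items()
        if k.startswith(prefix) and k in target_state and v.shape == target_state[k].shape
    }
    target_state.update(transferred)
    target_model.load_state_dict(target_state)
    return sorted(transferred)   # names of the copied parameters

# Hypothetical usage:
# copied = init_encoder_from_donor(danish_model, "swedish_donor.pt")
```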
{"title":"Combining multilingual resources to enhance end-to-end speech recognition systems for Scandinavian languages","authors":"Lukas Mateju,&nbsp;Jan Nouza,&nbsp;Petr Cerva,&nbsp;Jindrich Zdansky","doi":"10.1016/j.specom.2025.103221","DOIUrl":"10.1016/j.specom.2025.103221","url":null,"abstract":"<div><div>Languages with limited training resources, such as Danish, Swedish, and Norwegian, pose a challenge to the development of modern end-to-end (E2E) automatic speech recognition (ASR) systems. We tackle this issue by exploring different ways of exploiting existing multilingual resources. Our approaches combine speech data of closely related languages and/or their already trained models. From several proposed options, the most efficient one is based on initializing the E2E encoder parameters by those from other available models, which we call donors. This approach performs well not only for smaller amounts of target language data but also when thousands of hours are available and even when the donor comes from a distant language. We study several aspects of these donor-based models, namely the choice of the donor language, the impact of the data size (both for target and donor models), or the option of using different donor-based models simultaneously. This allows us to implement an efficient data collection process in which multiple donor-based models run in parallel and serve as complementary data checkers. This greatly helps to eliminate annotation errors in training sets and during automated data harvesting. The latter is utilized for efficient processing of diverse public sources (TV, parliament, YouTube, podcasts, or audiobooks) and training models based on thousands of hours. We have also prepared large test sets (link provided) to evaluate all experiments and ultimately compare the performance of our ASR system with that of major ASR service providers for Scandinavian languages.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"170 ","pages":"Article 103221"},"PeriodicalIF":2.4,"publicationDate":"2025-03-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143600961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Learnability of English diphthongs: One dynamic target vs. two static targets
IF 2.4 | CAS Tier 3, Computer Science | Q2 (ACOUSTICS) | Pub Date: 2025-03-05 | DOI: 10.1016/j.specom.2025.103225
Anqi Xu , Daniel R. van Niekerk , Branislav Gerazov , Paul Konstantin Krug , Santitham Prom-on , Peter Birkholz , Yi Xu
As vowels with intrinsic movements, diphthongs are among the most elusive speech sounds. Previous research has characterized diphthongs as a combination of two vowels, as a vowel followed by a formant transition, or as a constant rate of formant change. These accounts are based on acoustic patterns, perceptual cues, and either acoustic or articulatory synthesis, but no consensus has been reached. In this study, we probe the nature of diphthongs by examining how they can be acquired through vocal learning. The acquisition is simulated with a three-dimensional (3D) vocal tract model with built-in target approximation dynamics, which can learn articulatory targets of phonetic categories under the guidance of a speech recognizer. The simulation attempts to learn to articulate diphthong-embedded monosyllabic English words with either a single dynamic target or two static targets, and the learned synthetic words were presented to native listeners for identification. The results showed that diphthongs learned with dynamic targets were consistently more intelligible across variable durations than those learned with two static targets, with the sole exception of /aɪ/. From the perspective of learnability, therefore, English diphthongs are likely unitary vowels with dynamic targets.
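The contrast between one dynamic target and two static targets can be illustrated with a deliberately simplified first-order target-approximation process for a single formant; the rate constant, durations, starting values, and the linearly ramping dynamic target below are toy assumptions, not the 3D vocal tract model used in the study.

```python
import numpy as np

def approach_target(target, duration=0.2, fs=1000, rate=60.0, start=1100.0):
    """First-order approximation of a formant trajectory toward a target.

    `target` is either a constant (static target) or a callable returning the
    target value at time t (a moving, dynamic target). This is a simplified
    stand-in for target approximation dynamics, used only to contrast the two
    hypotheses from the abstract.
    """
    t = np.arange(0, duration, 1 / fs)
    y = np.empty_like(t)
    y[0] = start
    for i in range(1, len(t)):
        goal = target(t[i]) if callable(target) else target
        y[i] = y[i - 1] + rate * (goal - y[i - 1]) / fs   # move a fraction toward the goal
    return t, y

# An /aI/-like F2 contour: one dynamic target ramping 1100 -> 2100 Hz,
# versus two static targets held for half the duration each.
_, f2_dynamic = approach_target(lambda t: 1100 + 5000 * t)
_, f2_static_1 = approach_target(1100, duration=0.1)
_, f2_static_2 = approach_target(2100, duration=0.1, start=f2_static_1[-1])
```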
{"title":"Learnability of English diphthongs: One dynamic target vs. two static targets","authors":"Anqi Xu ,&nbsp;Daniel R. van Niekerk ,&nbsp;Branislav Gerazov ,&nbsp;Paul Konstantin Krug ,&nbsp;Santitham Prom-on ,&nbsp;Peter Birkholz ,&nbsp;Yi Xu","doi":"10.1016/j.specom.2025.103225","DOIUrl":"10.1016/j.specom.2025.103225","url":null,"abstract":"<div><div>As vowels with intrinsic movements, diphthongs are among the most elusive sounds of speech. Previous research has characterized diphthongs as a combination of two vowels, a vowel followed by a formant transition, or a constant rate of formant change. These accounts are based on acoustic patterns, perceptual cues, and either acoustic or articulatory synthesis, but no consensus has been reached. In this study, we explore the nature of diphthongs by exploring how they can be acquired through vocal learning. The acquisition is simulated by a three-dimensional (3D) vocal tract model with built-in target approximation dynamics, which can learn articulatory targets of phonetic categories under the guidance of a speech recognizer. The simulation attempts to learn to articulate diphthong-embedded monosyllabic English words with either a single dynamic target or two static targets, and the learned synthetic words were presented to native listeners for identification. The results showed that diphthongs learned with dynamic targets were consistently more intelligible across variable durations than those learned with two static targets, with only the exception of /aɪ/. From the perspective of learnability, therefore, English diphthongs are likely unitary vowels with dynamic targets.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"170 ","pages":"Article 103225"},"PeriodicalIF":2.4,"publicationDate":"2025-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143611172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
The Ohio Child Speech Corpus
IF 2.4 | CAS Tier 3, Computer Science | Q2 (ACOUSTICS) | Pub Date: 2025-03-04 | DOI: 10.1016/j.specom.2025.103206
Laura Wagner , Sharifa Alghowinhem , Abeer Alwan , Kristina Bowdrie , Cynthia Breazeal , Cynthia G. Clopper , Eric Fosler-Lussier , Izabela A. Jamsek , Devan Lander , Rajiv Ramnath , Jory Ross
This paper reports on the creation and composition of a new corpus of children's speech, the Ohio Child Speech Corpus, which is publicly available on the Talkbank-CHILDES website. The audio corpus contains speech samples from 303 children ranging in age from 4 to 9 years old, all of whom participated in a seven-task elicitation protocol conducted in a science museum lab. In addition, an interactive social robot controlled by the researchers joined the sessions for approximately 60% of the children, and the corpus itself was collected in the peri-pandemic period. Two analyses highlighting these last two features are reported. One found that the children spoke significantly more in the presence of the robot than in its absence, but no effect of the robot's presence on speech complexity (as measured by MLU) was found. The other compared children tested immediately post-pandemic with children tested a year later on two school-readiness tasks, an Alphabet task and a Reading Passages task; it showed no negative impact on these tasks for our highly educated sample of children tested just after the pandemic relative to those tested later. These analyses demonstrate just two of the possible questions that this corpus could be used to investigate.
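MLU, the speech-complexity measure mentioned above, is the mean length of utterance; the snippet below computes a simple word-based version (CHILDES practice typically counts morphemes, so this is only a self-contained approximation).

```python
def mean_length_of_utterance(utterances):
    """Mean length of utterance (MLU) in words over a list of utterances.

    Word-based MLU only; morpheme-based MLU would require morphological
    segmentation of each utterance.
    """
    counts = [len(u.split()) for u in utterances if u.strip()]
    return sum(counts) / len(counts) if counts else 0.0

sample = ["I saw a robot", "it talked to me", "yes"]
print(mean_length_of_utterance(sample))   # (4 + 4 + 1) / 3 = 3.0
```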
{"title":"The Ohio Child Speech Corpus","authors":"Laura Wagner ,&nbsp;Sharifa Alghowinhem ,&nbsp;Abeer Alwan ,&nbsp;Kristina Bowdrie ,&nbsp;Cynthia Breazeal ,&nbsp;Cynthia G. Clopper ,&nbsp;Eric Fosler-Lussier ,&nbsp;Izabela A. Jamsek ,&nbsp;Devan Lander ,&nbsp;Rajiv Ramnath ,&nbsp;Jory Ross","doi":"10.1016/j.specom.2025.103206","DOIUrl":"10.1016/j.specom.2025.103206","url":null,"abstract":"<div><div>This paper reports on the creation and composition of a new corpus of children's speech, the Ohio Child Speech Corpus, which is publicly available on the Talkbank-CHILDES website. The audio corpus contains speech samples from 303 children ranging in age from 4 – 9 years old, all of whom participated in a seven-task elicitation protocol conducted in a science museum lab. In addition, an interactive social robot controlled by the researchers joined the sessions for approximately 60% of the children, and the corpus itself was collected in the peri‑pandemic period. Two analyses are reported that highlighted these last two features. One set of analyses found that the children spoke significantly more in the presence of the robot relative to its absence, but no effects of speech complexity (as measured by MLU) were found for the robot's presence. Another set of analyses compared children tested immediately post-pandemic to children tested a year later on two school-readiness tasks, an Alphabet task and a Reading Passages task. This analysis showed no negative impact on these tasks for our highly-educated sample of children just coming off of the pandemic relative to those tested later. These analyses demonstrate just two possible types of questions that this corpus could be used to investigate.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"170 ","pages":"Article 103206"},"PeriodicalIF":2.4,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143534141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0