
Speech Communication: Latest Publications

MFFN: Multi-level Feature Fusion Network for monaural speech separation
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2025-04-01, DOI: 10.1016/j.specom.2025.103229
Jianjun Lei, Yun He, Ying Wang
Monaural speech separation based on dual-path networks has recently been widely developed owing to their outstanding ability to process long feature sequences. However, these methods often exploit a fixed receptive field during feature learning, which hardly captures feature information at different scales and thus restricts the model’s performance. This paper proposes a novel Multi-level Feature Fusion Network (MFFN) to facilitate dual-path networks for monaural speech separation by capturing multi-scale information. The MFFN integrates information of different scales from long sequences by using a multi-scale sampling strategy and employs Squeeze-and-Excitation blocks in parallel to extract features along the channel and temporal dimensions. Moreover, we introduce a collaborative attention mechanism to fuse feature information across different levels, further improving the model’s representation capability. Finally, we conduct extensive experiments on the noise-free datasets WSJ0-2mix and Libri2mix and the noisy datasets WHAM! and WHAMR!. The results demonstrate that our MFFN outperforms several current methods without using data augmentation techniques.
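The abstract mentions Squeeze-and-Excitation (SE) blocks applied in parallel along the channel and temporal dimensions. As a rough illustration only (the paper's exact block design is not given here), the sketch below shows a standard channel-wise SE block applied to separator features of shape [batch, channels, time]; the class name and reduction factor are placeholders.

```python
# A minimal sketch of a channel-wise Squeeze-and-Excitation block; not the authors'
# exact architecture, just one plausible reading of the channel branch in the abstract.
import torch
import torch.nn as nn

class ChannelSE(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, channels, time]
        squeeze = x.mean(dim=-1)           # global average pooling over time
        weights = self.fc(squeeze)         # per-channel gating weights in (0, 1)
        return x * weights.unsqueeze(-1)   # re-scale each channel

if __name__ == "__main__":
    feats = torch.randn(2, 64, 200)        # dummy separator features
    print(ChannelSE(64)(feats).shape)      # torch.Size([2, 64, 200])
```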
Citations: 0
Exploiting Locality Sensitive Hashing - Clustering and gloss feature for sign language production
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2025-03-28, DOI: 10.1016/j.specom.2025.103227
Hu Jin, Shujun Zhang, Zilong Yang, Qi Han, Jianping Cao
Automatic Sign Language Production (SLP), which converts spoken-language sentences into continuous sign pose sequences, is crucial for digital interactive applications of sign language. Long text-sequence inputs make current deep-learning-based SLP models inefficient and unable to fully exploit the intricate information conveyed by sign language, so the generated skeleton pose sequences may not be comprehensible or acceptable to individuals with hearing impairments. In this paper, we propose a sign language production method that utilizes Locality Sensitive Hashing-Clustering to automatically aggregate similar and identical embedded word vectors and capture long-distance dependencies, thereby enhancing the accuracy of SLP. In addition, a multi-scale feature extraction network is designed to extract local gloss features and combine them with embedded text vectors to enrich the textual information. Extensive experimental results on the challenging RWTH-PHOENIX-Weather 2014T (PHOENIX14T) dataset show that our model outperforms the baseline method.
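The locality-sensitive hashing step groups similar embedded word vectors into buckets. The sketch below is a minimal, assumption-laden illustration of random-hyperplane LSH over word embeddings; the paper's exact LSH-Clustering procedure, embedding dimensionality, and bucket handling are not specified here, so all numbers are placeholders.

```python
# Random-hyperplane LSH: vectors with identical sign patterns land in the same bucket.
import numpy as np
from collections import defaultdict

def lsh_buckets(embeddings: np.ndarray, n_bits: int = 8, seed: int = 0):
    """Group row vectors whose sign codes w.r.t. random hyperplanes match exactly."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((embeddings.shape[1], n_bits))
    signs = (embeddings @ planes) > 0            # [n_words, n_bits] boolean codes
    buckets = defaultdict(list)
    for idx, code in enumerate(signs):
        buckets[code.tobytes()].append(idx)      # identical codes share a bucket
    return buckets

if __name__ == "__main__":
    vecs = np.random.randn(100, 300)             # dummy embedded word vectors
    groups = lsh_buckets(vecs)
    print(len(groups), "buckets for 100 vectors")
```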
Citations: 0
Speech emotion recognition using energy based adaptive mode selection
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2025-03-22, DOI: 10.1016/j.specom.2025.103228
Ravi, Sachin Taran
In this framework, a speech emotion recognition approach is presented that relies on Variational Mode Decomposition (VMD) and adaptive mode selection utilizing energy information. Instead of directly analyzing speech signals, this work focuses on the preprocessing of raw speech signals. Initially, a given speech signal is decomposed using VMD, and then the energy of each mode is calculated. Based on this energy estimation, the dominant modes are selected for signal reconstruction. VMD combined with energy estimation improves the predictability of the reconstructed speech signal, as demonstrated using root mean square and spectral entropy measures. The reconstructed signal is divided into frames, and prosodic and spectral features are then calculated. Following feature extraction, the ReliefF algorithm is utilized for feature optimization. The resultant feature set is used to train a fine K-nearest neighbor classifier for emotion identification. The proposed framework was tested on publicly available acted and elicited datasets. For the acted datasets, the proposed framework achieved 93.8 %, 95.8 %, and 93.4 % accuracy on the language-specific RAVDESS-speech, Emo-DB, and EMOVO datasets, respectively. Furthermore, the proposed method has also proven to be robust across three languages (English, German, and Italian), with language sensitivity as low as 2.4 % compared to existing methods. For the elicited dataset IEMOCAP, the proposed framework achieved the highest accuracy of 83.1 % compared to the existing state of the art.
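The energy-based selection step can be pictured as follows: given the modes produced by any VMD implementation, keep the modes that carry most of the signal energy and sum them to reconstruct the speech. This is a minimal sketch under assumptions; the energy threshold and the mode shapes are illustrative, not the paper's settings.

```python
# Energy-based dominant-mode selection over modes of shape [K, n_samples].
import numpy as np

def select_dominant_modes(modes: np.ndarray, energy_ratio: float = 0.9) -> np.ndarray:
    energies = np.sum(modes ** 2, axis=1)                 # per-mode energy
    order = np.argsort(energies)[::-1]                    # strongest modes first
    cumulative = np.cumsum(energies[order]) / energies.sum()
    keep = order[: int(np.searchsorted(cumulative, energy_ratio)) + 1]
    return modes[keep].sum(axis=0)                        # reconstructed signal

if __name__ == "__main__":
    # Fake modes with decreasing energy, standing in for real VMD output.
    fake_modes = np.random.randn(5, 16000) * np.array([[3.0], [2.0], [1.0], [0.2], [0.1]])
    print(select_dominant_modes(fake_modes).shape)        # (16000,)
```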
Citations: 0
Impacts of telecommunications latency on the timing of speaker transitions
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2025-03-18, DOI: 10.1016/j.specom.2025.103226
David W. Edwards
Transitions from speaker to speaker occur very rapidly in conversations. However, common telecommunication systems, from landline telephones to online video conferencing, introduce latency into the turn-taking process. To measure the impact of latency on conversation, this study examines 61 audio-only conversations in which latency was introduced partway through the call and removed several minutes later. It finds an increase in overlap proportional to latency. The results also indicate that speakers increase transition times in response to latency and that this increase persists even after latency is removed. Participants made these behavioral changes despite not recognizing the presence of latency during the call.
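The quantities discussed (overlap and transition time) can be derived from turn start/end times: the offset between one speaker's turn end and the next speaker's turn start is negative for overlaps and positive for gaps. The sketch below is an illustration under assumptions; the tuple format and example timings are invented, not the study's data format.

```python
# Compute speaker-transition offsets (gaps > 0, overlaps < 0) from ordered turns.
from typing import List, Tuple

def transition_offsets(turns: List[Tuple[str, float, float]]) -> List[float]:
    """turns: list of (speaker, start_sec, end_sec), sorted by start time."""
    offsets = []
    for (spk_a, _, end_a), (spk_b, start_b, _) in zip(turns, turns[1:]):
        if spk_a != spk_b:                        # only speaker changes count
            offsets.append(start_b - end_a)       # < 0 means overlapping speech
    return offsets

if __name__ == "__main__":
    demo = [("A", 0.0, 2.1), ("B", 1.9, 4.0), ("A", 4.6, 6.0)]
    print(transition_offsets(demo))               # approximately [-0.2, 0.6]
```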
Citations: 0
LLM-based speaker diarization correction: A generalizable approach
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2025-03-13, DOI: 10.1016/j.specom.2025.103224
Georgios Efstathiadis, Vijay Yadav, Anzar Abbas
Speaker diarization is necessary for interpreting conversations transcribed using automated speech recognition (ASR) tools. Despite significant developments in diarization methods, diarization accuracy remains an issue. Here, we investigate the use of large language models (LLMs) for diarization correction as a post-processing step. LLMs were fine-tuned using the Fisher corpus, a large dataset of transcribed conversations. The ability of the models to improve diarization accuracy in a holdout dataset from the Fisher corpus as well as an independent dataset was measured. We report that fine-tuned LLMs can markedly improve diarization accuracy. However, model performance is constrained to transcripts produced using the same ASR tool as the transcripts used for fine-tuning, limiting generalizability. To address this constraint, an ensemble model was developed by combining weights from three separate models, each fine-tuned using transcripts from a different ASR tool. The ensemble model demonstrated better overall performance than each of the ASR-specific models, suggesting that a generalizable and ASR-agnostic approach may be achievable. We have made the weights of these models publicly available on HuggingFace at https://huggingface.co/bklynhlth.
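The ensemble described in the abstract is built by combining the weights of three separately fine-tuned models. One simple way to realize that idea is a parameter-wise average of the checkpoints' state dictionaries, as sketched below; this is a hedged illustration, since the paper's exact merging recipe is not given here, and the checkpoint paths are placeholders.

```python
# Merge several fine-tuned checkpoints by averaging their parameters element-wise.
import torch

def average_state_dicts(paths):
    state_dicts = [torch.load(p, map_location="cpu") for p in paths]
    merged = {}
    for key in state_dicts[0]:
        merged[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return merged

if __name__ == "__main__":
    # Hypothetical checkpoints, each fine-tuned on transcripts from a different ASR tool.
    ckpts = ["llm_asr_a.pt", "llm_asr_b.pt", "llm_asr_c.pt"]
    # merged = average_state_dicts(ckpts)  # then load into the base model via load_state_dict
```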
Citations: 0
Enhancing bone-conducted speech with spectrum similarity metric in adversarial learning
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2025-03-13, DOI: 10.1016/j.specom.2025.103223
Yan Pan, Jian Zhou, Huabin Wang, Wenming Zheng, Liang Tao, Hon Keung Kwan
Although bone-conducted (BC) speech offers the advantage of being insusceptible to background noise, its transmission path through bone tissue entails not only serious attenuation of high-frequency components but also speech distortion and the loss of unvoiced speech, resulting in a substantial degradation in both speech quality and intelligibility. Existing BC speech enhancement methods focus mainly on restoring high-frequency components but overlook the restoration of missing unvoiced speech and the mitigation of speech distortion, resulting in a noticeable gap in speech quality and intelligibility compared to air-conducted (AC) speech. In this paper, a spectrum-similarity-metric-based adversarial learning method is proposed for bone-conducted speech enhancement. The acoustic features corresponding to the source excitation and filter response are disentangled using the WORLD vocoder and mapped to their AC speech counterparts with logarithmic Gaussian normalization and a vocal tract converter, respectively. To reconstruct unvoiced speech from BC speech and decrease the nonlinear speech distortion in BC speech, the vocal tract converter predicts low-dimensional Mel-cepstral coefficients of AC speech using a generator that is supervised by a classification discriminator and a spectrum similarity discriminator. While the classification discriminator is used to distinguish between authentic AC speech and enhanced BC speech, the spectrum similarity discriminator is designed to evaluate the spectrum similarity between enhanced BC speech and its AC counterpart. To evaluate spectrum similarity, the correlation of time–frequency units in long-duration spectra is captured within the self-attention layer embedded in the spectrum similarity discriminator. Experimental results on various speech datasets show that the proposed method is capable of restoring unvoiced speech segments and diminishing speech distortion, thus predicting an accurate fine-grained AC spectrum and yielding significant improvements in speech quality and intelligibility.
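The logarithmic Gaussian normalization mentioned for the excitation features is commonly realized as a mean/variance mapping of log-F0 from source to target statistics. The sketch below illustrates that standard transform under assumptions; the statistics are placeholders, and the BC F0 contour is assumed to come from a WORLD-style analysis.

```python
# Log-Gaussian normalized F0 conversion from BC-speech statistics to AC-speech statistics.
import numpy as np

def lognormal_f0_convert(f0_bc: np.ndarray,
                         mean_bc: float, std_bc: float,
                         mean_ac: float, std_ac: float) -> np.ndarray:
    """Map voiced F0 values in the log domain; unvoiced frames (F0 == 0) stay 0."""
    f0_out = np.zeros_like(f0_bc)
    voiced = f0_bc > 0
    log_f0 = np.log(f0_bc[voiced])
    f0_out[voiced] = np.exp((log_f0 - mean_bc) / std_bc * std_ac + mean_ac)
    return f0_out

if __name__ == "__main__":
    f0 = np.array([0.0, 110.0, 115.0, 0.0, 120.0])       # dummy BC F0 contour (Hz)
    print(lognormal_f0_convert(f0, np.log(110), 0.2, np.log(180), 0.25))
```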
Citations: 0
Prosody recognition in Persian poetry
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2025-03-10, DOI: 10.1016/j.specom.2025.103222
Mohammadreza Shahrestani, Mostafa Haghir Chehreghani
Classical Persian poetry, like traditional poetry from other cultures, follows set metrical patterns, known as prosody. Recognizing the prosody of a given poem is very useful for understanding and analyzing Persian language and literature. With advances in artificial intelligence (AI), such techniques have become popular for recognizing prosody. However, the application of advanced AI methodologies to the task of detecting prosody in Persian poetry is not well explored. Additionally, the lack of an extensive collection of traditional Persian poems, each meticulously annotated with its prosodic pattern, is another challenge. In this paper, we first create a large dataset of prosodic meters comprising about 1.3 million couplets with detailed prosodic annotations. Then, we introduce five models that harness advanced deep learning methodologies to discern the prosody of Persian poetry. These models include: (i) a transformer-based classifier, (ii) a grapheme-to-phoneme mapping-based method, (iii) a sequence-to-sequence model, (iv) a sequence-to-sequence model with phonemic sequences, and (v) a hybrid approach that leverages the strengths of both the textual information of poetry and its phonemic sequence. Our experimental results reveal that the hybrid model typically outperforms the other models, especially when applied to large samples of the created dataset. Our code is publicly available at https://github.com/m-shahrestani/Prosody-Recognition-in-Persian-Poetry/.
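For the first model variant, a transformer-based classifier labels a couplet with one of N prosodic meters. The sketch below is a generic illustration of such a classifier using the Hugging Face API; the checkpoint name, number of meter classes, and example input are placeholders, not the paper's configuration.

```python
# Generic transformer sequence classifier for meter prediction (illustrative setup only).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "HooshvareLab/bert-fa-base-uncased"   # assumed Persian BERT checkpoint
num_meters = 30                                    # placeholder number of prosodic classes

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=num_meters)

couplet = "..."                                    # a Persian couplet would go here
inputs = tokenizer(couplet, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print("predicted meter id:", logits.argmax(dim=-1).item())
```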
Citations: 0
Combining multilingual resources to enhance end-to-end speech recognition systems for Scandinavian languages
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2025-03-08, DOI: 10.1016/j.specom.2025.103221
Lukas Mateju, Jan Nouza, Petr Cerva, Jindrich Zdansky
Languages with limited training resources, such as Danish, Swedish, and Norwegian, pose a challenge to the development of modern end-to-end (E2E) automatic speech recognition (ASR) systems. We tackle this issue by exploring different ways of exploiting existing multilingual resources. Our approaches combine speech data of closely related languages and/or their already trained models. From several proposed options, the most efficient one is based on initializing the E2E encoder parameters by those from other available models, which we call donors. This approach performs well not only for smaller amounts of target language data but also when thousands of hours are available and even when the donor comes from a distant language. We study several aspects of these donor-based models, namely the choice of the donor language, the impact of the data size (both for target and donor models), or the option of using different donor-based models simultaneously. This allows us to implement an efficient data collection process in which multiple donor-based models run in parallel and serve as complementary data checkers. This greatly helps to eliminate annotation errors in training sets and during automated data harvesting. The latter is utilized for efficient processing of diverse public sources (TV, parliament, YouTube, podcasts, or audiobooks) and training models based on thousands of hours. We have also prepared large test sets (link provided) to evaluate all experiments and ultimately compare the performance of our ASR system with that of major ASR service providers for Scandinavian languages.
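The most efficient option described, initializing the E2E encoder from a "donor" model, amounts to copying the donor's encoder parameters into a freshly built target-language model before training. The sketch below illustrates that idea under assumptions; the key prefix "encoder." and checkpoint path are placeholders for whatever naming scheme the actual toolkit uses.

```python
# Copy encoder parameters from a trained donor checkpoint into a new target model.
import torch

def init_encoder_from_donor(target_model: torch.nn.Module, donor_ckpt_path: str,
                            prefix: str = "encoder.") -> None:
    donor_state = torch.load(donor_ckpt_path, map_location="cpu")
    target_state = target_model.state_dict()
    copied = {k: v for k, v in donor_state.items()
              if k.startswith(prefix) and k in target_state
              and v.shape == target_state[k].shape}
    target_state.update(copied)                     # other parameters keep random init
    target_model.load_state_dict(target_state)
    print(f"initialized {len(copied)} encoder tensors from donor")
```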
Citations: 0
Learnability of English diphthongs: One dynamic target vs. two static targets
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2025-03-05, DOI: 10.1016/j.specom.2025.103225
Anqi Xu, Daniel R. van Niekerk, Branislav Gerazov, Paul Konstantin Krug, Santitham Prom-on, Peter Birkholz, Yi Xu
As vowels with intrinsic movements, diphthongs are among the most elusive sounds of speech. Previous research has characterized diphthongs as a combination of two vowels, a vowel followed by a formant transition, or a constant rate of formant change. These accounts are based on acoustic patterns, perceptual cues, and either acoustic or articulatory synthesis, but no consensus has been reached. In this study, we explore the nature of diphthongs by exploring how they can be acquired through vocal learning. The acquisition is simulated by a three-dimensional (3D) vocal tract model with built-in target approximation dynamics, which can learn articulatory targets of phonetic categories under the guidance of a speech recognizer. The simulation attempts to learn to articulate diphthong-embedded monosyllabic English words with either a single dynamic target or two static targets, and the learned synthetic words were presented to native listeners for identification. The results showed that diphthongs learned with dynamic targets were consistently more intelligible across variable durations than those learned with two static targets, with only the exception of /aɪ/. From the perspective of learnability, therefore, English diphthongs are likely unitary vowels with dynamic targets.
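The two hypotheses can be visualized by letting a formant trajectory approach either a single moving (dynamic) target or two successive static targets. The sketch below uses a simple first-order approach rate purely for illustration; the article's 3D vocal tract model has its own target-approximation dynamics, and the formant values and rate here are invented.

```python
# Contrast a dynamic target (one moving target) with two static targets for one formant.
import numpy as np

def approach(targets: np.ndarray, rate: float = 40.0, dt: float = 0.001) -> np.ndarray:
    """Integrate dx/dt = rate * (target(t) - x) over the target sequence."""
    x = np.empty_like(targets)
    x[0] = targets[0]
    for i in range(1, len(targets)):
        x[i] = x[i - 1] + rate * (targets[i] - x[i - 1]) * dt
    return x

if __name__ == "__main__":
    n = 300                                            # 300 ms at 1 ms steps
    dynamic = np.linspace(800.0, 2200.0, n)            # one linearly moving F2 target (Hz)
    static = np.concatenate([np.full(n // 2, 800.0),   # two static targets, switched mid-vowel
                             np.full(n - n // 2, 2200.0)])
    print(approach(dynamic)[-1], approach(static)[-1])
```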
Citations: 0
The Ohio Child Speech Corpus
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2025-03-04, DOI: 10.1016/j.specom.2025.103206
Laura Wagner, Sharifa Alghowinhem, Abeer Alwan, Kristina Bowdrie, Cynthia Breazeal, Cynthia G. Clopper, Eric Fosler-Lussier, Izabela A. Jamsek, Devan Lander, Rajiv Ramnath, Jory Ross
This paper reports on the creation and composition of a new corpus of children's speech, the Ohio Child Speech Corpus, which is publicly available on the Talkbank-CHILDES website. The audio corpus contains speech samples from 303 children ranging in age from 4 – 9 years old, all of whom participated in a seven-task elicitation protocol conducted in a science museum lab. In addition, an interactive social robot controlled by the researchers joined the sessions for approximately 60% of the children, and the corpus itself was collected in the peri‑pandemic period. Two analyses are reported that highlighted these last two features. One set of analyses found that the children spoke significantly more in the presence of the robot relative to its absence, but no effects of speech complexity (as measured by MLU) were found for the robot's presence. Another set of analyses compared children tested immediately post-pandemic to children tested a year later on two school-readiness tasks, an Alphabet task and a Reading Passages task. This analysis showed no negative impact on these tasks for our highly-educated sample of children just coming off of the pandemic relative to those tested later. These analyses demonstrate just two possible types of questions that this corpus could be used to investigate.
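The speech-complexity measure cited in the abstract, MLU (mean length of utterance), is simply the average utterance length over a child's transcript. The sketch below is a minimal, assumption-laden version that counts words rather than morphemes; the example transcript is invented.

```python
# Mean length of utterance (in words) over a list of transcribed child utterances.
def mean_length_of_utterance(utterances):
    tokens_per_utt = [len(u.split()) for u in utterances if u.strip()]
    return sum(tokens_per_utt) / len(tokens_per_utt) if tokens_per_utt else 0.0

if __name__ == "__main__":
    sample = ["I like the robot", "it talks", "can we do the alphabet one again"]
    print(round(mean_length_of_utterance(sample), 2))   # 4.33
```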
Citations: 0