
Latest publications in Speech Communication

Pronunciation error detection model based on feature fusion
IF 3.2, CAS Zone 3 (Computer Science), Q2 ACOUSTICS. Pub Date: 2023-11-14. DOI: 10.1016/j.specom.2023.103009
Cuicui Zhu, Aishan Wumaier, Dongping Wei, Zhixing Fan, Jianlei Yang, Heng Yu, Zaokere Kadeer, Liejun Wang

Mispronunciation detection and diagnosis (MDD) is a specific speech recognition task that aims to recognize the phoneme sequence produced by a user, compare it with the standard phoneme sequence, and identify the type and location of any mispronunciations. However, the lack of large amounts of phoneme-level annotated data limits further improvement of model performance. In this paper, we propose a joint training approach, Acoustic Error_Type Linguistic (AEL), which utilizes the error-type information, acoustic information, and linguistic information in the annotated data and achieves feature fusion through multiple attention mechanisms. To address the uneven distribution of phonemes in MDD data, which can cause the model to make overconfident predictions when trained with the CTC loss, we propose a new loss function, Focal Attention Loss, that improves metrics such as the F1 score and accuracy. The proposed method was evaluated on the TIMIT and L2-Arctic public corpora. Under ideal conditions, it was compared with the baseline CNN-RNN-CTC model: the F1 score, diagnostic accuracy, and precision improved by 31.24%, 16.6%, and 17.35%, respectively. Compared with the baseline, our model also reduced the phoneme error rate from 29.55% to 8.49% and showed significant improvements in the other metrics. Furthermore, the experimental results demonstrate that, when a model capable of accurately obtaining pronunciation error types is available, our model can achieve results close to those obtained under ideal conditions.
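
The abstract does not detail the AEL fusion architecture, so the sketch below is only a hypothetical illustration of fusing an acoustic feature sequence with a linguistic one via cross-attention in PyTorch; the dimensions, layer choices, and class name are placeholders rather than the authors' design.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Hypothetical cross-attention fusion of acoustic and linguistic features."""

    def __init__(self, acoustic_dim=256, linguistic_dim=256, num_heads=4):
        super().__init__()
        self.proj = nn.Linear(linguistic_dim, acoustic_dim)
        self.cross_attn = nn.MultiheadAttention(acoustic_dim, num_heads,
                                                batch_first=True)
        self.norm = nn.LayerNorm(acoustic_dim)

    def forward(self, acoustic, linguistic):
        # acoustic:   (batch, T_a, acoustic_dim)   frame-level encoder outputs
        # linguistic: (batch, T_l, linguistic_dim) e.g. phoneme embeddings
        keys = self.proj(linguistic)
        fused, _ = self.cross_attn(query=acoustic, key=keys, value=keys)
        return self.norm(acoustic + fused)        # residual connection

fusion = AttentionFusion()
out = fusion(torch.randn(2, 120, 256), torch.randn(2, 40, 256))
print(out.shape)  # torch.Size([2, 120, 256])
```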

Citations: 0
Adapted Weighted Linear Prediction with Attenuated Main Excitation for formant frequency estimation in high-pitched singing
IF 3.2, CAS Zone 3 (Computer Science), Q2 ACOUSTICS. Pub Date: 2023-11-09. DOI: 10.1016/j.specom.2023.103006
Eduardo Barrientos, Edson Cataldo

This paper aims to show how to improve the accuracy of formant frequency estimation in the singing voice of a lyric soprano. Conventional methods of formant frequency estimation may not accurately capture the formant frequencies of the singing voice, particularly in the highest pitch range of a lyric soprano, where the lowest formants are biased by the pitch harmonics. To address this issue, the study proposes adapting the Weighted Linear Prediction with Attenuated Main Excitation (WLP-AME) method for formant frequency estimation. Specific methods for glottal closure instant estimation were required because glottal closure patterns differ between speech and singing. The study evaluates the accuracy of the proposed method by comparing its performance with that of the LPC method on pitch series arranged in an ascending musical scale. The results indicate that the adapted WLP-AME method consistently outperformed the LPC method in estimating the formant frequencies of vowels sung by a lyric soprano. In addition, by estimating the formant frequencies of a synthetic /i/ vowel sung by a soprano at the musical note E5, the study showed that the adapted WLP-AME method provided formant frequency values closer to the correct values than those estimated by the LPC method. Overall, these results suggest parameter values of the AME function that optimize its performance, with potential applications in fields such as singing and medicine.
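
For reference, the sketch below implements the conventional autocorrelation-LPC baseline that the adapted WLP-AME method is compared against: predictor coefficients are solved from the Yule-Walker equations and formants are read off the angles of the polynomial roots. The model order, pre-emphasis coefficient, and synthetic test frame are illustrative choices, not values from the paper.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_formants(frame, fs, order=12, num_formants=3):
    """Estimate formant frequencies of one voiced frame with standard LPC."""
    x = lfilter([1.0, -0.97], [1.0], frame) * np.hamming(len(frame))
    # Autocorrelation method: solve the Yule-Walker equations for the predictor.
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = solve_toeplitz(r[:-1], r[1:])                 # predictor coefficients
    roots = np.roots(np.concatenate(([1.0], -a)))     # zeros of the LPC polynomial A(z)
    roots = roots[np.imag(roots) > 0]                 # keep one root per conjugate pair
    freqs = np.sort(np.angle(roots) * fs / (2.0 * np.pi))
    return freqs[freqs > 90.0][:num_formants]         # drop near-DC roots

fs = 16000
t = np.arange(int(0.03 * fs)) / fs
rng = np.random.default_rng(0)
frame = (np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 2300 * t)
         + 0.01 * rng.normal(size=t.size))            # synthetic stand-in frame
print(lpc_formants(frame, fs))                        # peaks near 300 Hz and 2300 Hz
```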

Citations: 0
JNV corpus: A corpus of Japanese nonverbal vocalizations with diverse phrases and emotions
IF 3.2, CAS Zone 3 (Computer Science), Q2 ACOUSTICS. Pub Date: 2023-11-04. DOI: 10.1016/j.specom.2023.103004
Detai Xin, Shinnosuke Takamichi, Hiroshi Saruwatari

We present the JNV (Japanese Nonverbal Vocalizations) corpus, a corpus of Japanese nonverbal vocalizations (NVs) with diverse phrases and emotions. Existing Japanese NV corpora either lack phrase diversity or focus on a small number of emotions, which makes it difficult to analyze the characteristics of Japanese NVs and to support downstream tasks such as emotion recognition. We first propose a corpus-design method with two phases: (1) collecting NV phrases through crowd-sourcing; (2) recording NVs by prompting speakers with emotional scenarios. Using this method, we collect 420 audio clips from 4 speakers covering 6 emotions. Results of comprehensive objective and subjective experiments demonstrate that (1) the emotions of the collected NVs can be recognized with high accuracy by both human evaluators and statistical models, and (2) the collected NVs have an authenticity comparable to that of previous corpora of English NVs. Additionally, we analyze the distributions of vowel types in Japanese and conduct a feature importance analysis to show which acoustic features discriminate between emotion categories in Japanese NVs. We release JNV publicly to advance further development in this field.
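
The abstract does not specify how the feature importance analysis was carried out; as one common way to do such an analysis, the sketch below ranks a handful of acoustic features with a random forest. The feature names and data are placeholders, not values from JNV.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feature_names = ["f0_mean", "f0_range", "intensity", "hnr", "duration"]  # hypothetical features
rng = np.random.default_rng(0)
X = rng.normal(size=(420, len(feature_names)))     # one row per NV clip (stand-in data)
y = rng.integers(0, 6, size=420)                   # six emotion labels

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
for name, score in sorted(zip(feature_names, forest.feature_importances_),
                          key=lambda p: -p[1]):
    print(f"{name:10s} {score:.3f}")               # higher score = more discriminative
```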

Citations: 0
Disordered speech recognition considering low resources and abnormal articulation
IF 3.2, CAS Zone 3 (Computer Science), Q2 ACOUSTICS. Pub Date: 2023-11-01. DOI: 10.1016/j.specom.2023.103002
Yuqin Lin, Jianwu Dang, Longbiao Wang, Sheng Li, Chenchen Ding

The success of automatic speech recognition (ASR) benefits a great number of healthy people, but not people with speech disorders. Those with disordered speech may truly need technological support, yet they currently gain little from it. The difficulties of disordered ASR arise from the limited availability of data and the abnormal nature of the speech, e.g., unclear, unstable, and incorrect pronunciations. To realize ASR for disordered speech, this study addresses the problem in two respects: low resources and articulatory abnormality. To address the low-resource problem, this study proposes staged knowledge distillation (KD), which provides different references to the student models according to their mastery of knowledge, so as to avoid feature overfitting. To tackle the articulatory abnormalities in dysarthria, we propose an intended phonological perception method (IPPM) that applies the motor theory of speech perception to ASR, in which intended phonological features are estimated and provided to the ASR system. Further, we address the challenges of disordered ASR by combining the staged KD and the IPPM. The TORGO database and the UASPEECH corpus are two commonly used datasets of dysarthric speech, dysarthria being a main cause of speech disorders. Experiments on the two datasets validated the effectiveness of the proposed methods. Compared with the baseline, the proposed method achieves 35.14%–38.12% relative phoneme error rate reductions (PERRs) for speakers with varying degrees of dysarthria on the TORGO database and 8.17%–13.00% relative PERRs on the UASPEECH corpus. The experiments demonstrate that addressing disordered speech from both the low-resource and the speech-abnormality perspectives is an effective way to solve these problems, and the proposed methods significantly improve ASR performance for disordered speech.
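
For background, the snippet below shows the standard knowledge-distillation objective (soft-label KL divergence between teacher and student); the paper's staged KD, which changes the reference given to the student as its mastery improves, is not reproduced here.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Generic soft-label knowledge-distillation loss (not the staged KD itself)."""
    t = temperature
    log_student = F.log_softmax(student_logits / t, dim=-1)
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    # batchmean averages the KL divergence over the batch; t**2 rescales the
    # gradients as in the usual distillation formulation.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * t * t

student = torch.randn(8, 42, requires_grad=True)   # e.g. scores over 42 phoneme classes
teacher = torch.randn(8, 42)
loss = distillation_loss(student, teacher)
loss.backward()
print(float(loss))
```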

Citations: 0
Multimodal Arabic emotion recognition using deep learning
IF 3.2, CAS Zone 3 (Computer Science), Q2 ACOUSTICS. Pub Date: 2023-11-01. DOI: 10.1016/j.specom.2023.103005
Noora Al Roken, Gerassimos Barlas

Emotion recognition has been an active research area for decades due to the complexity of the problem and its significance in human–computer interaction. Various methods have been employed to tackle this problem, leveraging different inputs such as speech, 2D and 3D images, audio signals, and text, all of which can convey emotional information. Recently, researchers have started combining multiple modalities to enhance the accuracy of emotion classification, recognizing that different emotions may be better expressed through different input types. This paper introduces a novel Arabic audio-visual natural-emotion dataset, investigates two existing multimodal classifiers, and proposes a new classifier trained on our Arabic dataset. Our evaluation covers several aspects, including variations in visual dataset size, joint and disjoint training, single and multimodal networks, and consecutive versus overlapping segmentation. Through 5-fold cross-validation, our proposed classifier achieved strong results, with an average F1-score of 0.912 and an accuracy of 0.913 for natural emotion recognition.
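
As a reference for the evaluation protocol, the sketch below runs 5-fold cross-validation and reports the mean F1 score and accuracy; the classifier and random features are stand-ins for the proposed CNN and the audio-visual data, and macro averaging of F1 is an assumption, not stated in the abstract.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))          # placeholder fused audio-visual features
y = rng.integers(0, 6, size=200)        # six emotion labels

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
f1s, accs = [], []
for train_idx, test_idx in skf.split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[test_idx])
    f1s.append(f1_score(y[test_idx], pred, average="macro"))
    accs.append(accuracy_score(y[test_idx], pred))

print(f"mean F1 = {np.mean(f1s):.3f}, mean accuracy = {np.mean(accs):.3f}")
```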

Citations: 0
Dual-model self-regularization and fusion for domain adaptation of robust speaker verification
IF 3.2, CAS Zone 3 (Computer Science), Q2 ACOUSTICS. Pub Date: 2023-11-01. DOI: 10.1016/j.specom.2023.103001
Yibo Duan, Yanhua Long, Jiaen Liang

Learning robust representations of speaker identity is a key challenge in speaker verification, as such representations generalize well to many real-world verification scenarios with domain or intra-speaker variations. In this study, we aim to improve the well-established ECAPA-TDNN framework to enhance its domain robustness for low-resource cross-domain speaker verification tasks. Specifically, a novel dual-model self-learning approach is first proposed to produce robust speaker identity embeddings, in which the ECAPA-TDNN is extended into a dual-model structure and then trained and regularized using self-supervised learning between different intermediate acoustic representations. We then strengthen the dual model by combining the self-supervised loss and the supervised loss in a time-dependent manner, improving the model's overall generalization capability. Furthermore, to better utilize the complementary information in the dual model's outputs, we explore various methods for similarity computation and score fusion. Our experiments, conducted on the publicly available VoxCeleb2 and VoxMovies datasets, demonstrate that the proposed dual-model regularization and fusion methods outperform the strong baseline by a relative 9.07%–11.6% EER reduction across various in-domain and cross-domain evaluation sets. Importantly, our approach is effective in both supervised and unsupervised scenarios for low-resource cross-domain speaker verification tasks.
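
For reference, the helper below computes the equal error rate (EER), the metric reported above, from verification trial scores, and averages the two branch scores as one simple fusion strategy; the trial scores are hypothetical, and plain averaging is only one of the fusion methods the paper explores.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the operating point where false-accept and false-reject rates meet."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2.0

labels   = np.array([1, 1, 1, 0, 0, 0, 1, 0])               # target / non-target trials
scores_a = np.array([0.82, 0.74, 0.55, 0.30, 0.45, 0.20, 0.68, 0.52])
scores_b = np.array([0.78, 0.69, 0.61, 0.35, 0.40, 0.25, 0.71, 0.48])
fused = 0.5 * (scores_a + scores_b)                          # simple score averaging
print(f"EER = {equal_error_rate(labels, fused):.3f}")
```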

Citations: 0
Coarse-to-fine speech separation method in the time-frequency domain
IF 3.2, CAS Zone 3 (Computer Science), Q2 ACOUSTICS. Pub Date: 2023-11-01. DOI: 10.1016/j.specom.2023.103003
Xue Yang, Changchun Bao, Xianhong Chen

Although time-domain speech separation methods have exhibited outstanding performance in anechoic scenarios, their effectiveness is considerably reduced in reverberant scenarios. Compared to time-domain methods, speech separation methods in the time-frequency (T-F) domain mainly operate on structured T-F representations and have shown great potential recently. In this paper, we propose a coarse-to-fine speech separation method in the T-F domain, which involves two steps: (1) a rough separation conducted in the coarse phase and (2) a precise extraction accomplished in the refining phase. In the coarse phase, the speech signals of all speakers are initially separated in a rough manner, resulting in some level of distortion in the estimated signals. In the refining phase, the T-F representation of each estimated signal acts as a guide for extracting the residual T-F representation of the corresponding speaker, which helps reduce the distortions introduced in the coarse phase. In addition, the networks designed for the coarse and refining phases are jointly trained for superior performance. Furthermore, by using the recurrent attention with parallel branches (RAPB) block to fully exploit the contextual information contained in the whole T-F features, the proposed model achieves competitive performance on clean datasets with a small number of parameters. The proposed method also shows more robustness and achieves state-of-the-art results on more realistic datasets.
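
The core operation of T-F-domain separation is masking the mixture's spectrogram, illustrated minimally below with PyTorch's STFT; the random mask stands in for the network's estimate, and the snippet does not reproduce the two-phase coarse-to-fine model itself.

```python
import torch

mixture = torch.randn(16000)                  # 1 s of a two-speaker mixture at 16 kHz
n_fft, hop = 512, 128
window = torch.hann_window(n_fft)

spec = torch.stft(mixture, n_fft, hop_length=hop, window=window,
                  return_complex=True)        # (257 frequency bins, frames)
mask = torch.rand_like(spec.real)             # placeholder ratio mask in [0, 1]
estimate = torch.istft(spec * mask, n_fft, hop_length=hop, window=window,
                       length=mixture.shape[-1])
print(estimate.shape)                         # torch.Size([16000])
```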

Citations: 0
The Role of Auditory and Visual Cues in the Perception of Mandarin Emotional Speech in Male Drug Addicts
IF 3.2, CAS Zone 3 (Computer Science), Q2 ACOUSTICS. Pub Date: 2023-10-12. DOI: 10.1016/j.specom.2023.103000
Puyang Geng, Ningxue Fan, Rong Ling, Hong Guo, Qimeng Lu, Xingwen Chen

Evidence from previous neurological studies has revealed that drugs can cause severe damage to the human brain structure, leading to significant cognitive disorders in emotion processing, such as psychotic-like symptoms (e.g., speech illusion: reporting positive/negative responses when hearing white noise) and negative reinforcement. Due to these emotion-processing disorders, drug addicts may have difficulty recognizing emotions, which is essential for interpersonal communication and a healthy life experience, and may also be prone to speech illusions. However, previous research has yielded divergent results regarding whether drug addicts are more attracted to negative stimuli or positive stimuli. Additionally, little attention has been paid to the speech illusion experienced by drug addicts. Therefore, the current study aimed to investigate the effect of drugs on patterns of emotion recognition through two basic channels: auditory (speech) and visual (facial expression), as well as the speech illusions of drug addicts. The current study conducted a perceptual experiment in which 52 stimuli of four emotions (happy, angry, sad, and neutral) in three modalities (auditory, visual, auditory + visual [congruent & incongruent]) were presented to address Question 1 regarding the multi-modal emotional speech perception of drug addicts. Additionally, 26 stimuli of white noise and speech of three emotions in two noise conditions were presented to investigate Question 2 concerning the speech illusion of drug addicts. A total of thirty-five male drug addicts (25 heroin addicts and 10 ketamine addicts) and thirty-five male healthy controls were recruited for the perception experiment. The results, with heroin and ketamine addicts as examples, revealed that drug addicts exhibited lower accuracies in multi-modal emotional speech perception and relied more on visual cues for emotion recognition, especially when auditory and visual inputs were incongruent. Furthermore, both heroin and ketamine addicts showed a higher incidence of emotional responses when only exposed to white noise, suggesting the presence of psychotic-like symptoms (i.e., speech illusion) in drug addicts. Our results preliminarily indicate a disorder or deficit in multi-modal emotional speech processing among drug addicts, and the use of visual cues (e.g., facial expressions) may be recommended to improve their interpretation of emotional expressions. Moreover, the speech illusions experienced by drug addicts warrant greater attention and awareness. This paper not only fills the research gap in understanding multi-modal emotion processing and speech illusion in drug addicts but also contributes to a deeper understanding of the effects of drugs on human behavior and provides insights for the theoretical foundations of detoxification and speech rehabilitation for drug addicts.

Citations: 0
Classification of functional dysphonia using the tunable Q wavelet transform
IF 3.2, CAS Zone 3 (Computer Science), Q2 ACOUSTICS. Pub Date: 2023-10-06. DOI: 10.1016/j.specom.2023.102989
Kiran Reddy Mittapalle, Madhu Keerthana Yagnavajjula, Paavo Alku

Functional dysphonia (FD) refers to an abnormality in voice quality in the absence of an identifiable lesion. In this paper, we propose an approach based on the tunable Q wavelet transform (TQWT) to automatically distinguish two types of FD (hyperfunctional dysphonia and hypofunctional dysphonia) from healthy voices using the acoustic voice signal. Using the TQWT, voice signals were decomposed into sub-bands, and the entropy values extracted from the sub-bands were used as features for the studied 3-class classification problem. In addition, Mel-frequency cepstral coefficient (MFCC) and glottal features were extracted from the acoustic voice signal and the estimated glottal source signal, respectively. A convolutional neural network (CNN) classifier was trained separately for the TQWT, MFCC, and glottal features. Experiments were conducted using voice signals of 57 healthy speakers and 113 FD patients (72 with hyperfunctional dysphonia and 41 with hypofunctional dysphonia) taken from the VOICED database. These experiments revealed that the TQWT features yielded absolute improvements of 5.5% and 4.5% over the baseline MFCC features and the glottal features, respectively. Furthermore, the highest classification accuracy (67.91%) was obtained using the combination of the TQWT and glottal features, which indicates the complementary nature of these features.
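
As a rough illustration of the feature extraction side, the snippet below computes baseline MFCCs plus one entropy value per frequency band; the bands come from a plain STFT split and only stand in for the TQWT sub-bands (no TQWT implementation is assumed here), and the audio is synthetic.

```python
import numpy as np
import librosa

def band_entropy(band):
    """Shannon entropy of a sub-band's normalized energy distribution."""
    p = band.ravel().astype(float) ** 2
    p = p / (p.sum() + 1e-12)
    return float(-np.sum(p * np.log2(p + 1e-12)))

sr = 16000
y = np.random.default_rng(0).normal(size=sr).astype(np.float32)   # stand-in audio

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)                 # (13, frames)
spec = np.abs(librosa.stft(y, n_fft=512))                          # (257, frames)
edges = [0, 16, 32, 64, 128, 257]                                  # coarse band edges
entropies = [band_entropy(spec[lo:hi]) for lo, hi in zip(edges[:-1], edges[1:])]

features = np.concatenate([mfcc.mean(axis=1), entropies])          # one vector per recording
print(features.shape)                                              # (18,)
```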

Citations: 0
Graph attention-based deep embedded clustering for speaker diarization
IF 3.2, CAS Zone 3 (Computer Science), Q2 ACOUSTICS. Pub Date: 2023-10-05. DOI: 10.1016/j.specom.2023.102991
Yi Wei, Haiyan Guo, Zirui Ge, Zhen Yang

Deep speaker embedding extraction models have recently served as the cornerstone of modular speaker diarization systems. However, in current modular systems, the extracted speaker embeddings (namely, speaker features) do not effectively leverage their intrinsic relationships and, moreover, are not tailored specifically to the clustering task. In this paper, inspired by deep embedded clustering (DEC), we propose a speaker diarization method using graph attention-based deep embedded clustering (GADEC) to address these issues. First, considering the temporal nature of speech signals, when the speech signal is segmented into small segments, the speech in the current segment and its neighboring segments is likely to belong to the same speaker. This suggests that embeddings extracted from neighboring segments could help generate a more informative speaker representation for the current segment. To better describe the complex relationships between segments and leverage the local structural information among their embeddings, we construct a graph over the pre-extracted speaker embeddings of a continuous audio signal. On this basis, we introduce a graph attentional encoder (GAE) module to integrate information from neighboring nodes (i.e., neighboring segments) in the graph and learn latent speaker embeddings. Moreover, we jointly optimize the latent speaker embeddings and the clustering results within a unified framework, leading to more discriminative speaker embeddings for the clustering task. Experimental results demonstrate that our proposed GADEC-based speaker diarization system significantly outperforms the baseline systems and several other recent speaker diarization systems in terms of diarization error rate (DER) on the NIST SRE 2000 CALLHOME, AMI, and VoxConverse datasets.
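
For context, the sketch below shows the conventional clustering stage of a modular diarization system: segment-level embeddings are grouped from a cosine-similarity graph. This is the kind of baseline that GADEC refines with graph attention and joint optimization; it is not the proposed method, and the embeddings are toy data.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 192))   # three toy "speakers"
                 for c in (0.0, 1.0, -1.0)])
emb = normalize(emb)                                            # unit-length segment embeddings

affinity = np.clip(emb @ emb.T, 0.0, None)                      # non-negative cosine-similarity graph
labels = SpectralClustering(n_clusters=3, affinity="precomputed",
                            random_state=0).fit_predict(affinity)
print(labels)                                                   # segment-to-speaker assignments
```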

Citations: 0