Pub Date: 2024-02-01 | DOI: 10.1016/j.specom.2024.103047
Farhad Javanmardi, Sudarsana Reddy Kadiri, P. Alku
Title: Pre-trained models for detection and severity level classification of dysarthria from speech
Pub Date: 2024-02-01 | DOI: 10.1016/j.specom.2024.103041
Simon Stone, Peter Birkholz
German primary diphthongs are conventionally transcribed using the same symbols used for some monophthong vowels. However, if the corresponding vocal tract shapes are used for articulatory synthesis, the results often sound unnatural. Furthermore, there is no clear consensus in the literature on whether diphthongs have monophthong constituents and, if so, which ones. This study therefore analyzed a set of audio recordings from the reference speaker of the state-of-the-art articulatory synthesizer VocalTractLab to identify likely candidates for the monophthong constituents of the German primary diphthongs. We then evaluated these candidates in a listening experiment with naive listeners to determine a naturalness ranking of these candidates and specialized diphthong shapes. The results showed that the German primary diphthongs can indeed be synthesized with no significant loss in naturalness by replacing the specialized diphthong shapes for the initial and final segments with shapes also used for monophthong vowels.
Title: Monophthong vocal tract shapes are sufficient for articulatory synthesis of German primary diphthongs
Pub Date: 2024-01-13 | DOI: 10.1016/j.specom.2024.103038
Ingy Farouk Emara, Nabil Hamdy Shaker
The present study examines the impact of Arab speakers’ phonological and prosodic features on the accuracy of automatic speech recognition (ASR) of non-native English speech. The authors first investigated the perceptions of 30 Egyptian ESL teachers and 70 Egyptian university students towards the L1 (Arabic)-based errors affecting intelligibility, and then carried out a data analysis of the ASR of the students’ English speech to find out whether the investigated errors resulted in intelligibility breakdowns in an ASR setting. In terms of the phonological features of non-native speech, the results showed that the teachers gave more weight to pronunciation features of accented speech that did not actually hinder recognition, that the students were mostly oblivious to the L2 errors they made and their impact on intelligibility, and that L2 errors which were not perceived as serious by either teachers or students had negative impacts on ASR accuracy levels. With regard to the prosodic features of non-native speech, it was found that lower speech rates resulted in more accurate speech recognition, higher speech intensity led to fewer deletion errors, and voice pitch did not seem to have any impact on ASR accuracy. The study accordingly recommends training ASR systems with more non-native data to increase their accuracy, as well as paying more attention to remedying the L1-based errors of non-native speakers that are more likely to affect non-native automatic speech recognition.
Title: The impact of non-native English speakers’ phonological and prosodic features on automatic speech recognition accuracy
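The ASR accuracy levels discussed above are conventionally quantified as word error rate (WER). As a generic illustration (not code from the study), WER can be computed with a word-level Levenshtein distance:

```python
def wer(ref, hyp):
    """Word error rate: minimum word-level edits (substitutions,
    insertions, deletions) to turn hyp into ref, divided by |ref|."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = minimum edits to turn r[:i] into h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / max(len(r), 1)
```

A substitution-heavy pronunciation error and a deletion error each add one edit, which is how the deletion-error effect reported above would surface in the metric.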
Pub Date: 2024-01-02 | DOI: 10.1016/j.specom.2023.103027
Wei-Cheng Lin, Carlos Busso
Deep clustering is a popular unsupervised technique for feature representation learning. We recently proposed the chunk-based DeepEmoCluster framework for speech emotion recognition (SER), adopting the concept of deep clustering as a novel semi-supervised learning (SSL) framework; it achieved improved recognition performance over conventional reconstruction-based approaches. However, the vanilla DeepEmoCluster lacks critical sentence-level temporal information that is useful for SER tasks. This study builds upon the DeepEmoCluster framework, creating a powerful SSL approach that leverages temporal information within a sentence. We propose two sentence-level temporal modeling alternatives using either the temporal-net or the triplet loss function, resulting in a novel temporal-enhanced DeepEmoCluster framework that captures essential temporal information. The key contribution to achieving this goal is the proposed sentence-level uniform sampling strategy, which preserves the original temporal order of the data for the clustering process. An extra network module (e.g., a gated recurrent unit) is utilized for the temporal-net option to encode temporal information across the data chunks. Alternatively, we can impose additional temporal constraints by using the triplet loss function while training the DeepEmoCluster framework, which does not increase model complexity. Our experimental results on the MSP-Podcast corpus demonstrate that the proposed temporal-enhanced framework significantly outperforms the vanilla DeepEmoCluster framework and other existing SSL approaches in regression tasks for the emotional attributes arousal, dominance, and valence. The improvements are observed in both fully-supervised and SSL implementations. Further analyses validate the effectiveness of the proposed temporal modeling, showing (1) high temporal consistency in the cluster assignments, and (2) well-separated emotional patterns in the generated clusters.
Title: Deep temporal clustering features for speech emotion recognition
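The triplet-loss option imposes temporal structure without adding parameters. The underlying hinge objective, in a minimal generic sketch (our illustration, not the authors' code; in the temporal setting the anchor and positive would be chunks close in time and the negative a distant chunk):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss: pull the anchor toward the positive embedding and
    push it away from the negative embedding by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)  # squared distance to positive
    d_neg = np.sum((anchor - negative) ** 2)  # squared distance to negative
    return max(0.0, d_pos - d_neg + margin)
```

The loss is zero once the negative is at least `margin` farther from the anchor than the positive, so already well-ordered triplets contribute no gradient.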
Pub Date: 2023-12-24 | DOI: 10.1016/j.specom.2023.103028
Zhipeng Chen, Xinheng Wang, Lun Xie, Haijie Yuan, Hang Pan
Researchers have shown a growing interest in audio-driven talking head generation. The primary challenge in talking head generation is achieving audio-visual coherence between the lips and the audio, known as lip synchronization. This paper proposes a generic method, LPIPS-AttnWav2Lip, for reconstructing face images of any speaker from audio. We use a U-Net architecture with residual CBAM to better encode and fuse audio and visual information. Additionally, a semantic alignment module extends the receptive field of the generator network to efficiently obtain the spatial and channel information of the visual features, and matches the statistical information of the visual features with the audio latent vector to adjust and inject the audio content information into the visual information. To achieve exact lip synchronization and generate realistic, high-quality images, our approach adopts an LPIPS loss, which simulates human judgment of image quality and reduces the possibility of instability during training. The proposed method achieves outstanding performance in terms of lip synchronization accuracy and visual quality, as demonstrated by subjective and objective evaluation results.
Title: LPIPS-AttnWav2Lip: Generic audio-driven lip synchronization for talking head generation in the wild
Pub Date: 2023-12-14 | DOI: 10.1016/j.specom.2023.103024
Nan Li, Longbiao Wang, Meng Ge, Masashi Unoki, Sheng Li, Jianwu Dang
Deep learning has revolutionized voice activity detection (VAD) by offering promising solutions. However, directly applying traditional features, such as raw waveforms and Mel-frequency cepstral coefficients, to deep neural networks often leads to degraded VAD performance due to noise interference. In contrast, humans possess the remarkable ability to discern speech in complex and noisy environments, which motivated us to draw inspiration from the human auditory system. We propose a robust VAD algorithm called the auditory-inspired masked modulation encoder based convolutional attention network (AMME-CANet), which integrates our AMME with CANet. First, we investigate the design of auditory-inspired modulation features as a deep-learning encoder (AME), effectively simulating the transmission of sound signals to inner-ear hair cells and the subsequent modulation filtering by neural cells. Second, building upon the masking effects observed in the human auditory system, we enhance our auditory-inspired modulation encoder by incorporating a masking mechanism, resulting in the AMME. The AMME amplifies cleaner speech frequencies while suppressing noise components. Third, inspired by the human auditory mechanism and capitalizing on contextual information, we leverage the attention mechanism for VAD. This methodology uses an attention mechanism to assign higher weights to contextual information containing richer and more informative cues. Through extensive experimentation and evaluation, we demonstrate the superior performance of AMME-CANet in enhancing VAD under challenging noise conditions.
Title: Robust voice activity detection using an auditory-inspired masked modulation encoder based convolutional attention network
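The contextual weighting described above is, at its core, a softmax attention that assigns larger weights to more informative context frames. A minimal sketch under our own naming (not the paper's implementation):

```python
import numpy as np

def attention_pool(frames, query):
    """Scaled dot-product attention over context frames.

    frames: (T, D) matrix of frame features; query: (D,) vector.
    Returns the attention-weighted frame summary and the weights.
    """
    scores = frames @ query / np.sqrt(frames.shape[1])
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ frames, weights
```

Frames most similar to the query receive the largest weights, so richer contextual cues dominate the pooled representation.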
Pub Date: 2023-12-14 | DOI: 10.1016/j.specom.2023.103026
Yunqi C. Zhang, Yusuke Hioka, C.T. Justine Hui, Catherine I. Watson
Speech enhancement (SE) is a widely used technology for improving the quality and intelligibility of noisy speech. So far, SE algorithms have been designed and evaluated on native listeners only, not on non-native listeners, who are known to be more disadvantaged when listening in noisy environments. This paper investigates the performance of five widely used single-channel SE algorithms on early-immersed New Zealand English (NZE) listeners and on native Mandarin listeners with different immersion conditions in NZE, under negative input signal-to-noise ratios (SNR), by conducting a subjective listening test on NZE sentences. The performance of the SE algorithms in terms of speech intelligibility was investigated in the three participant groups. The results showed that the early-immersed group always achieved the highest intelligibility. The late-immersed group outperformed the non-immersed group at the higher input SNR conditions, possibly due to increasing familiarity with the NZE accent, whereas this advantage disappeared at the lowest tested input SNR condition. The SE algorithms tested in this study failed to improve, and instead degraded, speech intelligibility, indicating that these SE algorithms may not be able to reduce the perception gap between early-, late- and non-immersed listeners, nor improve speech intelligibility under negative input SNR in general. These findings have implications for the future development of SE algorithms tailored to Mandarin listeners, and for understanding the impact of language immersion on speech perception in noise.
Title: Performance of single-channel speech enhancement algorithms on Mandarin listeners with different immersion conditions in New Zealand English
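"Negative input SNR" above means the noise power exceeds the speech power. For reference, the standard definition (a generic sketch, not tied to the study's materials):

```python
import numpy as np

def snr_db(speech, noise):
    """Signal-to-noise ratio in dB: 10 * log10(speech power / noise power).

    Negative values mean the noise is stronger than the speech.
    """
    p_speech = np.mean(np.asarray(speech, dtype=float) ** 2)
    p_noise = np.mean(np.asarray(noise, dtype=float) ** 2)
    return 10.0 * np.log10(p_speech / p_noise)
```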
Pub Date: 2023-12-12 | DOI: 10.1016/j.specom.2023.103025
Stefano Bannò, Marco Matassoni
In an interconnected world where English has become the lingua franca of culture, entertainment, business, and academia, the growing demand for learning English as a second language (L2) has led to an increasing interest in automatic approaches for assessing spoken language proficiency. In this regard, mastering grammar is one of the key elements of L2 proficiency.
In this paper, we illustrate an approach to L2 proficiency assessment and feedback based on grammatical features using only publicly available data for training and a small proprietary dataset for testing. Specifically, we implement it in a cascaded fashion, starting from learners’ utterances, investigating disfluency detection, exploring spoken grammatical error correction (GEC), and finally using grammatical features extracted with the spoken GEC module for proficiency assessment.
We compare this grading system to a BERT-based grader and find that the two systems have similar performance when using manual transcriptions, but their combination brings significant improvements to the assessment performance and enhances validity and explainability. When using automatic transcriptions, however, the GEC-based grader obtains better results than the BERT-based grader.
The results obtained are discussed and evaluated with appropriate metrics across the proposed pipeline.
Title: Back to grammar: Using grammatical error correction to automatically assess L2 speaking proficiency
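One simple way grammatical features can be derived from a GEC module is to count the word-level edits between a learner's utterance and its corrected form; fewer edits per word suggests higher grammatical accuracy. This is a hypothetical illustration of the idea (the names and the specific feature are ours, not the authors'):

```python
import difflib

def edit_count(learner, corrected):
    """Number of word-level insert/delete/replace operations between
    a learner utterance and its GEC-corrected version."""
    sm = difflib.SequenceMatcher(a=learner.split(), b=corrected.split())
    return sum(1 for tag, *_ in sm.get_opcodes() if tag != "equal")

def error_rate(learner, corrected):
    # Hypothetical proficiency feature: edit operations per word spoken.
    return edit_count(learner, corrected) / max(len(learner.split()), 1)
```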
Pub Date: 2023-12-04 | DOI: 10.1016/j.specom.2023.103023
Sven Kachel, Adrian P. Simpson, Melanie C. Steffens
Since the early days of (phonetic) convergence research, one of the main questions has been which individuals are more likely to adapt their speech to others. Differences between women and men, especially, have been researched intensively. Using a differential approach as well, we complement the existing literature by focusing on another gender-related characteristic, namely sexual orientation. The present study investigates whether and how women differing in sexual orientation vary in their speaking behavior, especially mean fundamental frequency (f0), in the presence of a female vs. a male experimenter. Lesbian (n = 19) and straight female speakers (n = 18) each engaged in two interactions: first with either a female or a male experimenter, and then with an experimenter of the other gender (counter-balanced, with random assignment to conditions). For each interaction, recordings of read and spontaneous speech were collected. Analyses of read speech demonstrated mirroring of the first experimenter’s mean f0, which persisted even in the presence of the second experimenter. In spontaneous speech, this order effect interacted with exclusiveness of sexual orientation: mirroring was found for participants who reported being exclusively lesbian/straight, but not for those who reported being mainly lesbian/straight. We discuss implications for studies on convergence and for research practice in general.
Title: Speakers’ vocal expression of sexual orientation depends on experimenter gender
Voice conversion systems have become increasingly important as the use of voice technology grows. Deep learning techniques, specifically generative adversarial networks (GANs), have enabled significant progress in the creation of synthetic media, including the field of speech synthesis. One of the most recent examples, StarGAN-VC, uses a single pair of generator and discriminator to convert voices between multiple speakers. However, the training stability of GANs can be an issue. The Top-K methodology, which trains the generator using only the best K generated samples that “fool” the discriminator, has been applied to image tasks and simple GAN architectures. In this work, we demonstrate that the Top-K methodology can improve the quality and stability of converted voices in a state-of-the-art voice conversion system like StarGAN-VC. We also explore the optimal time to introduce the Top-K methodology and how to reduce the value of K during training. Through both quantitative and qualitative studies, it was found that the Top-K methodology leads to quicker convergence and better conversion quality compared to regular (vanilla) training. In addition, human listeners perceived the samples generated using Top-K as more natural and were more likely to believe that they were produced by a human speaker. These results demonstrate that the Top-K methodology can effectively improve the performance of deep learning-based voice conversion systems.
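The Top-K generator update described in this abstract can be illustrated in a few lines. The following is a minimal NumPy sketch (not the authors' code), assuming a discriminator that outputs probabilities and a non-saturating generator loss; only the K generated samples the discriminator finds most convincing contribute to the update:

```python
import numpy as np

def top_k_generator_loss(disc_scores, k):
    """Top-K GAN training: keep only the k generated samples the
    discriminator rates as most 'real', and compute the generator's
    non-saturating loss -log D(G(z)) on those samples alone.

    disc_scores: discriminator probabilities D(G(z)) for one batch,
    shape (batch,); higher means a more convincing fake.
    """
    # indices of the k highest-scoring (most convincing) fakes
    top_idx = np.argsort(disc_scores)[-k:]
    top_scores = disc_scores[top_idx]
    # average the non-saturating loss over the top-k samples only
    return -np.mean(np.log(top_scores + 1e-12))

# toy batch of discriminator outputs for 6 generated samples
scores = np.array([0.1, 0.9, 0.4, 0.8, 0.2, 0.7])
loss_all = -np.mean(np.log(scores + 1e-12))    # vanilla: all samples
loss_topk = top_k_generator_loss(scores, k=3)  # Top-K: best 3 only
```

In the original Top-K formulation for image GANs, K typically starts at the full batch size and is reduced (annealed) over the course of training, which is the schedule question the abstract refers to.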
{"title":"Choosing only the best voice imitators: Top-K many-to-many voice conversion with StarGAN","authors":"Claudio Fernandez-Martín , Adrian Colomer , Claudio Panariello , Valery Naranjo","doi":"10.1016/j.specom.2023.103022","DOIUrl":"https://doi.org/10.1016/j.specom.2023.103022","url":null,"abstract":"<div><p>Voice conversion systems have become increasingly important as the use of voice technology grows. Deep learning techniques, specifically generative adversarial networks (GANs), have enabled significant progress in the creation of synthetic media, including the field of speech synthesis. One of the most recent examples, StarGAN-VC, uses a single pair of generator and discriminator to convert voices between multiple speakers. However, the training stability of GANs can be an issue. The Top-K methodology, which trains the generator using only the best <span><math><mi>K</mi></math></span> generated samples that “fool” the discriminator, has been applied to image tasks and simple GAN architectures. In this work, we demonstrate that the Top-K methodology can improve the quality and stability of converted voices in a state-of-the-art voice conversion system like StarGAN-VC. We also explore the optimal time to implement the Top-K methodology and how to reduce the value of <span><math><mi>K</mi></math></span> during training. Through both quantitative and qualitative studies, it was found that the Top-K methodology leads to quicker convergence and better conversion quality compared to regular or vanilla training. In addition, human listeners perceived the samples generated using Top-K as more natural and were more likely to believe that they were produced by a human speaker. 
The results of this study demonstrate that the Top-K methodology can effectively improve the performance of deep learning-based voice conversion systems.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2023-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639323001565/pdfft?md5=74a68a8324a3af4dc4558e4166e99f23&pid=1-s2.0-S0167639323001565-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138474840","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}