
Speech Communication: Latest Publications

The effects of informational and energetic/modulation masking on the efficiency and ease of speech communication across the lifespan
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2024-07-01, DOI: 10.1016/j.specom.2024.103101
Outi Tuomainen, Stuart Rosen, Linda Taschenberger, Valerie Hazan

Children and older adults have greater difficulty understanding speech when there are other voices in the background (informational masking, IM) than when the interference is a steady-state noise with a similar spectral profile but is not speech (due to modulation and energetic masking; EM/MM). We evaluated whether this IM vs. EM/MM difference for certain age ranges was found for broader measures of communication efficiency and ease in 114 participants aged between 8 and 80. Participants carried out interactive diapix problem-solving tasks in age-band- and sex-matched pairs, in quiet and with different maskers in the background affecting both participants. Three measures were taken: (a) task transaction time (communication efficiency), (b) performance on a secondary auditory task simultaneously carried out during diapix, and (c) post-test subjective ratings of effort, concentration, difficulty and noisiness (communication ease). Although participants did not take longer to complete the task when in challenging conditions, effects of IM vs. EM/MM were clearly seen on the other measures. Relative to the EM/MM and quiet conditions, participants in IM conditions were less able to attend to the secondary task and reported greater effects of the masker type on their perceived degree of effort, concentration, difficulty and noisiness. However, we found no evidence of decreased communication efficiency and ease in IM relative to EM/MM for children and older adults in any of our measures. The clearest effects of age were observed in transaction time and secondary task measures. Overall, communication efficiency gradually improved between the ages of 8 and 18 years, and performance on the secondary task improved over younger ages (until 30 years) and gradually decreased after 50 years of age. Finally, we also found an impact of communicative role on performance. In adults, the participant who was asked to take the lead in the task and who spoke the most performed worse on the secondary task than the person who was mainly in a ‘listening’ role and responding to queries. These results suggest that when a broader evaluation of speech communication is carried out that more closely resembles typical communicative situations, the more acute effects of IM typically seen in populations at the extremes of the lifespan are minimised, potentially due to the presence of multiple information sources, which allow the use of varying communication strategies. Such a finding is relevant for clinical evaluations of speech communication.

Citations: 0
Pathological voice classification using MEEL features and SVM-TabNet model
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2024-07-01, DOI: 10.1016/j.specom.2024.103100
Mohammed Zakariah, Muna Al-Razgan, Taha Alfakih

In clinical settings, early diagnosis and objective assessment depend on the detection of voice pathology. To classify anomalous voices, this work uses an approach that combines the SVM-TabNet fusion model with MEEL (Mel-Frequency Energy Line) features. Further, the dataset consists of 1037 speech files, including recordings from people with laryngocele and Vox senilis as well as from healthy persons. Additionally, the main goal is to create an efficient classification model that can differentiate between normal and abnormal voice patterns. Modern techniques frequently lack the accuracy required for a precise diagnosis, which highlights the need for novel strategies. The suggested approach uses an SVM-TabNet fusion model for classification after feature extraction using MEEL characteristics. MEEL features provide extensive information for categorization by capturing complex patterns in audio transmissions. Moreover, by combining the advantages of SVM and TabNet models, classification performance is improved. Evaluation on the test data yields remarkable results: 99.7 % accuracy, 0.992 F1 score, 0.996 precision, and 0.995 recall. Further testing on additional datasets reliably validates outstanding performance, with 99.4 % accuracy, 0.99 F1 score, 0.998 precision, and 0.989 recall. Furthermore, using the Saarbruecken Voice Database (SVD), the suggested methodology achieves an impressive accuracy of 99.97 %, demonstrating its durability and generalizability across many datasets. Overall, this work shows how the SVM-TabNet fusion model with MEEL characteristics may be used to accurately and consistently classify diseased voices, providing encouraging opportunities for clinical diagnosis and therapy tracking.
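
As a rough illustration of the pipeline the abstract describes, the sketch below extracts per-band log-Mel energy statistics as a stand-in for the MEEL features (the paper's exact feature definition is not reproduced here) and late-fuses an SVM with a second tabular classifier; a gradient-boosting model is used as a placeholder for TabNet. All names, parameters and the fusion rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier

def extract_meel_like(path, sr=16000, n_mels=40):
    """Per-band log-Mel energy statistics (a stand-in for MEEL features)."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    logmel = librosa.power_to_db(mel)                        # (n_mels, n_frames)
    return np.concatenate([logmel.mean(axis=1), logmel.std(axis=1)])

def fit_fusion(X, y):
    svm = SVC(probability=True).fit(X, y)                    # margin-based classifier
    tab = GradientBoostingClassifier().fit(X, y)             # placeholder for TabNet
    return svm, tab

def predict_fusion(models, X):
    svm, tab = models
    proba = 0.5 * svm.predict_proba(X) + 0.5 * tab.predict_proba(X)  # late fusion
    return proba.argmax(axis=1)
```

Averaging the two models' class probabilities is one simple fusion rule; the actual fusion strategy in the paper may differ.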

Citations: 0
Analyzing the influence of different speech data corpora and speech features on speech emotion recognition: A review
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2024-07-01, DOI: 10.1016/j.specom.2024.103102

Emotion recognition from speech has become crucial in human-computer interaction and affective computing applications. This review paper examines the complex relationship between two critical factors for speech emotion classification accuracy: the selection of speech data corpora and the extraction of speech features. Through an extensive analysis of literature from 2014 to 2023, publicly available speech datasets are explored and categorized based on their diversity, scale, linguistic attributes, and emotional classifications. The importance of various speech features, from basic spectral features to sophisticated prosodic cues, and their influence on emotion recognition accuracy is analyzed. In the context of speech data corpora, this review paper unveils trends and insights from comparative studies exploring the repercussions of dataset choice on recognition efficacy. Various datasets such as IEMOCAP, EMODB, and MSP-IMPROV are scrutinized in terms of their influence on the classification accuracy of speech emotion recognition (SER) systems. At the same time, potential challenges associated with dataset limitations are also examined. Notable features like Mel-frequency cepstral coefficients, pitch, intensity, and prosodic patterns are evaluated for their contributions to emotion recognition. Advanced feature extraction methods, too, are explored for their potential to capture intricate emotional dynamics. Moreover, this review paper offers insights into the methodological aspects of emotion recognition, shedding light on the diverse machine learning and deep learning approaches employed. Through a holistic synthesis of research findings, this review paper observes connections between the choice of speech data corpus, selection of speech features, and resulting emotion recognition accuracy. As the field continues to evolve, avenues for future research are proposed, ranging from enhanced feature extraction techniques to the development of standardized benchmark datasets. In essence, this review serves as a compass guiding researchers and practitioners through the intricate landscape of speech emotion recognition, offering a nuanced understanding of the factors shaping speech emotion recognition accuracy.
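
To make the feature families discussed in the review concrete, here is a minimal, hedged sketch of extracting MFCCs, pitch (F0) and intensity with librosa and pooling them into a fixed-length utterance vector. The specific statistics and parameter values are illustrative choices, not prescriptions from the review.

```python
import numpy as np
import librosa

def ser_features(path, sr=16000):
    """Utterance-level MFCC, pitch and intensity statistics for SER experiments."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)             # spectral envelope
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)
    rms = librosa.feature.rms(y=y)[0]                              # intensity proxy
    # Pool frame-level features into one fixed-length vector per utterance.
    return np.hstack([mfcc.mean(axis=1), mfcc.std(axis=1),
                      np.nanmean(f0), np.nanstd(f0),               # F0 stats (NaN = unvoiced)
                      rms.mean(), rms.std()])
```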

Citations: 0
Emotions recognition in audio signals using an extension of the latent block model
IF 3.2, CAS Tier 3 (Computer Science), Q1 Arts and Humanities, Pub Date: 2024-06-01, DOI: 10.1016/j.specom.2024.103092
Abir El Haj

Emotion detection in human speech is a significant area of research, crucial for various applications such as affective computing and human–computer interaction. Despite advancements, accurately categorizing emotional states in speech remains challenging due to its subjective nature and the complexity of human emotions. To address this, we propose leveraging Mel frequency cepstral coefficients (MFCCS) and extend the latent block model (LBM) probabilistic clustering technique with a Gaussian multi-way latent block model (GMWLBM). Our objective is to categorize speech emotions into coherent groups based on the emotional states conveyed by speakers. We employ MFCCS from time-series audio data and utilize a variational Expectation Maximization method to estimate GMWLBM parameters. Additionally, we introduce an integrated Classification Likelihood (ICL) model selection criterion to determine the optimal number of clusters, enhancing robustness. Numerical experiments on real data from the Berlin Database of Emotional Speech (EMO-DB) demonstrate our method’s efficacy in accurately detecting and classifying emotional states in human speech, even in challenging real-world scenarios, thereby contributing significantly to affective computing and human–computer interaction applications.
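
The multi-way latent block model itself is not available as an off-the-shelf package, but the model-selection idea can be sketched with a simpler stand-in: fit Gaussian mixtures to MFCC feature vectors over a range of cluster counts and keep the solution that minimises an ICL-style criterion (BIC plus a classification-entropy penalty). The code below is that simplified stand-in, not the paper's GMWLBM or its variational EM.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def icl_select(X, k_range=range(2, 9), seed=0):
    """Return (best_k, fitted_model) under an ICL-like criterion (lower is better)."""
    best_k, best_icl, best_model = None, np.inf, None
    for k in k_range:
        gmm = GaussianMixture(n_components=k, random_state=seed).fit(X)
        tau = gmm.predict_proba(X)                        # soft cluster responsibilities
        entropy = -np.sum(tau * np.log(tau + 1e-12))      # classification entropy
        icl = gmm.bic(X) + 2.0 * entropy                  # BIC plus entropy penalty
        if icl < best_icl:
            best_k, best_icl, best_model = k, icl, gmm
    return best_k, best_model
```

The entropy term rewards well-separated clusters, which is the usual motivation for preferring ICL over plain BIC when the goal is classification rather than density estimation.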

Citations: 0
Summary of the DISPLACE challenge 2023 - DIarization of SPeaker and LAnguage in Conversational Environments
IF 3.2, CAS Tier 3 (Computer Science), Q1 Arts and Humanities, Pub Date: 2024-05-11, DOI: 10.1016/j.specom.2024.103080
Shikha Baghel, Shreyas Ramoji, Somil Jain, Pratik Roy Chowdhuri, Prachi Singh, Deepu Vijayasenan, Sriram Ganapathy

In multi-lingual societies, where multiple languages are spoken in a small geographic vicinity, informal conversations often involve a mix of languages. Existing speech technologies may be inefficient in extracting information from such conversations, where the speech data is rich in diversity with multiple languages and speakers. The DISPLACE (DIarization of SPeaker and LAnguage in Conversational Environments) challenge constitutes an open call for evaluating and benchmarking the speaker and language diarization technologies on this challenging condition. To facilitate this challenge, a real-world dataset featuring multilingual, multi-speaker conversational far-field speech was recorded and distributed. The challenge entailed two tracks: Track-1 focused on speaker diarization (SD) in multilingual situations, while Track-2 addressed language diarization (LD) in a multi-speaker scenario. Both the tracks were evaluated using the same underlying audio data. Furthermore, a baseline system was made available for both the SD and LD tasks, which mimicked the state of the art in these tasks. The challenge garnered a total of 42 world-wide registrations and received 19 combined submissions for Track-1 and Track-2. This paper describes the challenge, details of the datasets, tasks, and the baseline system. Additionally, the paper provides a concise overview of the submitted systems in both tracks, with an emphasis given to the top performing systems. The paper also presents insights and future perspectives for SD and LD tasks, focusing on the key challenges that the systems need to overcome before wide-spread commercial deployment on such conversations.
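
Speaker diarization systems such as those submitted to Track-1 are commonly scored with the diarization error rate (DER). A minimal example of computing DER, assuming the pyannote.metrics toolkit (which handles the optimal mapping between reference and system speaker labels internally), might look like the sketch below; the segments are invented purely for illustration.

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

reference, hypothesis = Annotation(), Annotation()
reference[Segment(0.0, 4.0)] = "spk_A"      # ground-truth speaker turns
reference[Segment(4.0, 9.0)] = "spk_B"
hypothesis[Segment(0.0, 4.5)] = "s1"        # system output with its own labels
hypothesis[Segment(4.5, 9.0)] = "s2"

der = DiarizationErrorRate()
print(f"DER = {der(reference, hypothesis):.3f}")   # miss + false alarm + confusion
```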

Citations: 0
End-to-end integration of speech separation and voice activity detection for low-latency diarization of telephone conversations
IF 3.2, CAS Tier 3 (Computer Science), Q1 Arts and Humanities, Pub Date: 2024-05-11, DOI: 10.1016/j.specom.2024.103081
Giovanni Morrone, Samuele Cornell, Luca Serafini, Enrico Zovato, Alessio Brutti, Stefano Squartini

Recent works show that speech separation guided diarization (SSGD) is an increasingly promising direction, mainly thanks to the recent progress in speech separation. It performs diarization by first separating the speakers and then applying voice activity detection (VAD) on each separated stream. In this work we conduct an in-depth study of SSGD in the conversational telephone speech (CTS) domain, focusing mainly on low-latency streaming diarization applications. We consider three state-of-the-art speech separation (SSep) algorithms and study their performance both in online and offline scenarios, considering non-causal and causal implementations as well as continuous SSep (CSS) windowed inference. We compare different SSGD algorithms on two widely used CTS datasets: CALLHOME and Fisher Corpus (Part 1 and 2), and evaluate both separation and diarization performance. To improve performance, a novel, causal and computationally efficient leakage removal algorithm is proposed, which significantly decreases false alarms. We also explore, for the first time, fully end-to-end SSGD integration between SSep and VAD modules. Crucially, this enables fine-tuning on real-world data for which oracle speaker sources are not available. In particular, our best model achieves 8.8% DER on CALLHOME, which outperforms the current state-of-the-art end-to-end neural diarization model, despite being trained on an order of magnitude less data and having significantly lower latency, i.e., 0.1 vs. 1 s. Finally, we also show that the separated signals can be readily used for automatic speech recognition, reaching performance close to using oracle sources in some configurations.
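
A schematic sketch of the SSGD idea described above: take the streams produced by a separation front-end (assumed pre-computed here), apply a simple energy-based VAD to each stream, and collect per-speaker segments. The cross-stream energy check is only a crude stand-in for the paper's leakage-removal algorithm, and all thresholds are illustrative assumptions.

```python
import numpy as np

def energy_vad(stream, sr, win=0.04, thr_db=-40.0):
    """Frame-level VAD: active where frame RMS is within thr_db of the peak frame."""
    hop = int(win * sr)
    n_frames = len(stream) // hop
    frames = stream[: n_frames * hop].reshape(n_frames, hop)
    rms = np.sqrt((frames ** 2).mean(axis=1) + 1e-12)
    db = 20 * np.log10(rms / (rms.max() + 1e-12))
    return db > thr_db

def ssgd_segments(streams, sr, win=0.04, leak_ratio=0.2):
    """streams: separated waveforms, one per estimated speaker, equal length."""
    hop = int(win * sr)
    segments = []
    for spk, stream in enumerate(streams):
        active = energy_vad(stream, sr, win)
        for i, is_active in enumerate(active):
            if not is_active:
                continue
            sl = slice(i * hop, (i + 1) * hop)
            # Crude leakage check (assumption): skip frames where this stream is
            # much weaker than the strongest stream in the same time window.
            frame_energy = np.abs(stream[sl]).mean()
            max_energy = max(np.abs(s[sl]).mean() for s in streams)
            if frame_energy < leak_ratio * max_energy:
                continue
            segments.append((i * win, (i + 1) * win, spk))
    return sorted(segments)
```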

Citations: 0
Visual-articulatory cues facilitate children with CIs to better perceive Mandarin tones in sentences
IF 3.2, CAS Tier 3 (Computer Science), Q1 Arts and Humanities, Pub Date: 2024-05-01, DOI: 10.1016/j.specom.2024.103084
Ping Tang, Shanpeng Li, Yanan Shen, Qianxi Yu, Yan Feng

Children with cochlear implants (CIs) face challenges in tonal perception under noise. Nevertheless, our previous research demonstrated that seeing visual-articulatory cues (speakers’ facial/head movements) helped these children perceive isolated tones better, particularly in noisy environments, with those implanted earlier gaining more benefits. However, tones in daily speech typically occur in sentence contexts where visual cues are largely reduced compared to those in isolated contexts. It was thus unclear if visual benefits on tonal perception still held in these challenging sentence contexts. Therefore, this study tested 64 children with CIs and 64 age-matched normal-hearing (NH) children. Target tones in sentence-medial position were presented in audio-only (AO) or audiovisual (AV) conditions, in quiet and noisy environments. Children selected the target tone using a picture-point task. The results showed that, while NH children did not show any perception difference between AO and AV conditions, children with CIs significantly improved their perceptual accuracy from AO to AV conditions. The degree of improvement was negatively correlated with their implantation ages. Therefore, children with CIs were able to use visual-articulatory cues to facilitate their tonal perception even in sentence contexts, and earlier auditory experience might be important in shaping this ability.

Citations: 0
The prosody of theme, rheme and focus in Egyptian Arabic: A quantitative investigation of tunes, configurations and speaker variability
IF 3.2, CAS Tier 3 (Computer Science), Q1 Arts and Humanities, Pub Date: 2024-05-01, DOI: 10.1016/j.specom.2024.103082
Dina El Zarka, Anneliese Kelterer, Michele Gubian, Barbara Schuppler

This paper investigates the prosody of sentences elicited in three Information Structure (IS) conditions: all-new, theme-rheme and rhematic focus-background. The sentences were produced by 18 speakers of Egyptian Arabic (EA). This is the first quantitative study to provide a comprehensive analysis of holistic f0 contours (by means of GAMM) and configurations of f0, duration and intensity (by means of FPCA) associated with the three IS conditions, both across and within speakers. A significant difference between focus-background and the other information structure conditions was found, but also strong inter-speaker variation in terms of strategies and the degree to which these strategies were applied. The results suggest that post-focus register lowering and the duration of the stressed syllables of the focused and the utterance-final word are more consistent cues to focus than a higher peak of the focus accent. In addition, some independence of duration and intensity from f0 could be identified. These results thus support the assumption that, when focus is marked prosodically in EA, it is marked by prominence. Nevertheless, the fact that a considerable number of EA speakers did not apply prosodic marking and the fact that prosodic focus marking was gradient rather than categorical suggest that EA does not have a fully conventionalized prosodic focus construction.

Citations: 0
Factorized and progressive knowledge distillation for CTC-based ASR models
IF 3.2, CAS Tier 3 (Computer Science), Q1 Arts and Humanities, Pub Date: 2024-05-01, DOI: 10.1016/j.specom.2024.103071
Sanli Tian, Zehan Li, Zhaobiao Lyv, Gaofeng Cheng, Qing Xiao, Ta Li, Qingwei Zhao

Knowledge distillation (KD) is a popular model compression method to improve the performance of lightweight models by transferring knowledge from a teacher model to a student model. However, applying KD to connectionist temporal classification (CTC) ASR models is challenging due to their peaky posterior property. In this paper, we propose to address this issue by treating non-blank and blank frames differently for two main reasons. First, the non-blank frames in the teacher model’s posterior matrix and hidden representations provide more acoustic and linguistic information than the blank frames, but non-blank frames account for only a small fraction of all frames, leading to a severe learning imbalance problem. Second, the non-blank tokens in the teacher’s blank-frame posteriors exhibit irregular probability distributions, negatively impacting the student model’s learning. Thus, we propose to factorize the distillation of non-blank and blank frames and further combine them into a progressive KD framework, which contains three incremental stages to facilitate the student model gradually building up its knowledge. The first stage involves a simple binary classification KD task, in which the student learns to distinguish between non-blank and blank frames, as the two types of frames are learned separately in subsequent stages. The second stage is a factorized representation-based KD, in which hidden representations are divided into non-blank and blank frames so that both can be distilled in a balanced manner. In the third stage, the student learns from the teacher’s posterior matrix through our proposed method, factorized KL-divergence (FKL), which performs different operations on blank and non-blank frame posteriors to alleviate the imbalance issue and reduce the influence of irregular probability distributions. Compared to the baseline, our proposed method achieves 22.5% relative CER reduction on the Aishell-1 dataset, 23.0% relative WER reduction on the Tedlium-2 dataset, and 17.6% relative WER reduction on the LibriSpeech dataset. To show the generalization of our method, we also evaluate it on the hybrid CTC/Attention architecture as well as on scenarios with cross-model topology KD.
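
The factorised treatment of blank and non-blank frames can be sketched as a frame-level distillation loss in PyTorch: split frames according to whether the teacher's top token is the blank symbol and weight the two KL terms separately. The formulation below (mask definition, temperature, weights) is an assumption in the spirit of the paper, not the authors' exact FKL.

```python
import torch
import torch.nn.functional as F

def factorized_kd_loss(student_logits, teacher_logits, blank_id=0,
                       w_nonblank=1.0, w_blank=0.1, T=2.0):
    """logits: (batch, time, vocab). Returns a scalar distillation loss."""
    t_prob = F.softmax(teacher_logits / T, dim=-1)
    s_logp = F.log_softmax(student_logits / T, dim=-1)
    # A frame counts as "blank" when the teacher's top token is the blank symbol.
    blank_mask = t_prob.argmax(dim=-1).eq(blank_id)                # (batch, time)
    kl = F.kl_div(s_logp, t_prob, reduction="none").sum(-1)        # per-frame KL
    loss_blank = kl[blank_mask].mean() if blank_mask.any() else kl.new_zeros(())
    loss_nonblank = kl[~blank_mask].mean() if (~blank_mask).any() else kl.new_zeros(())
    # Averaging each group separately rebalances the scarce non-blank frames.
    return w_nonblank * loss_nonblank + w_blank * loss_blank
```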

Citations: 0
Optimization-based planning of speech articulation using general Tau Theory
IF 3.2, CAS Tier 3 (Computer Science), Q1 Arts and Humanities, Pub Date: 2024-05-01, DOI: 10.1016/j.specom.2024.103083
Benjamin Elie, Juraj Šimko, Alice Turk

This paper presents a model of speech articulation planning and generation based on General Tau Theory and Optimal Control Theory. Because General Tau Theory assumes that articulatory targets are always reached, the model accounts for speech variation via context-dependent articulatory targets. Targets are chosen via the optimization of a composite objective function. This function models three different task requirements: maximal intelligibility, minimal articulatory effort and minimal utterance duration. The paper shows that systematic phonetic variability can be reproduced by adjusting the weights assigned to each task requirement. Weights can be adjusted globally to simulate different speech styles, and can be adjusted locally to simulate different levels of prosodic prominence. The solution of the optimization procedure contains Tau equation parameter values for each articulatory movement, namely position of the articulator at the movement offset, movement duration, and a parameter which relates to the shape of the movement’s velocity profile. The paper presents simulations which illustrate the ability of the model to predict or reproduce several well-known characteristics of speech. These phenomena include close-to-symmetric velocity profiles for articulatory movement, variation related to speech rate, centralization of unstressed vowels, lengthening of stressed vowels, lenition of unstressed lingual stop consonants, and coarticulation of stop consonants.
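
The composite objective can be illustrated with a toy one-dimensional planner: choose a context-dependent articulatory target and movement duration by minimising a weighted sum of intelligibility, effort and duration costs. The cost terms, parameters and weights below are assumed, simplified forms for illustration, not the paper's model.

```python
import numpy as np
from scipy.optimize import minimize

def plan_movement(x_start, x_canonical, w_intel=1.0, w_effort=0.2, w_dur=0.1):
    """Return (context-dependent target, movement duration in s) for one articulator."""
    def cost(params):
        x_target, duration = params
        intel = (x_target - x_canonical) ** 2                     # undershoot penalty
        effort = ((x_target - x_start) / max(duration, 1e-3)) ** 2  # velocity proxy
        return w_intel * intel + w_effort * effort + w_dur * duration
    res = minimize(cost, x0=[x_canonical, 0.15],
                   bounds=[(None, None), (0.05, 0.6)])
    return res.x

# Raising the effort/duration weights yields more reduced, "casual" targets;
# raising the intelligibility weight mimics hyperarticulated, prominent speech.
```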

Citations: 0