Pub Date : 2025-12-13 | DOI: 10.1016/j.specom.2025.103345
Jia Ying
This study investigates articulatory-acoustic relationships in Australian English /l/ using simultaneous 3D electromagnetic articulography (EMA) and acoustic recordings from six speakers producing /l/ in onset and coda positions with /æ/ and /ɪ/ vowels. Linear mixed-effects models revealed significant relationships between tongue lateralization and all three formants, with F3 emerging as the primary acoustic correlate of lateralization (β = 0.081, p < 0.001). Acoustic properties of /l/ were strongly influenced by vowel context, with significant vowel-lateralization interactions for F1 and F2, indicating that the acoustic consequences of lateralization vary by vowel environment. Temporal analysis revealed position-dependent timing relationships: F3 preceded articulatory peaks in coda position but showed near-synchronous timing in onset position, while F1 and F2 consistently lagged behind articulatory peaks across all conditions. These findings suggest distinct articulatory-acoustic coupling mechanisms for onset versus coda /l/, with F3 serving as an anticipatory cue in coda position. The results highlight the complex, context-dependent nature of /l/'s articulatory-acoustic relationships and underscore the importance of considering both spectral and temporal dimensions in understanding liquid consonant production.
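For readers who want to see what such an analysis looks like in practice, the following is a minimal sketch of a linear mixed-effects specification of the kind described above, written in Python with statsmodels: formant values modelled from lateralization, vowel, and syllable position, with speakers as random intercepts. The data frame is synthetic and the column names are illustrative assumptions, not the study's actual variables or code.

```python
# Hedged sketch: a mixed-effects model of the type described in the abstract.
# Data are synthetic stand-ins; column names are assumptions for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 240  # stand-in tokens: 6 speakers x 2 vowels x 2 positions x 10 repetitions
df = pd.DataFrame({
    "speaker": np.repeat([f"S{i}" for i in range(1, 7)], 40),
    "vowel": np.tile(np.repeat(["ae", "I"], 20), 6),
    "position": np.tile(["onset", "coda"], 120),
    "lateralization": rng.normal(0, 1, n),
})
# toy relationship between lateralization and F3, with noise
df["F3"] = 2600 + 80 * df["lateralization"] + rng.normal(0, 60, n)

# fixed effects with a vowel-by-lateralization interaction,
# random intercept per speaker
model = smf.mixedlm("F3 ~ lateralization * vowel + position",
                    data=df, groups=df["speaker"])
print(model.fit().summary())
```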
{"title":"Lateral channel dynamics and F3 modulation: Quantifying para-sagittal articulation in Australian English /l/","authors":"Jia Ying","doi":"10.1016/j.specom.2025.103345","DOIUrl":"10.1016/j.specom.2025.103345","url":null,"abstract":"<div><div>This study investigates articulatory-acoustic relationships in Australian English /l/ using simultaneous 3D electromagnetic articulography (EMA) and acoustic recordings from six speakers producing /l/ in onset and coda positions with /æ/ and /ɪ/ vowels. Linear mixed-effects models revealed significant relationships between tongue lateralization and all three formants, with F3 emerging as the primary acoustic correlate of lateralization (β = 0.081, p < 0.001). Acoustic properties of /l/ were strongly influenced by vowel context, with significant vowel-lateralization interactions for F1 and F2, indicating that the acoustic consequences of lateralization vary by vowel environment. Temporal analysis revealed position-dependent timing relationships: F3 preceded articulatory peaks in coda position but showed near-synchronous timing in onset position, while F1 and F2 consistently lagged behind articulatory peaks across all conditions. These findings suggest distinct articulatory-acoustic coupling mechanisms for onset versus coda /l/, with F3 serving as an anticipatory cue in coda position. The results highlight the complex, context-dependent nature of /l/'s articulatory-acoustic relationships and underscore the importance of considering both spectral and temporal dimensions in understanding liquid consonant production.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"176 ","pages":"Article 103345"},"PeriodicalIF":3.0,"publicationDate":"2025-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145790254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-09 | DOI: 10.1016/j.specom.2025.103342
Himashi Rathnayake , Jesin James , Gianna Leoni , Ake Nicholas , Catherine Watson , Peter Keegan
Speech emotion recognition (SER) is an emerging field in human–computer interaction. Although numerous studies have focused on SER for well-resourced languages, the literature reveals a significant gap in research on low-resource and Indigenous (LRI) languages. This paper presents a comprehensive review of the existing literature on SER in the context of LRI languages, analysing critical factors to consider at each stage of designing an SER system. The review indicates that most studies on SER for LRI languages adopt emotion categories established for well-resourced languages, often assuming the universality of emotions. However, the literature suggests that this approach may be limited due to emotional disparities influenced by cultural variations. Additionally, the review underscores that current SER systems typically lack community-oriented methodologies in the development of technology for LRI languages. The importance of feature selection is highlighted, with evidence suggesting that a combination of traditional machine learning methods and carefully selected acoustic features may offer viable options for SER in these languages. Furthermore, the review identifies a need for further exploration of semi-supervised and unsupervised approaches to enhance SER capabilities in LRI contexts. Overall, current SER systems for LRI languages lag behind state-of-the-art standards due to the lack of resources, indicating that there is still much work to be done in this area.
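As a concrete illustration of the "traditional machine learning with carefully selected acoustic features" recipe the review identifies as viable for low-resource settings, here is a minimal, hypothetical sketch using librosa and scikit-learn. The feature set, labels, and synthetic audio are assumptions for illustration only and do not correspond to any corpus discussed in the review.

```python
# Hedged sketch: hand-crafted acoustic features + a classical classifier for SER.
# The "utterances" below are synthetic stand-ins; real use would load wav files.
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def acoustic_features(y, sr=16000):
    """Mean/std-pooled MFCCs plus simple pitch and energy statistics."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)
    rms = librosa.feature.rms(y=y)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1),
                           [np.nanmean(f0), np.nanstd(f0), float(rms.mean())]])

rng = np.random.default_rng(0)
utterances = [rng.normal(0, 0.1, 16000) for _ in range(20)]  # 1 s of noise each
labels = ["angry", "neutral"] * 10                           # hypothetical labels

X = np.stack([acoustic_features(u) for u in utterances])
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X, labels)
print(clf.predict(X[:4]))
```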
{"title":"A review on speech emotion recognition for low-resource and Indigenous languages","authors":"Himashi Rathnayake , Jesin James , Gianna Leoni , Ake Nicholas , Catherine Watson , Peter Keegan","doi":"10.1016/j.specom.2025.103342","DOIUrl":"10.1016/j.specom.2025.103342","url":null,"abstract":"<div><div>Speech emotion recognition (SER) is an emerging field in human–computer interaction. Although numerous studies have focused on SER for well-resourced languages, the literature reveals a significant gap in research on low-resource and Indigenous (LRI) languages. This paper presents a comprehensive review of the existing literature on SER in the context of LRI languages, analysing critical factors to consider at each stage of designing an SER system. The review indicates that most studies on SER for LRI languages adopt emotion categories established for well-resourced languages, often assuming the universality of emotions. However, the literature suggests that this approach may be limited due to emotional disparities influenced by cultural variations. Additionally, the review underscores that current SER systems typically lack community-oriented methodologies in the development of technology for LRI languages. The importance of feature selection is highlighted, with evidence suggesting that a combination of traditional machine learning methods and carefully selected acoustic features may offer viable options for SER in these languages. Furthermore, the review identifies a need for further exploration of semi-supervised and unsupervised approaches to enhance SER capabilities in LRI contexts. Overall, current SER systems for LRI languages lag behind state-of-the-art standards due to the lack of resources, indicating that there is still much work to be done in this area.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"176 ","pages":"Article 103342"},"PeriodicalIF":3.0,"publicationDate":"2025-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145737291","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-04 | DOI: 10.1016/j.specom.2025.103343
Frank Lihui Tan, Youngah Do
This study investigates the emergence and development of universal phonetic sensitivity during early phonological learning using an unsupervised modeling approach. Autoencoder models were trained on raw acoustic input from English and Mandarin to simulate bottom-up perceptual development, with a focus on phoneme contrast learning. The results show that phoneme-like categories and feature-aligned representational spaces can emerge from context-free acoustic exposure alone. Crucially, the model exhibits universal phonetic sensitivity as a transient developmental stage that varies across contrasts and gradually gives way to language-specific perception—a trajectory that parallels infant perceptual development. Different featural contrasts remain universally discriminable for varying durations over the course of learning. These findings support the view that universal sensitivity is not innately fixed but emerges through learning, and that early phonological development proceeds along a mosaic, feature-dependent trajectory.
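A minimal sketch of the kind of autoencoder setup described above is given below, in PyTorch: acoustic frames are compressed to a low-dimensional code and reconstructed, and the code space is what one would later probe for phoneme-like structure. The frame dimensionality, layer sizes, and training step are illustrative assumptions rather than the authors' configuration.

```python
# Hedged sketch: a frame-level autoencoder over acoustic features.
import torch
import torch.nn as nn

class FrameAutoencoder(nn.Module):
    def __init__(self, n_features=39, bottleneck=16):
        super().__init__()
        # encoder compresses each acoustic frame into a low-dimensional code
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, bottleneck),
        )
        # decoder reconstructs the frame from the code
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, 128), nn.ReLU(),
            nn.Linear(128, n_features),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = FrameAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
frames = torch.randn(256, 39)              # stand-in for real MFCC/filterbank frames
recon, codes = model(frames)
loss = nn.functional.mse_loss(recon, frames)
loss.backward()
opt.step()
# Phoneme-like categories would then be probed by clustering the codes or by
# measuring discriminability of contrasts (e.g. ABX) in the learned code space.
```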
{"title":"Bottom-up modeling of phoneme learning: Universal sensitivity and language-specific transformation","authors":"Frank Lihui Tan, Youngah Do","doi":"10.1016/j.specom.2025.103343","DOIUrl":"10.1016/j.specom.2025.103343","url":null,"abstract":"<div><div>This study investigates the emergence and development of universal phonetic sensitivity during early phonological learning using an unsupervised modeling approach. Autoencoder models were trained on raw acoustic input from English and Mandarin to simulate bottom-up perceptual development, with a focus on phoneme contrast learning. The results show that phoneme-like categories and feature-aligned representational spaces can emerge from context-free acoustic exposure alone. Crucially, the model exhibits universal phonetic sensitivity as a transient developmental stage that varies across contrasts and gradually gives way to language-specific perception—a trajectory that parallels infant perceptual development. Different featural contrasts remain universally discriminable for varying durations over the course of learning. These findings support the view that universal sensitivity is not innately fixed but emerges through learning, and that early phonological development proceeds along a mosaic, feature-dependent trajectory.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"176 ","pages":"Article 103343"},"PeriodicalIF":3.0,"publicationDate":"2025-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145737290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-11-29 | DOI: 10.1016/j.specom.2025.103331
Dong Yang , Yuki Saito , Takaaki Saeki , Tomoki Koriyama , Wataru Nakata , Detai Xin , Hiroshi Saruwatari
This paper advances phrase break prediction (also known as phrasing) in multi-speaker text-to-speech (TTS) systems. We integrate speaker-specific features by leveraging speaker embeddings to enhance the performance of the phrasing model. We further demonstrate that these speaker embeddings can capture speaker-related characteristics solely from the phrasing task. In addition, we explore the potential of pre-trained speaker embeddings for unseen speakers through a few-shot adaptation method. Furthermore, we pioneer the application of phoneme-level pre-trained language models to this TTS front-end task, which significantly boosts the accuracy of the phrasing model. Our methods are rigorously assessed through both objective and subjective evaluations, demonstrating their effectiveness.
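To make the speaker-conditioning idea concrete, the following is a hypothetical PyTorch sketch in which per-phoneme encoder states are concatenated with a speaker embedding and classified as break or no-break. The small Transformer here is only a stand-in for the phoneme-level pre-trained language model, and all dimensions and module choices are assumptions, not the paper's system.

```python
# Hedged sketch: speaker-conditioned phrase break prediction at the phoneme level.
import torch
import torch.nn as nn

class SpeakerConditionedPhraser(nn.Module):
    def __init__(self, n_phonemes=100, n_speakers=50, d_model=256, d_spk=64):
        super().__init__()
        # stand-in for a (pre-trained) phoneme-level language model encoder
        self.phoneme_encoder = nn.Sequential(
            nn.Embedding(n_phonemes, d_model),
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
                num_layers=2),
        )
        self.speaker_emb = nn.Embedding(n_speakers, d_spk)
        self.classifier = nn.Linear(d_model + d_spk, 2)  # break vs. no-break per phoneme

    def forward(self, phoneme_ids, speaker_id):
        h = self.phoneme_encoder(phoneme_ids)                        # (B, T, d_model)
        s = self.speaker_emb(speaker_id).unsqueeze(1).expand(-1, h.size(1), -1)
        return self.classifier(torch.cat([h, s], dim=-1))            # (B, T, 2)

model = SpeakerConditionedPhraser()
logits = model(torch.randint(0, 100, (2, 20)), torch.tensor([3, 7]))
print(logits.shape)  # torch.Size([2, 20, 2])
```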
{"title":"Speaker-conditioned phrase break prediction for text-to-speech with phoneme-level pre-trained language model","authors":"Dong Yang , Yuki Saito , Takaaki Saeki , Tomoki Koriyama , Wataru Nakata , Detai Xin , Hiroshi Saruwatari","doi":"10.1016/j.specom.2025.103331","DOIUrl":"10.1016/j.specom.2025.103331","url":null,"abstract":"<div><div>This paper advances phrase break prediction (also known as phrasing) in multi-speaker text-to-speech (TTS) systems. We integrate speaker-specific features by leveraging speaker embeddings to enhance the performance of the phrasing model. We further demonstrate that these speaker embeddings can capture speaker-related characteristics solely from the phrasing task. Besides, we explore the potential of pre-trained speaker embeddings for unseen speakers through a few-shot adaptation method. Furthermore, we pioneer the application of phoneme-level pre-trained language models to this TTS front-end task, which significantly boosts the accuracy of the phrasing model. Our methods are rigorously assessed through both objective and subjective evaluations, demonstrating their effectiveness.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"176 ","pages":"Article 103331"},"PeriodicalIF":3.0,"publicationDate":"2025-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145685354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-11-27 | DOI: 10.1016/j.specom.2025.103335
Hikaru Yanagida , Yusuke Ijima , Naohiro Tawara
This study aims to identify individual characteristics such as age, gender, personality traits, and values that influence the perception of one’s own recorded voice. While previous studies have shown that the perception of one’s own recorded voice is different from that of others, and that these differences are influenced by individual characteristics, only a limited number of individual characteristics were examined in past research. In our study, we conducted a large-scale subjective experiment with 141 Japanese participants that covered multiple individual characteristics. Participants evaluated impressions of their own recorded voices and the voices of others, and we analyzed the relationship between each of the individual characteristics and the voice impressions. Our findings showed that individual characteristics such as the frequency of listening to one’s own recorded voice (which had not been examined in previous studies) influenced the perception of one’s own recorded voice. We further analyzed combinations of multiple individual characteristics, including those that influenced impressions when used individually, to predict impressions of one’s own recorded voice, and found that these impressions were predicted better by a combination of multiple individual characteristics than by any single individual characteristic.
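The single-versus-combined comparison can be sketched as a simple regression exercise. The hypothetical Python snippet below compares cross-validated R² for one characteristic against a combination of characteristics; the data are synthetic and all column names are assumptions for illustration, not the study's variables.

```python
# Hedged sketch: predicting an impression rating of one's own recorded voice
# from a single characteristic vs. a combination of characteristics.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 141  # same number of participants as the experiment; the data are synthetic
df = pd.DataFrame({
    "listening_frequency": rng.integers(0, 5, n),
    "age": rng.integers(20, 60, n),
    "gender_coded": rng.integers(0, 2, n),
    "extraversion": rng.normal(0, 1, n),
    "neuroticism": rng.normal(0, 1, n),
})
# toy impression rating loosely driven by several characteristics
y = (0.4 * df["listening_frequency"] + 0.2 * df["extraversion"]
     - 0.2 * df["neuroticism"] + rng.normal(0, 1, n))

single = ["listening_frequency"]
combined = ["listening_frequency", "age", "gender_coded", "extraversion", "neuroticism"]
for cols in (single, combined):
    r2 = cross_val_score(LinearRegression(), df[cols], y, cv=5, scoring="r2").mean()
    print(cols, round(r2, 3))
```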
{"title":"Effect of individual characteristics on impressions of one’s own recorded voice","authors":"Hikaru Yanagida , Yusuke Ijima , Naohiro Tawara","doi":"10.1016/j.specom.2025.103335","DOIUrl":"10.1016/j.specom.2025.103335","url":null,"abstract":"<div><div>This study aims to identify individual characteristics such as age, gender, personality traits, and values that influence the perception of one’s own recorded voice. While previous studies have shown that the perception of one’s own recorded voice is different from that of others, and that these differences are influenced by individual characteristics, only a limited number of individual characteristics were examined in past research. In our study, we conducted a large-scale subjective experiment with 141 Japanese participants using multiple individual characteristics. Participants evaluated impressions of their own recorded voices and the voices of others, and we analyzed the relationship between each of the individual characteristics and the voice impressions. Our findings showed that individual characteristics such as the frequency of listening to one’s own recorded voice (which had not been examined in the previous studies) influenced the perception of one’s own recorded voice. We further analyzed the use of combinations of multiple individual characteristics, including those that influenced impressions in a single use, to predict impressions of one’s own recorded voice and found that they were better predicted by the combination of multiple individual characteristics than by the use of a single individual characteristic.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"176 ","pages":"Article 103335"},"PeriodicalIF":3.0,"publicationDate":"2025-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145618539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-11-24 | DOI: 10.1016/j.specom.2025.103333
Theo Lepage, Reda Dehak
Deep learning models trained in a supervised setting have revolutionized audio and speech processing. However, their performance inherently depends on the quantity of human-annotated data, making them costly to scale and prone to poor generalization under unseen conditions. To address these challenges, Self-Supervised Learning (SSL) has emerged as a promising paradigm, leveraging vast amounts of unlabeled data to learn relevant representations. The application of SSL for Automatic Speech Recognition (ASR) has been extensively studied, but research on other downstream tasks, notably Speaker Recognition (SR), remains in its early stages. This work describes major SSL instance-invariance frameworks (e.g., SimCLR, MoCo, and DINO), initially developed for computer vision, along with their adaptation to SR. Various SSL methods for SR, proposed in the literature and built upon these frameworks, are also presented. An extensive review of these approaches is then conducted: (1) the effect of the main hyperparameters of SSL frameworks is investigated; (2) the role of SSL components is studied (e.g., data-augmentation, projector, positive sampling); and (3) SSL frameworks are evaluated on SR with in-domain and out-of-domain data, using a consistent experimental setup, and a comprehensive comparison of SSL methods from the literature is provided. Specifically, DINO achieves the best downstream performance and effectively models intra-speaker variability, although it is highly sensitive to hyperparameters and training conditions, while SimCLR and MoCo provide robust alternatives that effectively capture inter-speaker variability and are less prone to collapse. This work aims to highlight recent trends and advancements, identifying current challenges in the field.
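As an example of the instance-invariance objectives discussed, the following is a minimal sketch of a SimCLR-style NT-Xent loss applied to speaker projections of two augmented views of the same utterances. It illustrates the general objective only; the encoder, augmentations, and hyperparameters of the systems reviewed here are not represented.

```python
# Hedged sketch: NT-Xent (SimCLR-style) contrastive loss over speaker projections.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.07):
    """z1, z2: (N, D) projections of two augmented views of the same N utterances."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # (2N, D)
    sim = z @ z.t() / temperature                              # scaled cosine similarities
    n = z1.size(0)
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool), float("-inf"))  # drop self-pairs
    # the positive for item i is its other view, at index (i + n) mod 2n
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# toy usage: 8 utterances, 192-dim speaker projections of two augmented views
z_view1, z_view2 = torch.randn(8, 192), torch.randn(8, 192)
print(nt_xent(z_view1, z_view2))
```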
{"title":"Self-Supervised Learning for Speaker Recognition: A study and review","authors":"Theo Lepage, Reda Dehak","doi":"10.1016/j.specom.2025.103333","DOIUrl":"10.1016/j.specom.2025.103333","url":null,"abstract":"<div><div>Deep learning models trained in a supervised setting have revolutionized audio and speech processing. However, their performance inherently depends on the quantity of human-annotated data, making them costly to scale and prone to poor generalization under unseen conditions. To address these challenges, Self-Supervised Learning (SSL) has emerged as a promising paradigm, leveraging vast amounts of unlabeled data to learn relevant representations. The application of SSL for Automatic Speech Recognition (ASR) has been extensively studied, but research on other downstream tasks, notably Speaker Recognition (SR), remains in its early stages. This work describes major SSL instance-invariance frameworks (e.g., SimCLR, MoCo, and DINO), initially developed for computer vision, along with their adaptation to SR. Various SSL methods for SR, proposed in the literature and built upon these frameworks, are also presented. An extensive review of these approaches is then conducted: (1) the effect of the main hyperparameters of SSL frameworks is investigated; (2) the role of SSL components is studied (e.g., data-augmentation, projector, positive sampling); and (3) SSL frameworks are evaluated on SR with in-domain and out-of-domain data, using a consistent experimental setup, and a comprehensive comparison of SSL methods from the literature is provided. Specifically, DINO achieves the best downstream performance and effectively models intra-speaker variability, although it is highly sensitive to hyperparameters and training conditions, while SimCLR and MoCo provide robust alternatives that effectively capture inter-speaker variability and are less prone to collapse. This work aims to highlight recent trends and advancements, identifying current challenges in the field.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"176 ","pages":"Article 103333"},"PeriodicalIF":3.0,"publicationDate":"2025-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145685355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-11-24 | DOI: 10.1016/j.specom.2025.103332
Weijie Lu , Yunfeng Xu , Jintan Gu
Multimodal Dialogue Emotion Recognition is rapidly emerging as a research hotspot with broad application prospects. In recent years, researchers have invested considerable effort in integrating feature information across modalities, but detailed analysis of each modality's features remains insufficient, and differences in how strongly each modality influences the recognition results have not been fully considered. To address this problem, we propose a Transformer-based multimodal interaction model with an adaptive weighted fusion mechanism (TIAWFM). The model effectively captures deep inter-modal correlations in multimodal emotion recognition tasks, mitigating the limitations of unimodal representations. We observe that incorporating specific conversational contexts and dynamically allocating weights to each modality not only fully leverages the model’s capabilities but also enables more accurate capture of emotional information embedded in the features. We conducted extensive experiments on two benchmark multimodal datasets, IEMOCAP and MELD. Experimental results demonstrate that TIAWFM exhibits significant advantages in dynamically integrating multimodal information, leading to notable improvements in both the accuracy and robustness of emotion recognition.
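A minimal sketch of adaptive weighted fusion is shown below: per-utterance weights over the audio, text, and visual features are predicted from the features themselves and used to form a weighted sum before emotion classification. The gating design, dimensions, and module names are assumptions for illustration and do not reproduce the TIAWFM architecture.

```python
# Hedged sketch: adaptive per-sample weighting of modality features before fusion.
import torch
import torch.nn as nn

class AdaptiveWeightedFusion(nn.Module):
    def __init__(self, d=256, n_modalities=3, n_emotions=7):
        super().__init__()
        self.gate = nn.Linear(n_modalities * d, n_modalities)  # one weight per modality
        self.classifier = nn.Linear(d, n_emotions)

    def forward(self, feats):                     # feats: list of (B, d) modality vectors
        stacked = torch.stack(feats, dim=1)       # (B, M, d)
        w = torch.softmax(self.gate(stacked.flatten(1)), dim=-1)   # (B, M), sums to 1
        fused = (w.unsqueeze(-1) * stacked).sum(dim=1)             # weighted sum, (B, d)
        return self.classifier(fused), w

fusion = AdaptiveWeightedFusion()
audio, text, vision = (torch.randn(4, 256) for _ in range(3))
logits, weights = fusion([audio, text, vision])
print(logits.shape, weights[0])                   # per-sample modality weights
```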
{"title":"Adaptive weighting in a transformer framework for multimodal emotion recognition","authors":"Weijie Lu , Yunfeng Xu , Jintan Gu","doi":"10.1016/j.specom.2025.103332","DOIUrl":"10.1016/j.specom.2025.103332","url":null,"abstract":"<div><div>Multimodal Dialogue Emotion Recognition is rapidly emerging as a research hotspot with broad application prospects. In recent years, researchers have invested a lot of effort in the integration of modal feature information, but the detailed analysis of each modal feature information is still insufficient, and the difference in the influence of each modal feature information on the recognition results has not been fully considered. In order to solve this problem, we propose a Transformer-based multimodal interaction model with an adaptive weighted fusion mechanism (TIAWFM). The model effectively captures deep inter-modal correlations in multimodal emotion recognition tasks, mitigating the limitations of unimodal representations. We observe that incorporating specific conversational contexts and dynamically allocating weights to each modality not only fully leverages the model’s capabilities but also enables more accurate capture of emotional information embedded in the features. We conducted extensive experiments on two benchmark multimodal datasets, IEMOCAP and MELD. Experimental results demonstrate that TIAWFM exhibits significant advantages in dynamically integrating multimodal information, leading to notable improvements in both the accuracy and robustness of emotion recognition.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"176 ","pages":"Article 103332"},"PeriodicalIF":3.0,"publicationDate":"2025-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145594745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-11-15 | DOI: 10.1016/j.specom.2025.103330
Junrui Ni , Liming Wang , Yang Zhang , Kaizhi Qian , Heting Gao , Mark Hasegawa-Johnson , James Glass , Chang D. Yoo
Recent advancements in supervised automatic speech recognition (ASR) have achieved remarkable performance, largely due to the growing availability of large transcribed speech corpora. However, most languages lack sufficient paired speech and text data to effectively train these systems. In this article, we tackle the challenge of developing ASR systems without paired speech and text corpora by proposing the removal of reliance on a phoneme lexicon. We explore a new research direction: word-level unsupervised ASR, and experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling. Using a curated speech corpus containing a fixed number of English words, our system iteratively refines the word segmentation structure and achieves a word error rate of between 20%–23%, depending on the vocabulary size, without parallel transcripts, oracle word boundaries, or a pronunciation lexicon. This innovative model surpasses the performance of previous unsupervised ASR models under the lexicon-free setting.
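The following is a hedged sketch of masked token-infilling over discretized speech units, the type of objective described above, written in PyTorch. The unit vocabulary, mask rate, and model size are illustrative assumptions, not the authors' configuration, and the joint speech/text training is reduced here to a single masked-prediction step.

```python
# Hedged sketch: masked token-infilling over discrete speech units.
import torch
import torch.nn as nn

VOCAB, MASK_ID, MASK_RATE = 512, 0, 0.3

def mask_tokens(tokens, mask_rate=MASK_RATE, mask_id=MASK_ID):
    """Randomly replace a fraction of unit tokens with a mask symbol."""
    mask = torch.rand(tokens.shape) < mask_rate
    return tokens.masked_fill(mask, mask_id), mask

# tiny infilling model: embed corrupted units, encode, predict the original units
encoder = nn.Sequential(
    nn.Embedding(VOCAB, 128),
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(128, nhead=4, batch_first=True), num_layers=2),
    nn.Linear(128, VOCAB),
)

units = torch.randint(1, VOCAB, (8, 100))           # stand-in for discretized speech
corrupted, mask = mask_tokens(units)
logits = encoder(corrupted)                         # (8, 100, VOCAB)
loss = nn.functional.cross_entropy(logits[mask], units[mask])  # loss on masked slots only
loss.backward()
```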
{"title":"Towards unsupervised speech recognition without pronunciation models","authors":"Junrui Ni , Liming Wang , Yang Zhang , Kaizhi Qian , Heting Gao , Mark Hasegawa-Johnson , James Glass , Chang D. Yoo","doi":"10.1016/j.specom.2025.103330","DOIUrl":"10.1016/j.specom.2025.103330","url":null,"abstract":"<div><div>Recent advancements in supervised automatic speech recognition (ASR) have achieved remarkable performance, largely due to the growing availability of large transcribed speech corpora. However, most languages lack sufficient paired speech and text data to effectively train these systems. In this article, we tackle the challenge of developing ASR systems without paired speech and text corpora by proposing the removal of reliance on a phoneme lexicon. We explore a new research direction: word-level unsupervised ASR, and experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling. Using a curated speech corpus containing a fixed number of English words, our system iteratively refines the word segmentation structure and achieves a word error rate of between 20%–23%, depending on the vocabulary size, without parallel transcripts, oracle word boundaries, or a pronunciation lexicon. This innovative model surpasses the performance of previous unsupervised ASR models under the lexicon-free setting.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"176 ","pages":"Article 103330"},"PeriodicalIF":3.0,"publicationDate":"2025-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145790253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-11-11 | DOI: 10.1016/j.specom.2025.103329
Paul Foulkes, Vincent Hughes, Kayleigh Peters, Jasmine Rouse
This study compares the relative performance of formant- and MFCC-based analyses of the same dataset, extending the work of Franco-Pedroso and Gonzalez-Rodriguez (2016). Using a corpus of read speech from 24 male English speakers we extracted vowel formant data and segmentally-based MFCCs of all phonemes. Data were taken from three versions of the corpus: 10 min and 3 min samples with wholly automated segment labelling and data extraction (the 10U and 3U datasets), and 3 min samples with manually corrected segment labelling and manual checking of formant tracking (the 3C dataset). The datasets were split in half and used for nine speaker discrimination tests: six tests using formants or MFCCs in each of the 10U, 3U and 3C datasets, and three fused systems combining formants and MFCCs for each dataset.
The formant-based tests revealed that the best performing segments were /ɪ/, /eɪ/, /aɪ/, /e/, /ʌ/ and /əː/. These vowels also performed well in MFCC-based tests, along with the three nasal consonants /m, n, ŋ/ and /k/. Relatively similar patterns were found for the three datasets. There was also a correlation with segment frequency: more frequent phonemes generally yielded better results. In addition, formant-based measures gave better EER and Cllr values than segmentally-based MFCCs. For formants, the best results came from the 10U dataset, while for MFCCs the best results came from the manually corrected 3C dataset. The effect of manual correction was starkest for consonants. Finally, the fused systems performed very well, with both formant- and MFCC-based systems producing EERs close to 0 in some cases. The best systems were those using the 3C dataset. The fused 10U system generally produced notably weaker LLRs, presumably because of the inevitably larger number of data labelling errors.
While the study is not forensically realistic, it has a number of implications for forensic speaker comparison. First, the best performing segments are those vowels in which formant separation is clear, and consonants (nasals) with formant structure. Second, manual correction of data is beneficial, especially for consonants. MFCCs are high dimensional data relative to vowel formants taken at a segment’s midpoint. Misalignment of automated labelling and tracking is thus potentially more likely to have a deleterious effect on MFCCs. While the 10U dataset yielded the best scores for vowel formants, there is a danger that it overestimates the discriminatory power of those segments. A degree of manual correction is therefore worthwhile. Finally, although MFCC data yielded worse scores on a segment by segment basis, the fused system worked very well. Further research is therefore merited on MFCC-based analysis of segments as variables in speaker comparison, and more broadly in phonetic research.
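Two of the evaluation steps mentioned above, computing an equal error rate (EER) from same- and different-speaker comparison scores and fusing formant- and MFCC-based scores at the score level, can be sketched as follows. The score distributions are synthetic stand-ins, not data from the study, and logistic-regression fusion is only one common choice of fusion method.

```python
# Hedged sketch: EER from comparison scores, plus simple score-level fusion.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

def eer(scores, labels):
    """labels: 1 = same-speaker pair, 0 = different-speaker pair."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.argmin(np.abs(fnr - fpr))        # point where the two error rates cross
    return (fpr[idx] + fnr[idx]) / 2

rng = np.random.default_rng(0)
labels = np.repeat([1, 0], 500)
formant_scores = np.concatenate([rng.normal(1.5, 1, 500), rng.normal(-1.5, 1, 500)])
mfcc_scores = np.concatenate([rng.normal(1.0, 1, 500), rng.normal(-1.0, 1, 500)])

print("formant EER:", eer(formant_scores, labels))
print("MFCC EER:   ", eer(mfcc_scores, labels))

# simple score-level fusion: learn weights for the two systems' scores
X = np.column_stack([formant_scores, mfcc_scores])
fused = LogisticRegression().fit(X, labels).decision_function(X)
print("fused EER:  ", eer(fused, labels))
```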
{"title":"The discriminative capacity of English segments in forensic speaker comparison","authors":"Paul Foulkes, Vincent Hughes, Kayleigh Peters, Jasmine Rouse","doi":"10.1016/j.specom.2025.103329","DOIUrl":"10.1016/j.specom.2025.103329","url":null,"abstract":"<div><div>This study compares the relative performance of formant- and MFCC-based analyses of the same dataset, extending the work of Franco-Pedroso and Gonzalez-Rodriguez (2016). Using a corpus of read speech from 24 male English speakers we extracted vowel formant data and segmentally-based MFCCs of all phonemes. Data were taken from three versions of the corpus: 10 min and 3 min samples with wholly automated segment labelling and data extraction (the 10U and 3U datasets), and 3 min samples with manually corrected segment labelling and manual checking of formant tracking (the 3C dataset). The datasets were split in half and used for nine speaker discrimination tests: six tests using formants or MFCCs in each of the 10U, 3U and 3C datasets, and three fused systems combining formants and MFCCs for each dataset.</div><div>The formant-based tests revealed that the best performing segments were /ɪ/, /eɪ/, /aɪ/, /e/, /ʌ/ and /əː/. These vowels also performed well in MFCC-based tests, along with the three nasal consonants /m, n, ŋ/ and /k/. Relatively similar patterns were found for the three datasets. There was also a correlation with segment frequency: more frequent phonemes generally yielded better results. In addition, formant-based measures gave better EER and <em>C</em><sub>llr</sub> values than segmentally-based MFCCs. For formants, the best results came from the 10U dataset, while for MFCCs the best results came from the manually corrected 3C dataset. The effect of manual correction was starkest for consonants. Finally, the fused systems performed very well, with both formant- and MFCC-based systems producing EERs close to 0 in some cases. The best systems were those using the 3C dataset. The fused 10U system generally produced notably weaker LLRs, presumably because of the inevitably larger number of data labelling errors.</div><div>While the study is not forensically realistic, it has a number of implications for forensic speaker comparison. First, the best performing segments are those vowels in which formant separation is clear, and consonants (nasals) with formant structure. Second, manual correction of data is beneficial, especially for consonants. MFCCs are high dimensional data relative to vowel formants taken at a segment’s midpoint. Misalignment of automated labelling and tracking is thus potentially more likely to have a deleterious effect on MFCCs. While the 10U dataset yielded the best scores for vowel formants, there is a danger that it overestimates the discriminatory power of those segments. A degree of manual correction is therefore worthwhile. Finally, although MFCC data yielded worse scores on a segment by segment basis, the fused system worked very well. 
Further research is therefore merited on MFCC-based analysis of segments as variables in speaker comparison, and more broadly in phonetic research.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"176 ","pages":"Article 103329"},"PeriodicalIF":3.0,"publicationDate":"2025-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145790252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-11-01 | DOI: 10.1016/j.specom.2025.103324
Eija M.A. Aalto , Hana Ben Asker , Lucie Ménard , Walcir Cardoso , Catherine Laporte
Several publications have explored second language (L2) articulation through lingual ultrasound imaging technology. This systematic review and thematic analysis collate and evaluate these studies, focusing on methodologies, experimental setups, and findings. The review includes 31 works: 23 on ultrasound biofeedback and 8 on characterizing L2 articulation. English is the predominant language studied (82 % as L1 or L2), with participants mainly young adults (2–60 participants per study). The 23 ultrasound biofeedback studies showed significant variation in session numbers and length, including 16 PICO studies (i.e. study designs with participants, intervention, controls/comparison group, and outcome) where ultrasound biofeedback was compared to auditory feedback and/or control conditions. Data analysis in biofeedback studies often included acoustic or perceptual assessments in addition to, or instead of, ultrasound data analysis. Analysis of the results indicates that ultrasound biofeedback is effective for improving L2 articulation. However, the PICO studies revealed that while ultrasound biofeedback may offer certain advantages, these findings remain preliminary and warrant further investigation. Learner characteristics and target selection may affect biofeedback efficacy. Ultrasound also proved valuable for characterizing L2 articulation by showing articulatory and coarticulatory patterns, particularly in English sounds like /ɹ/, /l/, and various vowels. L2 characterization studies frequently used dynamic speech movement analysis. Moving forward, researchers are encouraged to use dynamic movement analysis in biofeedback studies as well, to deepen understanding of articulation processes. Expanding linguistic and demographic diversity in future research is essential to capturing language heterogeneity.
{"title":"Ultrasound imaging in second language research: Systematic review and thematic analysis","authors":"Eija M.A. Aalto , Hana Ben Asker , Lucie Ménard , Walcir Cardoso , Catherine Laporte","doi":"10.1016/j.specom.2025.103324","DOIUrl":"10.1016/j.specom.2025.103324","url":null,"abstract":"<div><div>Several publications have explored second language (L2) articulation through lingual ultrasound imaging technology. This systematic review and thematic analysis collate and evaluate these studies, focusing on methodologies, experimental setups, and findings. The review includes 31 works: 23 on ultrasound biofeedback and 8 on characterizing L2 articulation. English is the predominant language studied (82 % as L1 or L2), with participants mainly young adults (2–60 participants per study). The 23 ultrasound biofeedback studies showed significant variation in session numbers and length, including 16 PICO studies (i.e. study design with participants, intervention, controls/comparison group, outcome) where ultrasound biofeedback was compared to auditory feedback and/or control conditions. Data analysis of biofeedback studies often included acoustic or perceptual assessments in addition or instead of ultrasound data analysis. Analysis of results indicate that ultrasound biofeedback is effective for improving L2 articulation. However, the PICO studies revealed that while ultrasound biofeedback may offer certain advantages, these findings remain preliminary and warrant further investigation. Learner characteristics and target selection may affect biofeedback efficacy. Ultrasound also proved valuable for characterizing L2 articulation by showing articulatory and coarticulatory patterns, particularly in English sounds like /ɹ/, /l/, and various vowels. L2 characterization studies frequently used dynamic speech movement analysis. Moving forward, researchers are encouraged to use dynamic movement analysis also in biofeedback studies to deepen understanding of articulation processes. Expanding linguistic and demographic diversity in future research is essential to capturing language heterogeneity.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"175 ","pages":"Article 103324"},"PeriodicalIF":3.0,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145474243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}