
Speech Communication: Latest Publications

The effects of informational and energetic/modulation masking on the efficiency and ease of speech communication across the lifespan
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2024-07-01, DOI: 10.1016/j.specom.2024.103101
Outi Tuomainen, Stuart Rosen, Linda Taschenberger, Valerie Hazan

Children and older adults have greater difficulty understanding speech when there are other voices in the background (informational masking, IM) than when the interference is a steady-state noise with a similar spectral profile but is not speech (due to modulation and energetic masking; EM/MM). We evaluated whether this IM vs. EM/MM difference for certain age ranges was found for broader measures of communication efficiency and ease in 114 participants aged between 8 and 80. Participants carried out interactive diapix problem-solving tasks in age-band- and sex-matched pairs, in quiet and with different maskers in the background affecting both participants. Three measures were taken: (a) task transaction time (communication efficiency), (b) performance on a secondary auditory task simultaneously carried out during diapix, and (c) post-test subjective ratings of effort, concentration, difficulty and noisiness (communication ease). Although participants did not take longer to complete the task when in challenging conditions, effects of IM vs. EM/MM were clearly seen on the other measures. Relative to the EM/MM and quiet conditions, participants in IM conditions were less able to attend to the secondary task and reported greater effects of the masker type on their perceived degree of effort, concentration, difficulty and noisiness. However, we found no evidence of decreased communication efficiency and ease in IM relative to EM/MM for children and older adults in any of our measures. The clearest effects of age were observed in transaction time and secondary task measures. Overall, communication efficiency gradually improved between the ages of 8 and 18 years, and performance on the secondary task improved over younger ages (until 30 years) and gradually decreased after 50 years of age. Finally, we also found an impact of communicative role on performance. In adults, the participant who was asked to take the lead in the task and who spoke the most performed worse on the secondary task than the person who was mainly in a ‘listening’ role and responding to queries. These results suggest that when a broader evaluation of speech communication is carried out that more closely resembles typical communicative situations, the more acute effects of IM typically seen in populations at the extremes of the lifespan are minimised, potentially due to the presence of multiple information sources, which allow the use of varying communication strategies. Such a finding is relevant for clinical evaluations of speech communication.

Citations: 0
Pathological voice classification using MEEL features and SVM-TabNet model
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2024-07-01, DOI: 10.1016/j.specom.2024.103100
Mohammed Zakariah, Muna Al-Razgan, Taha Alfakih

In clinical settings, early diagnosis and objective assessment depend on the detection of voice pathology. To classify anomalous voices, this work uses an approach that combines the SVM-TabNet fusion model with MEEL (Mel-Frequency Energy Line) features. Further, the dataset consists of 1037 speech files, including recordings from people with laryngocele and Vox senilis as well as from healthy persons. Additionally, the main goal is to create an efficient classification model that can differentiate between normal and abnormal voice patterns. Modern techniques frequently lack the accuracy required for a precise diagnosis, which highlights the need for novel strategies. The suggested approach uses an SVM-TabNet fusion model for classification after feature extraction using MEEL characteristics. MEEL features provide extensive information for categorization by capturing complex patterns in audio transmissions. Moreover, by combining the advantages of SVM and TabNet models, classification performance is improved. Evaluation on the test data yields remarkable results: 99.7 % accuracy, 0.992 F1 score, 0.996 precision, and 0.995 recall. Further testing on additional datasets reliably validates outstanding performance, with 99.4 % accuracy, 0.99 F1 score, 0.998 precision, and 0.989 recall. Furthermore, using the Saarbruecken Voice Database (SVD), the suggested methodology achieves an impressive accuracy of 99.97 %, demonstrating its durability and generalizability across many datasets. Overall, this work shows how the SVM-TabNet fusion model with MEEL characteristics may be used to accurately and consistently classify diseased voices, providing encouraging opportunities for clinical diagnosis and therapy tracking.
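
As a rough illustration of the pipeline the abstract describes, the sketch below extracts per-band log-Mel energy statistics as a stand-in for the MEEL features (the paper's exact feature definition is not reproduced here) and late-fuses an SVM with a second tabular classifier; a gradient-boosting model is used as a placeholder for TabNet. All names, parameters and the fusion rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier

def extract_meel_like(path, sr=16000, n_mels=40):
    """Per-band log-Mel energy statistics (a stand-in for MEEL features)."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    logmel = librosa.power_to_db(mel)                        # (n_mels, n_frames)
    return np.concatenate([logmel.mean(axis=1), logmel.std(axis=1)])

def fit_fusion(X, y):
    svm = SVC(probability=True).fit(X, y)                    # margin-based classifier
    tab = GradientBoostingClassifier().fit(X, y)             # placeholder for TabNet
    return svm, tab

def predict_fusion(models, X):
    svm, tab = models
    proba = 0.5 * svm.predict_proba(X) + 0.5 * tab.predict_proba(X)  # late fusion
    return proba.argmax(axis=1)
```

Averaging the two models' class probabilities is one simple fusion rule; the actual fusion strategy in the paper may differ.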

Citations: 0
Analyzing the influence of different speech data corpora and speech features on speech emotion recognition: A review
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2024-07-01, DOI: 10.1016/j.specom.2024.103102

Emotion recognition from speech has become crucial in human-computer interaction and affective computing applications. This review paper examines the complex relationship between two critical factors for speech emotion classification accuracy: the selection of speech data corpora and the extraction of speech features. Through an extensive analysis of literature from 2014 to 2023, publicly available speech datasets are explored and categorized based on their diversity, scale, linguistic attributes, and emotional classifications. The importance of various speech features, from basic spectral features to sophisticated prosodic cues, and their influence on emotion recognition accuracy is analyzed. In the context of speech data corpora, this review paper unveils trends and insights from comparative studies exploring the repercussions of dataset choice on recognition efficacy. Various datasets such as IEMOCAP, EMODB, and MSP-IMPROV are scrutinized in terms of their influence on the classification accuracy of speech emotion recognition (SER) systems. At the same time, potential challenges associated with dataset limitations are also examined. Notable features like Mel-frequency cepstral coefficients, pitch, intensity, and prosodic patterns are evaluated for their contributions to emotion recognition. Advanced feature extraction methods, too, are explored for their potential to capture intricate emotional dynamics. Moreover, this review paper offers insights into the methodological aspects of emotion recognition, shedding light on the diverse machine learning and deep learning approaches employed. Through a holistic synthesis of research findings, this review paper observes connections between the choice of speech data corpus, selection of speech features, and resulting emotion recognition accuracy. As the field continues to evolve, avenues for future research are proposed, ranging from enhanced feature extraction techniques to the development of standardized benchmark datasets. In essence, this review serves as a compass guiding researchers and practitioners through the intricate landscape of speech emotion recognition, offering a nuanced understanding of the factors shaping speech emotion recognition accuracy.
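
To make the feature families discussed in the review concrete, here is a minimal, hedged sketch of extracting MFCCs, pitch (F0) and intensity with librosa and pooling them into a fixed-length utterance vector. The specific statistics and parameter values are illustrative choices, not prescriptions from the review.

```python
import numpy as np
import librosa

def ser_features(path, sr=16000):
    """Utterance-level MFCC, pitch and intensity statistics for SER experiments."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)             # spectral envelope
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)
    rms = librosa.feature.rms(y=y)[0]                              # intensity proxy
    # Pool frame-level features into one fixed-length vector per utterance.
    return np.hstack([mfcc.mean(axis=1), mfcc.std(axis=1),
                      np.nanmean(f0), np.nanstd(f0),               # F0 stats (NaN = unvoiced)
                      rms.mean(), rms.std()])
```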

Citations: 0
Emotions recognition in audio signals using an extension of the latent block model
IF 3.2, CAS Tier 3 (Computer Science), Q1 Arts and Humanities, Pub Date: 2024-06-01, DOI: 10.1016/j.specom.2024.103092
Abir El Haj

Emotion detection in human speech is a significant area of research, crucial for various applications such as affective computing and human–computer interaction. Despite advancements, accurately categorizing emotional states in speech remains challenging due to its subjective nature and the complexity of human emotions. To address this, we propose leveraging Mel frequency cepstral coefficients (MFCCS) and extend the latent block model (LBM) probabilistic clustering technique with a Gaussian multi-way latent block model (GMWLBM). Our objective is to categorize speech emotions into coherent groups based on the emotional states conveyed by speakers. We employ MFCCS from time-series audio data and utilize a variational Expectation Maximization method to estimate GMWLBM parameters. Additionally, we introduce an integrated Classification Likelihood (ICL) model selection criterion to determine the optimal number of clusters, enhancing robustness. Numerical experiments on real data from the Berlin Database of Emotional Speech (EMO-DB) demonstrate our method’s efficacy in accurately detecting and classifying emotional states in human speech, even in challenging real-world scenarios, thereby contributing significantly to affective computing and human–computer interaction applications.
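
The multi-way latent block model itself is not available as an off-the-shelf package, but the model-selection idea can be sketched with a simpler stand-in: fit Gaussian mixtures to MFCC feature vectors over a range of cluster counts and keep the solution that minimises an ICL-style criterion (BIC plus a classification-entropy penalty). The code below is that simplified stand-in, not the paper's GMWLBM or its variational EM.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def icl_select(X, k_range=range(2, 9), seed=0):
    """Return (best_k, fitted_model) under an ICL-like criterion (lower is better)."""
    best_k, best_icl, best_model = None, np.inf, None
    for k in k_range:
        gmm = GaussianMixture(n_components=k, random_state=seed).fit(X)
        tau = gmm.predict_proba(X)                        # soft cluster responsibilities
        entropy = -np.sum(tau * np.log(tau + 1e-12))      # classification entropy
        icl = gmm.bic(X) + 2.0 * entropy                  # BIC plus entropy penalty
        if icl < best_icl:
            best_k, best_icl, best_model = k, icl, gmm
    return best_k, best_model
```

The entropy term rewards well-separated clusters, which is the usual motivation for preferring ICL over plain BIC when the goal is classification rather than density estimation.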

Citations: 0
Summary of the DISPLACE challenge 2023 - DIarization of SPeaker and LAnguage in Conversational Environments
IF 3.2, CAS Tier 3 (Computer Science), Q1 Arts and Humanities, Pub Date: 2024-05-11, DOI: 10.1016/j.specom.2024.103080
Shikha Baghel, Shreyas Ramoji, Somil Jain, Pratik Roy Chowdhuri, Prachi Singh, Deepu Vijayasenan, Sriram Ganapathy

In multi-lingual societies, where multiple languages are spoken in a small geographic vicinity, informal conversations often involve a mix of languages. Existing speech technologies may be inefficient in extracting information from such conversations, where the speech data is rich in diversity with multiple languages and speakers. The DISPLACE (DIarization of SPeaker and LAnguage in Conversational Environments) challenge constitutes an open call for evaluating and benchmarking the speaker and language diarization technologies on this challenging condition. To facilitate this challenge, a real-world dataset featuring multilingual, multi-speaker conversational far-field speech was recorded and distributed. The challenge entailed two tracks: Track-1 focused on speaker diarization (SD) in multilingual situations, while Track-2 addressed language diarization (LD) in a multi-speaker scenario. Both the tracks were evaluated using the same underlying audio data. Furthermore, a baseline system was made available for both the SD and LD tasks, which mimicked the state of the art in these tasks. The challenge garnered a total of 42 world-wide registrations and received 19 combined submissions for Track-1 and Track-2. This paper describes the challenge, details of the datasets, tasks, and the baseline system. Additionally, the paper provides a concise overview of the submitted systems in both tracks, with an emphasis given to the top performing systems. The paper also presents insights and future perspectives for SD and LD tasks, focusing on the key challenges that the systems need to overcome before wide-spread commercial deployment on such conversations.
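
Speaker diarization systems such as those submitted to Track-1 are commonly scored with the diarization error rate (DER). A minimal example of computing DER, assuming the pyannote.metrics toolkit (which handles the optimal mapping between reference and system speaker labels internally), might look like the sketch below; the segments are invented purely for illustration.

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

reference, hypothesis = Annotation(), Annotation()
reference[Segment(0.0, 4.0)] = "spk_A"      # ground-truth speaker turns
reference[Segment(4.0, 9.0)] = "spk_B"
hypothesis[Segment(0.0, 4.5)] = "s1"        # system output with its own labels
hypothesis[Segment(4.5, 9.0)] = "s2"

der = DiarizationErrorRate()
print(f"DER = {der(reference, hypothesis):.3f}")   # miss + false alarm + confusion
```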

Citations: 0
End-to-end integration of speech separation and voice activity detection for low-latency diarization of telephone conversations
IF 3.2, CAS Tier 3 (Computer Science), Q1 Arts and Humanities, Pub Date: 2024-05-11, DOI: 10.1016/j.specom.2024.103081
Giovanni Morrone, Samuele Cornell, Luca Serafini, Enrico Zovato, Alessio Brutti, Stefano Squartini

Recent works show that speech separation guided diarization (SSGD) is an increasingly promising direction, mainly thanks to the recent progress in speech separation. It performs diarization by first separating the speakers and then applying voice activity detection (VAD) on each separated stream. In this work we conduct an in-depth study of SSGD in the conversational telephone speech (CTS) domain, focusing mainly on low-latency streaming diarization applications. We consider three state-of-the-art speech separation (SSep) algorithms and study their performance both in online and offline scenarios, considering non-causal and causal implementations as well as continuous SSep (CSS) windowed inference. We compare different SSGD algorithms on two widely used CTS datasets: CALLHOME and Fisher Corpus (Part 1 and 2), and evaluate both separation and diarization performance. To improve performance, a novel, causal and computationally efficient leakage removal algorithm is proposed, which significantly decreases false alarms. We also explore, for the first time, fully end-to-end SSGD integration between SSep and VAD modules. Crucially, this enables fine-tuning on real-world data for which oracle speaker sources are not available. In particular, our best model achieves 8.8% DER on CALLHOME, which outperforms the current state-of-the-art end-to-end neural diarization model, despite being trained on an order of magnitude less data and having significantly lower latency, i.e., 0.1 vs. 1 s. Finally, we also show that the separated signals can be readily used for automatic speech recognition, reaching performance close to using oracle sources in some configurations.
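
A schematic sketch of the SSGD idea described above: take the streams produced by a separation front-end (assumed pre-computed here), apply a simple energy-based VAD to each stream, and collect per-speaker segments. The cross-stream energy check is only a crude stand-in for the paper's leakage-removal algorithm, and all thresholds are illustrative assumptions.

```python
import numpy as np

def energy_vad(stream, sr, win=0.04, thr_db=-40.0):
    """Frame-level VAD: active where frame RMS is within thr_db of the peak frame."""
    hop = int(win * sr)
    n_frames = len(stream) // hop
    frames = stream[: n_frames * hop].reshape(n_frames, hop)
    rms = np.sqrt((frames ** 2).mean(axis=1) + 1e-12)
    db = 20 * np.log10(rms / (rms.max() + 1e-12))
    return db > thr_db

def ssgd_segments(streams, sr, win=0.04, leak_ratio=0.2):
    """streams: separated waveforms, one per estimated speaker, equal length."""
    hop = int(win * sr)
    segments = []
    for spk, stream in enumerate(streams):
        active = energy_vad(stream, sr, win)
        for i, is_active in enumerate(active):
            if not is_active:
                continue
            sl = slice(i * hop, (i + 1) * hop)
            # Crude leakage check (assumption): skip frames where this stream is
            # much weaker than the strongest stream in the same time window.
            frame_energy = np.abs(stream[sl]).mean()
            max_energy = max(np.abs(s[sl]).mean() for s in streams)
            if frame_energy < leak_ratio * max_energy:
                continue
            segments.append((i * win, (i + 1) * win, spk))
    return sorted(segments)
```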

Citations: 0
Visual-articulatory cues facilitate children with CIs to better perceive Mandarin tones in sentences
IF 3.2, CAS Tier 3 (Computer Science), Q1 Arts and Humanities, Pub Date: 2024-05-01, DOI: 10.1016/j.specom.2024.103084
Ping Tang, Shanpeng Li, Yanan Shen, Qianxi Yu, Yan Feng

Children with cochlear implants (CIs) face challenges in tonal perception under noise. Nevertheless, our previous research demonstrated that seeing visual-articulatory cues (speakers’ facial/head movements) helped these children perceive isolated tones better, particularly in noisy environments, with those implanted earlier gaining more benefits. However, tones in daily speech typically occur in sentence contexts where visual cues are largely reduced compared to those in isolated contexts. It was thus unclear if visual benefits on tonal perception still held in these challenging sentence contexts. Therefore, this study tested 64 children with CIs and 64 age-matched normal-hearing (NH) children. Target tones in sentence-medial position were presented in audio-only (AO) or audiovisual (AV) conditions, in quiet and noisy environments. Children selected the target tone using a picture-point task. The results showed that, while NH children did not show any perception difference between AO and AV conditions, children with CIs significantly improved their perceptual accuracy from AO to AV conditions. The degree of improvement was negatively correlated with their implantation ages. Therefore, children with CIs were able to use visual-articulatory cues to facilitate their tonal perception even in sentence contexts, and earlier auditory experience might be important in shaping this ability.

Citations: 0
The prosody of theme, rheme and focus in Egyptian Arabic: A quantitative investigation of tunes, configurations and speaker variability
IF 3.2, CAS Tier 3 (Computer Science), Q1 Arts and Humanities, Pub Date: 2024-05-01, DOI: 10.1016/j.specom.2024.103082
Dina El Zarka, Anneliese Kelterer, Michele Gubian, Barbara Schuppler

This paper investigates the prosody of sentences elicited in three Information Structure (IS) conditions: all-new, theme-rheme and rhematic focus-background. The sentences were produced by 18 speakers of Egyptian Arabic (EA). This is the first quantitative study to provide a comprehensive analysis of holistic f0 contours (by means of GAMM) and configurations of f0, duration and intensity (by means of FPCA) associated with the three IS conditions, both across and within speakers. A significant difference between focus-background and the other information structure conditions was found, but also strong inter-speaker variation in terms of strategies and the degree to which these strategies were applied. The results suggest that post-focus register lowering and the duration of the stressed syllables of the focused and the utterance-final word are more consistent cues to focus than a higher peak of the focus accent. In addition, some independence of duration and intensity from f0 could be identified. These results thus support the assumption that, when focus is marked prosodically in EA, it is marked by prominence. Nevertheless, the fact that a considerable number of EA speakers did not apply prosodic marking and the fact that prosodic focus marking was gradient rather than categorical suggest that EA does not have a fully conventionalized prosodic focus construction.

Citations: 0
Factorized and progressive knowledge distillation for CTC-based ASR models
IF 3.2, CAS Tier 3 (Computer Science), Q1 Arts and Humanities, Pub Date: 2024-05-01, DOI: 10.1016/j.specom.2024.103071
Sanli Tian, Zehan Li, Zhaobiao Lyv, Gaofeng Cheng, Qing Xiao, Ta Li, Qingwei Zhao

Knowledge distillation (KD) is a popular model compression method to improve the performance of lightweight models by transferring knowledge from a teacher model to a student model. However, applying KD to connectionist temporal classification (CTC) ASR models is challenging due to their peaky posterior property. In this paper, we propose to address this issue by treating non-blank and blank frames differently for two main reasons. First, the non-blank frames in the teacher model’s posterior matrix and hidden representations provide more acoustic and linguistic information than the blank frames, but non-blank frames account for only a small fraction of all frames, leading to a severe learning imbalance problem. Second, the non-blank tokens in the teacher’s blank-frame posteriors exhibit irregular probability distributions, negatively impacting the student model’s learning. Thus, we propose to factorize the distillation of non-blank and blank frames and further combine them into a progressive KD framework, which contains three incremental stages to facilitate the student model gradually building up its knowledge. The first stage involves a simple binary classification KD task, in which the student learns to distinguish between non-blank and blank frames, as the two types of frames are learned separately in subsequent stages. The second stage is a factorized representation-based KD, in which hidden representations are divided into non-blank and blank frames so that both can be distilled in a balanced manner. In the third stage, the student learns from the teacher’s posterior matrix through our proposed method, factorized KL-divergence (FKL), which performs different operations on blank and non-blank frame posteriors to alleviate the imbalance issue and reduce the influence of irregular probability distributions. Compared to the baseline, our proposed method achieves 22.5% relative CER reduction on the Aishell-1 dataset, 23.0% relative WER reduction on the Tedlium-2 dataset, and 17.6% relative WER reduction on the LibriSpeech dataset. To show the generalization of our method, we also evaluate it on the hybrid CTC/Attention architecture as well as on scenarios with cross-model topology KD.
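
The factorised treatment of blank and non-blank frames can be sketched as a frame-level distillation loss in PyTorch: split frames according to whether the teacher's top token is the blank symbol and weight the two KL terms separately. The formulation below (mask definition, temperature, weights) is an assumption in the spirit of the paper, not the authors' exact FKL.

```python
import torch
import torch.nn.functional as F

def factorized_kd_loss(student_logits, teacher_logits, blank_id=0,
                       w_nonblank=1.0, w_blank=0.1, T=2.0):
    """logits: (batch, time, vocab). Returns a scalar distillation loss."""
    t_prob = F.softmax(teacher_logits / T, dim=-1)
    s_logp = F.log_softmax(student_logits / T, dim=-1)
    # A frame counts as "blank" when the teacher's top token is the blank symbol.
    blank_mask = t_prob.argmax(dim=-1).eq(blank_id)                # (batch, time)
    kl = F.kl_div(s_logp, t_prob, reduction="none").sum(-1)        # per-frame KL
    loss_blank = kl[blank_mask].mean() if blank_mask.any() else kl.new_zeros(())
    loss_nonblank = kl[~blank_mask].mean() if (~blank_mask).any() else kl.new_zeros(())
    # Averaging each group separately rebalances the scarce non-blank frames.
    return w_nonblank * loss_nonblank + w_blank * loss_blank
```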

Citations: 0
Optimization-based planning of speech articulation using general Tau Theory
IF 3.2, CAS Tier 3 (Computer Science), Q1 Arts and Humanities, Pub Date: 2024-05-01, DOI: 10.1016/j.specom.2024.103083
Benjamin Elie, Juraj Šimko, Alice Turk

This paper presents a model of speech articulation planning and generation based on General Tau Theory and Optimal Control Theory. Because General Tau Theory assumes that articulatory targets are always reached, the model accounts for speech variation via context-dependent articulatory targets. Targets are chosen via the optimization of a composite objective function. This function models three different task requirements: maximal intelligibility, minimal articulatory effort and minimal utterance duration. The paper shows that systematic phonetic variability can be reproduced by adjusting the weights assigned to each task requirement. Weights can be adjusted globally to simulate different speech styles, and can be adjusted locally to simulate different levels of prosodic prominence. The solution of the optimization procedure contains Tau equation parameter values for each articulatory movement, namely position of the articulator at the movement offset, movement duration, and a parameter which relates to the shape of the movement’s velocity profile. The paper presents simulations which illustrate the ability of the model to predict or reproduce several well-known characteristics of speech. These phenomena include close-to-symmetric velocity profiles for articulatory movement, variation related to speech rate, centralization of unstressed vowels, lengthening of stressed vowels, lenition of unstressed lingual stop consonants, and coarticulation of stop consonants.
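
The composite objective can be illustrated with a toy one-dimensional planner: choose a context-dependent articulatory target and movement duration by minimising a weighted sum of intelligibility, effort and duration costs. The cost terms, parameters and weights below are assumed, simplified forms for illustration, not the paper's model.

```python
import numpy as np
from scipy.optimize import minimize

def plan_movement(x_start, x_canonical, w_intel=1.0, w_effort=0.2, w_dur=0.1):
    """Return (context-dependent target, movement duration in s) for one articulator."""
    def cost(params):
        x_target, duration = params
        intel = (x_target - x_canonical) ** 2                     # undershoot penalty
        effort = ((x_target - x_start) / max(duration, 1e-3)) ** 2  # velocity proxy
        return w_intel * intel + w_effort * effort + w_dur * duration
    res = minimize(cost, x0=[x_canonical, 0.15],
                   bounds=[(None, None), (0.05, 0.6)])
    return res.x

# Raising the effort/duration weights yields more reduced, "casual" targets;
# raising the intelligibility weight mimics hyperarticulated, prominent speech.
```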

Citations: 0