Interspeech最新文献

英文中文

Contrastive Learning Approach for Assessment of Phonological Precision in Patients with Tongue Cancer Using MRI Data. 对比学习方法在舌癌患者语音准确性评估中的应用。

Interspeech

Pub Date : 2024-09-01 DOI: 10.21437/interspeech.2024-2236

Tomás Arias-Vergara, Paula Andrea Pérez-Toro, Xiaofeng Liu, Fangxu Xing, Maureen Stone, Jiachen Zhuo, Jerry L Prince, Maria Schuster, Elmar Nöth, Jonghye Woo, Andreas Maier

Magnetic Resonance Imaging (MRI) allows analyzing speech production by capturing high-resolution images of the dynamic processes in the vocal tract. In clinical applications, combining MRI with synchronized speech recordings leads to improved patient outcomes, especially if a phonological-based approach is used for assessment. However, when audio signals are unavailable, the recognition accuracy of sounds is decreased when using only MRI data. We propose a contrastive learning approach to improve the detection of phonological classes from MRI data when acoustic signals are not available at inference time. We demonstrate that frame-wise recognition of phonological classes improves from an f1 of 0.74 to 0.85 when the contrastive loss approach is implemented. Furthermore, we show the utility of our approach in the clinical application of using such phonological classes to assess speech disorders in patients with tongue cancer, yielding promising results in the recognition task.

磁共振成像（MRI）可以通过捕捉声道中动态过程的高分辨率图像来分析语音产生。在临床应用中，将MRI与同步语音记录相结合可以改善患者的预后，特别是如果使用基于语音的方法进行评估。然而，当音频信号不可用时，仅使用MRI数据对声音的识别精度会降低。我们提出了一种对比学习方法，以提高在推理时声学信号不可用时从MRI数据中检测语音类别的能力。我们证明，当实现对比损失方法时，逐帧语音分类识别的f1从0.74提高到0.85。此外，我们展示了我们的方法在临床应用中的实用性，使用这种语音分类来评估舌癌患者的语言障碍，在识别任务中产生了有希望的结果。

{"title":"Contrastive Learning Approach for Assessment of Phonological Precision in Patients with Tongue Cancer Using MRI Data.","authors":"Tomás Arias-Vergara, Paula Andrea Pérez-Toro, Xiaofeng Liu, Fangxu Xing, Maureen Stone, Jiachen Zhuo, Jerry L Prince, Maria Schuster, Elmar Nöth, Jonghye Woo, Andreas Maier","doi":"10.21437/interspeech.2024-2236","DOIUrl":"10.21437/interspeech.2024-2236","url":null,"abstract":"Magnetic Resonance Imaging (MRI) allows analyzing speech production by capturing high-resolution images of the dynamic processes in the vocal tract. In clinical applications, combining MRI with synchronized speech recordings leads to improved patient outcomes, especially if a phonological-based approach is used for assessment. However, when audio signals are unavailable, the recognition accuracy of sounds is decreased when using only MRI data. We propose a contrastive learning approach to improve the detection of phonological classes from MRI data when acoustic signals are not available at inference time. We demonstrate that frame-wise recognition of phonological classes improves from an f1 of 0.74 to 0.85 when the contrastive loss approach is implemented. Furthermore, we show the utility of our approach in the clinical application of using such phonological classes to assess speech disorders in patients with tongue cancer, yielding promising results in the recognition task.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"2024 ","pages":"927-931"},"PeriodicalIF":0.0,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11671147/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142900847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Segmental and Suprasegmental Speech Foundation Models for Classifying Cognitive Risk Factors: Evaluating Out-of-the-Box Performance.

Interspeech

Pub Date : 2024-09-01 DOI: 10.21437/interspeech.2024-2063

Si-Ioi Ng, Lingfeng Xu, Kimberly D Mueller, Julie Liss, Visar Berisha

Speech foundation models are remarkably successful in various consumer applications, prompting their extension to clinical use-cases. This is challenged by small clinical datasets, which precludes effective fine-tuning. We tested the efficacy of two models to classify participants by segmental (Wav2Vec2.0) and suprasegmental (Trillsson) speech analysis windows. Analysis at both time scales has shown differences in the context of cognitive decline. Speakers were classified as healthy controls (HC), Amyloid-β+ (Aβ+), mild cognitive impairment (MCI), or dementia. A subset of W2V2 and Trillsson representations showed large effect size between HC and each risk factor. Cross-validation showed W2V2 consistently outperforms Trillsson. Mean macro-F1 of 54.1%, 63.5%, and 72.0% in were found for classifying Aβ+, MCI, and dementia from HC. Repeatability of Trillsson and W2V2 showed intraclass correlations of 0.30 and 0.41. Reliability of such models must be enhanced for clinical speech analysis and longitudinal tracking.

{"title":"Segmental and Suprasegmental Speech Foundation Models for Classifying Cognitive Risk Factors: Evaluating Out-of-the-Box Performance.","authors":"Si-Ioi Ng, Lingfeng Xu, Kimberly D Mueller, Julie Liss, Visar Berisha","doi":"10.21437/interspeech.2024-2063","DOIUrl":"10.21437/interspeech.2024-2063","url":null,"abstract":"Speech foundation models are remarkably successful in various consumer applications, prompting their extension to clinical use-cases. This is challenged by small clinical datasets, which precludes effective fine-tuning. We tested the efficacy of two models to classify participants by segmental (Wav2Vec2.0) and suprasegmental (Trillsson) speech analysis windows. Analysis at both time scales has shown differences in the context of cognitive decline. Speakers were classified as healthy controls (HC), Amyloid-β+ (Aβ+), mild cognitive impairment (MCI), or dementia. A subset of W2V2 and Trillsson representations showed large effect size between HC and each risk factor. Cross-validation showed W2V2 consistently outperforms Trillsson. Mean macro-F1 of 54.1%, 63.5%, and 72.0% in were found for classifying Aβ+, MCI, and dementia from HC. Repeatability of Trillsson and W2V2 showed intraclass correlations of 0.30 and 0.41. Reliability of such models must be enhanced for clinical speech analysis and longitudinal tracking.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"2024 ","pages":"917-921"},"PeriodicalIF":0.0,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11884505/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143574965","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

How Does Alignment Error Affect Automated Pronunciation Scoring in Children's Speech?

Interspeech

Pub Date : 2024-09-01 DOI: 10.21437/interspeech.2024-2239

Prad Kadambi, Tristan Mahr, Lucas Annear, Henry Nomeland, Julie Liss, Katherine Hustad, Visar Berisha

Automated goodness of pronunciation scores measure deviation from typical adult speech by first phonetically segmenting speech using forced alignment and then computing phoneme likelihoods. Care must be taken to distinguish between the impact of alignment error (a spurious signal) and true acoustic deviation on the automated score. Using mixed effects modeling, we predict $Δ P L L R$ , the difference between pronunciation scores computed using manual alignment ( $P L L R_{m}$ ) versus computed using automatic forced alignments ( $P L L R_{a}$ ). Pronunciation deviations and alignment error are both magnified in children's speech and may be influenced by factors such as phoneme position and phoneme type. Our methodology shows that alignment error has a moderate effect on $Δ P L L R$ , and other variables have small to no effect. Manual $PLLR$ closely matches automatically calculated $PLLR$ following cross utterance averaging. Thus, practical comparisons between child speakers should be very comparable across the two methods.

{"title":"How Does Alignment Error Affect Automated Pronunciation Scoring in Children's Speech?","authors":"Prad Kadambi, Tristan Mahr, Lucas Annear, Henry Nomeland, Julie Liss, Katherine Hustad, Visar Berisha","doi":"10.21437/interspeech.2024-2239","DOIUrl":"https://doi.org/10.21437/interspeech.2024-2239","url":null,"abstract":"Automated goodness of pronunciation scores measure deviation from typical adult speech by first phonetically segmenting speech using forced alignment and then computing phoneme likelihoods. Care must be taken to distinguish between the impact of alignment error (a spurious signal) and true acoustic deviation on the automated score. Using mixed effects modeling, we predict <math><mi>Δ</mi> <mi>P</mi> <mi>L</mi> <mi>L</mi> <mi>R</mi></math> , the difference between pronunciation scores computed using manual alignment ( <math><mi>P</mi> <mi>L</mi> <mi>L</mi> <msub><mrow><mi>R</mi></mrow> <mrow><mi>m</mi></mrow> </msub> </math> ) versus computed using automatic forced alignments ( <math><mi>P</mi> <mi>L</mi> <mi>L</mi> <msub><mrow><mi>R</mi></mrow> <mrow><mi>a</mi></mrow> </msub> </math> ). Pronunciation deviations and alignment error are both magnified in children's speech and may be influenced by factors such as phoneme position and phoneme type. Our methodology shows that alignment error has a moderate effect on <math><mi>Δ</mi> <mi>P</mi> <mi>L</mi> <mi>L</mi> <mi>R</mi></math> , and other variables have small to no effect. Manual <math><mi>PLLR</mi></math> closely matches automatically calculated <math><mi>PLLR</mi></math> following cross utterance averaging. Thus, practical comparisons between child speakers should be very comparable across the two methods.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"2024 ","pages":"5133-5137"},"PeriodicalIF":0.0,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11977302/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143813014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Remote Assessment for ALS using Multimodal Dialog Agents: Data Quality, Feasibility and Task Compliance. 使用多模式对话代理对ALS进行远程评估：数据质量、可行性和任务符合性。

Interspeech

Pub Date : 2023-08-01 DOI: 10.21437/interspeech.2023-2115

Vanessa Richter, Michael Neumann, Jordan R Green, Brian Richburg, Oliver Roesler, Hardik Kothare, Vikram Ramanarayanan

We investigate the feasibility, task compliance and audiovisual data quality of a multimodal dialog-based solution for remote assessment of Amyotrophic Lateral Sclerosis (ALS). 53 people with ALS and 52 healthy controls interacted with Tina, a cloud-based conversational agent, in performing speech tasks designed to probe various aspects of motor speech function while their audio and video was recorded. We rated a total of 250 recordings for audio/video quality and participant task compliance, along with the relative frequency of different issues observed. We observed excellent compliance (98%) and audio (95.2%) and visual quality rates (84.8%), resulting in an overall yield of 80.8% recordings that were both compliant and of high quality. Furthermore, recording quality and compliance were not affected by level of speech severity and did not differ significantly across end devices. These findings support the utility of dialog systems for remote monitoring of speech in ALS.

我们研究了用于肌萎缩侧索硬化症（ALS）远程评估的基于多模式对话的解决方案的可行性、任务依从性和视听数据质量。53名ALS患者和52名健康对照者与Tina（一种基于云的对话代理）进行了互动，在录制他们的音频和视频时，他们执行了旨在探索运动言语功能各个方面的言语任务。我们对总共250段录音的音频/视频质量和参与者任务依从性进行了评级，以及观察到的不同问题的相对频率。我们观察到良好的依从性（98%）、音频（95.2%）和视觉质量率（84.8%），导致80.8%的录音符合要求且质量高。此外，录音质量和合规性不受语音严重程度的影响，在终端设备之间也没有显著差异。这些发现支持对话系统在ALS语音远程监测中的应用。

{"title":"Remote Assessment for ALS using Multimodal Dialog Agents: Data Quality, Feasibility and Task Compliance.","authors":"Vanessa Richter, Michael Neumann, Jordan R Green, Brian Richburg, Oliver Roesler, Hardik Kothare, Vikram Ramanarayanan","doi":"10.21437/interspeech.2023-2115","DOIUrl":"https://doi.org/10.21437/interspeech.2023-2115","url":null,"abstract":"We investigate the feasibility, task compliance and audiovisual data quality of a multimodal dialog-based solution for remote assessment of Amyotrophic Lateral Sclerosis (ALS). 53 people with ALS and 52 healthy controls interacted with Tina, a cloud-based conversational agent, in performing speech tasks designed to probe various aspects of motor speech function while their audio and video was recorded. We rated a total of 250 recordings for audio/video quality and participant task compliance, along with the relative frequency of different issues observed. We observed excellent compliance (98%) and audio (95.2%) and visual quality rates (84.8%), resulting in an overall yield of 80.8% recordings that were both compliant and of high quality. Furthermore, recording quality and compliance were not affected by level of speech severity and did not differ significantly across end devices. These findings support the utility of dialog systems for remote monitoring of speech in ALS.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"2023 ","pages":"5441-5445"},"PeriodicalIF":0.0,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10547018/pdf/nihms-1931217.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41174190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Pronunciation modeling of foreign words for Mandarin ASR by considering the effect of language transfer 考虑语言迁移影响的普通话ASR外来词语音建模

Interspeech

Pub Date : 2022-10-07 DOI: 10.21437/Interspeech.2014-353

Lei Wang, R. Tong

One of the challenges in automatic speech recognition is foreign words recognition. It is observed that a speaker's pronunciation of a foreign word is influenced by his native language knowledge, and such phenomenon is known as the effect of language transfer. This paper focuses on examining the phonetic effect of language transfer in automatic speech recognition. A set of lexical rules is proposed to convert an English word into Mandarin phonetic representation. In this way, a Mandarin lexicon can be augmented by including English words. Hence, the Mandarin ASR system becomes capable to recognize English words without retraining or re-estimation of the acoustic model parameters. Using the lexicon that derived from the proposed rules, the ASR performance of Mandarin English mixed speech is improved without harming the accuracy of Mandarin only speech. The proposed lexical rules are generalized and they can be directly applied to unseen English words.

语音自动识别的难点之一是外来词识别。人们观察到，说话者对外来词的发音受到其母语知识的影响，这种现象被称为语言迁移效应。本文主要研究语音自动识别中语言迁移的语音效应。提出了一套将英语单词转换为汉语语音表示的词汇规则。通过这种方式，汉语词汇可以通过加入英语单词来扩充。因此，普通话ASR系统无需重新训练或重新估计声学模型参数即可识别英语单词。在不影响纯普通话语音准确性的前提下，使用基于规则的词汇表，提高了普通话英语混合语音的ASR性能。所提出的词汇规则具有概括性，可以直接应用于未见过的英语单词。

引用次数: 3

Automatic Speaker Verification System for Dysarthria Patients 用于构音障碍患者的自动扬声器验证系统

Interspeech

Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-375

Shinimol Salim, S. Shahnawazuddin, Waquar Ahmad

Dysarthria is one of the most common speech communication disorder associate with a neurological damage that weakens the muscles necessary for speech. In this paper, we present our efforts towards developing an automatic speaker verification (ASV) system based on x -vectors for dysarthric speakers with varying speech intelligibility (low, medium and high). For that purpose, a baseline ASV system was trained on speech data from healthy speakers since there is severe scarcity of data from dysarthric speakers. To improve the performance with respect to dysarthric speakers, data augmentation based on duration modification is proposed in this study. Duration modification with several scaling factors was applied to healthy training speech. An ASV system was then trained on healthy speech augmented with its duration modified versions. It compen-sates for the substantial disparities in phone duration between normal and dysarthric speakers of varying speech intelligibilty. Experiment evaluations presented in this study show that proposed duration-modification-based data augmentation resulted in a relative improvement of 22% over the baseline. Further to that, a relative improvement of 26% was obtained in the case of speakers with high severity level of dysarthria.

构音障碍是最常见的言语交流障碍之一，与削弱言语所需肌肉的神经损伤有关。在本文中，我们致力于开发一个基于x向量的自动说话人验证（ASV）系统，用于不同语音清晰度（低、中、高）的构音障碍说话人。为此，基线ASV系统是根据健康说话者的语音数据进行训练的，因为严重缺乏构音障碍说话者的数据。为了提高构音障碍说话者的表现，本研究提出了基于持续时间修改的数据增强。将几个比例因子的时长修正应用于健康训练语音。然后对ASV系统进行了健康语音训练，并对其持续时间进行了修改。它弥补了不同语音清晰度的正常和构音障碍说话者之间电话持续时间的巨大差异。本研究中的实验评估表明，所提出的基于持续时间修改的数据增强比基线相对提高了22%。此外，在具有高严重程度构音障碍的说话者的情况下，获得了26%的相对改善。

{"title":"Automatic Speaker Verification System for Dysarthria Patients","authors":"Shinimol Salim, S. Shahnawazuddin, Waquar Ahmad","doi":"10.21437/interspeech.2022-375","DOIUrl":"https://doi.org/10.21437/interspeech.2022-375","url":null,"abstract":"Dysarthria is one of the most common speech communication disorder associate with a neurological damage that weakens the muscles necessary for speech. In this paper, we present our efforts towards developing an automatic speaker verification (ASV) system based on x -vectors for dysarthric speakers with varying speech intelligibility (low, medium and high). For that purpose, a baseline ASV system was trained on speech data from healthy speakers since there is severe scarcity of data from dysarthric speakers. To improve the performance with respect to dysarthric speakers, data augmentation based on duration modification is proposed in this study. Duration modification with several scaling factors was applied to healthy training speech. An ASV system was then trained on healthy speech augmented with its duration modified versions. It compen-sates for the substantial disparities in phone duration between normal and dysarthric speakers of varying speech intelligibilty. Experiment evaluations presented in this study show that proposed duration-modification-based data augmentation resulted in a relative improvement of 22% over the baseline. Further to that, a relative improvement of 26% was obtained in the case of speakers with high severity level of dysarthria.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"5070-5074"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44912875","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

Robust Cough Feature Extraction and Classification Method for COVID-19 Cough Detection Based on Vocalization Characteristics 基于发声特征的新型冠状病毒咳嗽检测鲁棒咳嗽特征提取与分类方法

Interspeech

Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10401

Xueshuai Zhang, Jiakun Shen, J. Zhou, Pengyuan Zhang, Yonghong Yan, Zhihua Huang, Yanfen Tang, Yu Wang, Fujie Zhang, Shenmin Zhang, Aijun Sun

A fast, efficient and accurate detection method of COVID-19 remains a critical challenge. Many cough-based COVID-19 detection researches have shown competitive results through artificial intelligence. However, the lack of analysis on vocalization characteristics of cough sounds limits the further improvement of detection performance. In this paper, we propose two novel acoustic features of cough sounds and a convolutional neural network structure for COVID-19 detection. First, a time-frequency differential feature is proposed to characterize dynamic information of cough sounds in time and frequency domain. Then, an energy ratio feature is proposed to calculate the energy difference caused by the phonation characteristics in different cough phases. Finally, a convolutional neural network with two parallel branches which is pre-trained on a large amount of unlabeled cough data is proposed for classification. Experiment results show that our proposed method achieves state-of-the-art performance on Coswara dataset for COVID-19 detection. The results on an external clinical dataset Virufy also show the better generalization ability of our proposed method. Copyright © 2022 ISCA.

一种快速、高效和准确的新冠肺炎检测方法仍然是一个关键的挑战。许多基于咳嗽的新冠肺炎检测研究通过人工智能显示了具有竞争力的结果。然而，由于缺乏对咳嗽声发声特征的分析，限制了检测性能的进一步提高。在本文中，我们提出了咳嗽声的两种新的声学特征和用于新冠肺炎检测的卷积神经网络结构。首先，提出了一种时频微分特征来表征咳嗽声在时域和频域中的动态信息。然后，提出了一种能量比特征来计算不同咳嗽阶段由发音特征引起的能量差异。最后，提出了一种在大量未标记咳嗽数据上预训练的具有两个并行分支的卷积神经网络进行分类。实验结果表明，我们提出的方法在用于新冠肺炎检测的Coswara数据集上实现了最先进的性能。在外部临床数据集Virify上的结果也表明了我们提出的方法更好的泛化能力。版权所有©2022 ISCA。

{"title":"Robust Cough Feature Extraction and Classification Method for COVID-19 Cough Detection Based on Vocalization Characteristics","authors":"Xueshuai Zhang, Jiakun Shen, J. Zhou, Pengyuan Zhang, Yonghong Yan, Zhihua Huang, Yanfen Tang, Yu Wang, Fujie Zhang, Shenmin Zhang, Aijun Sun","doi":"10.21437/interspeech.2022-10401","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10401","url":null,"abstract":"A fast, efficient and accurate detection method of COVID-19 remains a critical challenge. Many cough-based COVID-19 detection researches have shown competitive results through artificial intelligence. However, the lack of analysis on vocalization characteristics of cough sounds limits the further improvement of detection performance. In this paper, we propose two novel acoustic features of cough sounds and a convolutional neural network structure for COVID-19 detection. First, a time-frequency differential feature is proposed to characterize dynamic information of cough sounds in time and frequency domain. Then, an energy ratio feature is proposed to calculate the energy difference caused by the phonation characteristics in different cough phases. Finally, a convolutional neural network with two parallel branches which is pre-trained on a large amount of unlabeled cough data is proposed for classification. Experiment results show that our proposed method achieves state-of-the-art performance on Coswara dataset for COVID-19 detection. The results on an external clinical dataset Virufy also show the better generalization ability of our proposed method. Copyright © 2022 ISCA.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2168-2172"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45011547","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 2

Predicting Speech Intelligibility using the Spike Acativity Mutual Information Index 利用Spike Acactivity互信息指数预测语音可懂性

Interspeech

Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10488

F. Cardinale, W. Nogueira

The spike activity mutual information index (SAMII) is presented as a new intrusive objective metric to predict speech intelligibility. A target speech signal and speech-in-noise signal are processed by a state-of-the-art computational model of the peripheral auditory system. It simulates the neural activity in a population of auditory nerve fibers (ANFs), which are grouped into critical bands covering the speech frequency range. The mutual information between the neural activity of both signals is calculated using analysis windows of 20 ms. Then, the mutual information is averaged along these analysis windows to obtain SAMII. SAMII is also extended to binaural scenarios by calculating the index for the left ear, right ear, and both ears, choosing the best case for predicting intelligibility. SAMII was developed based on the first clarity prediction challenge training dataset and compared to the modified binaural short-time objective intelligibility (MBSTOI) as baseline. Scores are reported in root mean squared error (RMSE) between measured and predicted data using the clarity challenge test dataset. SAMII scored 35.16%, slightly better than the MBSTOI which obtained 36.52%. This work leads to the conclu-sion that SAMII is a reliable objective metric when “low-level” representations of the speech, such as spike activity, are used.

尖峰活动互信息指数（SAMII）是预测语音可懂度的一种新的侵入性客观指标。通过外周听觉系统的最先进的计算模型来处理目标语音信号和噪声中的语音信号。它模拟了一群听觉神经纤维（ANF）的神经活动，这些神经纤维被分为覆盖语音频率范围的关键频带。使用20ms的分析窗口来计算两个信号的神经活动之间的相互信息。然后，沿着这些分析窗口对相互信息进行平均以获得SAMII。SAMII还通过计算左耳、右耳和双耳的指数，选择预测可懂度的最佳情况，扩展到双耳场景。SAMII是基于第一个清晰度预测挑战训练数据集开发的，并与作为基线的改良双耳短时目标可懂度（MBSTOI）进行了比较。使用清晰度挑战测试数据集，以测量数据和预测数据之间的均方根误差（RMSE）报告分数。SAMII的得分为35.16%，略高于MBSTOI的得分36.52%。这项工作得出的结论是，当使用语音的“低级”表示（如尖峰活动）时，SAMII是一个可靠的客观指标。

{"title":"Predicting Speech Intelligibility using the Spike Acativity Mutual Information Index","authors":"F. Cardinale, W. Nogueira","doi":"10.21437/interspeech.2022-10488","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10488","url":null,"abstract":"The spike activity mutual information index (SAMII) is presented as a new intrusive objective metric to predict speech intelligibility. A target speech signal and speech-in-noise signal are processed by a state-of-the-art computational model of the peripheral auditory system. It simulates the neural activity in a population of auditory nerve fibers (ANFs), which are grouped into critical bands covering the speech frequency range. The mutual information between the neural activity of both signals is calculated using analysis windows of 20 ms. Then, the mutual information is averaged along these analysis windows to obtain SAMII. SAMII is also extended to binaural scenarios by calculating the index for the left ear, right ear, and both ears, choosing the best case for predicting intelligibility. SAMII was developed based on the first clarity prediction challenge training dataset and compared to the modified binaural short-time objective intelligibility (MBSTOI) as baseline. Scores are reported in root mean squared error (RMSE) between measured and predicted data using the clarity challenge test dataset. SAMII scored 35.16%, slightly better than the MBSTOI which obtained 36.52%. This work leads to the conclu-sion that SAMII is a reliable objective metric when “low-level” representations of the speech, such as spike activity, are used.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3503-3507"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45143239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

Cooperative Speech Separation With a Microphone Array and Asynchronous Wearable Devices 基于麦克风阵列和异步可穿戴设备的协同语音分离

Interspeech

Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-11025

R. Corey, Manan Mittal, Kanad Sarkar, A. Singer

We consider the problem of separating speech from several talkers in background noise using a fixed microphone array and a set of wearable devices. Wearable devices can provide reliable information about speech from their wearers, but they typically cannot be used directly for multichannel source separation due to network delay, sample rate offsets, and relative motion. Instead, the wearable microphone signals are used to compute the speech presence probability for each talker at each time-frequency index. Those parameters, which are robust against small sample rate offsets and relative motion, are used to track the second-order statistics of the speech sources and background noise. The fixed array then separates the speech signals using an adaptive linear time-varying multichannel Wiener filter. The proposed method is demonstrated using real-room recordings from three human talkers with binaural earbud microphones and an eight-microphone tabletop array. but are useful for distin-guishing between different sources because of their known positions relative to the talkers. The proposed system uses the wearable devices to estimate SPP values, which are then used to learn the second-order statistics for each source at the microphones of the fixed array. The array separates the sources using an adaptive linear time-varying spatial filter suitable for real-time applications. This work combines the cooperative ar-chitecture of [19], the distributed SPP method of [18], and the motion-robust modeling of [15]. The system is implemented adaptively and demonstrated using live human talkers.

我们考虑了在背景噪声中使用固定麦克风阵列和一组可穿戴设备从多个说话者中分离语音的问题。可穿戴设备可以提供有关其佩戴者的语音的可靠信息，但由于网络延迟、采样率偏移和相对运动，它们通常不能直接用于多通道源分离。相反，使用可穿戴麦克风信号来计算每个说话者在每个时频指数下的语音存在概率。这些参数对小采样率偏移和相对运动具有鲁棒性，用于跟踪语音源和背景噪声的二阶统计量。固定阵列然后使用自适应线性时变多通道维纳滤波器分离语音信号。该方法通过使用双耳耳塞麦克风和8个麦克风桌面阵列的三个人的真实房间录音进行了演示。但是对于区分不同的来源是有用的，因为它们相对于说话者的已知位置。该系统使用可穿戴设备来估计SPP值，然后使用SPP值来学习固定阵列麦克风处每个源的二阶统计量。该阵列使用适合于实时应用的自适应线性时变空间滤波器分离源。本工作结合[19]的协作架构、[18]的分布式SPP方法和[15]的运动鲁棒建模。该系统是自适应实现的，并使用真人说话者进行演示。

{"title":"Cooperative Speech Separation With a Microphone Array and Asynchronous Wearable Devices","authors":"R. Corey, Manan Mittal, Kanad Sarkar, A. Singer","doi":"10.21437/interspeech.2022-11025","DOIUrl":"https://doi.org/10.21437/interspeech.2022-11025","url":null,"abstract":"We consider the problem of separating speech from several talkers in background noise using a fixed microphone array and a set of wearable devices. Wearable devices can provide reliable information about speech from their wearers, but they typically cannot be used directly for multichannel source separation due to network delay, sample rate offsets, and relative motion. Instead, the wearable microphone signals are used to compute the speech presence probability for each talker at each time-frequency index. Those parameters, which are robust against small sample rate offsets and relative motion, are used to track the second-order statistics of the speech sources and background noise. The fixed array then separates the speech signals using an adaptive linear time-varying multichannel Wiener filter. The proposed method is demonstrated using real-room recordings from three human talkers with binaural earbud microphones and an eight-microphone tabletop array. but are useful for distin-guishing between different sources because of their known positions relative to the talkers. The proposed system uses the wearable devices to estimate SPP values, which are then used to learn the second-order statistics for each source at the microphones of the fixed array. The array separates the sources using an adaptive linear time-varying spatial filter suitable for real-time applications. This work combines the cooperative ar-chitecture of [19], the distributed SPP method of [18], and the motion-robust modeling of [15]. The system is implemented adaptively and demonstrated using live human talkers.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"5398-5402"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45171254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

An Automatic Soundtracking System for Text-to-Speech Audiobooks 一种用于文本到语音有声读物的自动声音跟踪系统

Interspeech

Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10236

Zikai Chen, Lin Wu, Junjie Pan, Xiang Yin

Background music (BGM) plays an essential role in audiobooks, which can enhance the immersive experience of audiences and help them better understand the story. However, welldesigned BGM still requires human effort in the text-to-speech (TTS) audiobook production, which is quite time-consuming and costly. In this paper, we introduce an automatic soundtracking system for TTS-based audiobooks. The proposed system divides the soundtracking process into three tasks: plot partition, plot classification, and music selection. The experiments shows that both our plot partition module and plot classification module outperform baselines by a large margin. Furthermore, TTS-based audiobooks produced with our proposed automatic soundtracking system achieves comparable performance to that produced with the human soundtracking system. To our best of knowledge, this is the first work of automatic soundtracking system for audiobooks. Demos are available on https: //acst1223.github.io/interspeech2022/main.

背景音乐在有声读物中发挥着至关重要的作用，它可以增强观众的沉浸式体验，帮助他们更好地理解故事。然而，设计良好的BGM仍然需要在文本到语音（TTS）有声读物的制作中付出人力，这是相当耗时和昂贵的。本文介绍了一种基于TTS的有声读物自动跟踪系统。该系统将声音跟踪过程分为三个任务：情节划分、情节分类和音乐选择。实验表明，我们的小区划分模块和小区分类模块都大大优于基线。此外，使用我们提出的自动声音跟踪系统制作的基于TTS的有声读物实现了与使用人类声音跟踪系统生产的有声书相当的性能。据我们所知，这是有声读物自动声音跟踪系统的第一部作品。演示可在https://acst1223.github.io/interseech2022/main上获得。

引用次数: 0

下一页尾页

类型

全部化学•材料生命科学医学物理工程技术环境•农林材料科学地球科学法学管理学化学环境科学与生态学计算机科学教育学经济学农林科学人文科学生物学数学物理与天体物理心理学综合性期刊其他工业工程理学历史学农学文学信息工程

数据库

全部 ACS Publications Elsevier ieeexplore Springer The Royal Society of Chemistry Wiley

期刊

Interspeech

全部 Acc. Chem. Res. ACS Applied Bio Materials ACS Appl. Electron. Mater. ACS Appl. Energy Mater. ACS Appl. Mater. Interfaces ACS Appl. Nano Mater. ACS Appl. Polym. Mater. ACS BIOMATER-SCI ENG ACS Catal. ACS Cent. Sci. ACS Chem. Biol. ACS Chemical Health & Safety ACS Chem. Neurosci. ACS Comb. Sci. ACS Earth Space Chem. ACS Energy Lett. ACS Infect. Dis. ACS Macro Lett. ACS Mater. Lett. ACS Med. Chem. Lett. ACS Nano ACS Omega ACS Photonics ACS Sens. ACS Sustainable Chem. Eng. ACS Synth. Biol. Anal. Chem. BIOCHEMISTRY-US Bioconjugate Chem. BIOMACROMOLECULES Chem. Res. Toxicol. Chem. Rev. Chem. Mater. CRYST GROWTH DES ENERG FUEL Environ. Sci. Technol. Environ. Sci. Technol. Lett. Eur. J. Inorg. Chem. IND ENG CHEM RES Inorg. Chem. J. Agric. Food. Chem. J. Chem. Eng. Data J. Chem. Educ. J. Chem. Inf. Model. J. Chem. Theory Comput. J. Med. Chem. J. Nat. Prod. J PROTEOME RES J. Am. Chem. Soc. LANGMUIR MACROMOLECULES Mol. Pharmaceutics Nano Lett. Org. Lett. ORG PROCESS RES DEV ORGANOMETALLICS J. Org. Chem. J. Phys. Chem. J. Phys. Chem. A J. Phys. Chem. B J. Phys. Chem. C J. Phys. Chem. Lett. Analyst Anal. Methods Biomater. Sci. Catal. Sci. Technol. Chem. Commun. Chem. Soc. Rev. CHEM EDUC RES PRACT CRYSTENGCOMM Dalton Trans. Energy Environ. Sci. ENVIRON SCI-NANO ENVIRON SCI-PROC IMP ENVIRON SCI-WAT RES Faraday Discuss. Food Funct. Green Chem. Inorg. Chem. Front. Integr. Biol. J. Anal. At. Spectrom. J. Mater. Chem. A J. Mater. Chem. B J. Mater. Chem. C Lab Chip Mater. Chem. Front. Mater. Horiz. MEDCHEMCOMM Metallomics Mol. Biosyst. Mol. Syst. Des. Eng. Nanoscale Nanoscale Horiz. Nat. Prod. Rep. New J. Chem. Org. Biomol. Chem. Org. Chem. Front. PHOTOCH PHOTOBIO SCI PCCP Polym. Chem.

﹀