
Interspeech: Latest Publications

Remote Assessment for ALS using Multimodal Dialog Agents: Data Quality, Feasibility and Task Compliance.
Pub Date : 2023-08-01 DOI: 10.21437/interspeech.2023-2115
Vanessa Richter, Michael Neumann, Jordan R Green, Brian Richburg, Oliver Roesler, Hardik Kothare, Vikram Ramanarayanan

We investigate the feasibility, task compliance, and audiovisual data quality of a multimodal dialog-based solution for remote assessment of Amyotrophic Lateral Sclerosis (ALS). Fifty-three people with ALS and 52 healthy controls interacted with Tina, a cloud-based conversational agent, performing speech tasks designed to probe various aspects of motor speech function while their audio and video were recorded. We rated a total of 250 recordings for audio/video quality and participant task compliance, along with the relative frequency of the different issues observed. We observed excellent rates of compliance (98%), audio quality (95.2%), and video quality (84.8%), resulting in an overall yield of 80.8% of recordings that were both compliant and of high quality. Furthermore, recording quality and compliance were not affected by speech severity level and did not differ significantly across end devices. These findings support the utility of dialog systems for remote monitoring of speech in ALS.

Citations: 0
Pronunciation modeling of foreign words for Mandarin ASR by considering the effect of language transfer
Pub Date : 2022-10-07 DOI: 10.21437/Interspeech.2014-353
Lei Wang, R. Tong
One of the challenges in automatic speech recognition is foreign word recognition. A speaker's pronunciation of a foreign word is influenced by their native-language knowledge, a phenomenon known as the effect of language transfer. This paper examines the phonetic effect of language transfer in automatic speech recognition. A set of lexical rules is proposed to convert an English word into a Mandarin phonetic representation, so that a Mandarin lexicon can be augmented with English words. The Mandarin ASR system thus becomes capable of recognizing English words without retraining or re-estimating the acoustic model parameters. Using a lexicon derived from the proposed rules, ASR performance on Mandarin-English mixed speech is improved without harming the accuracy on Mandarin-only speech. The proposed lexical rules are general and can be applied directly to unseen English words.
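As a rough illustration of the idea described above, the sketch below augments a Mandarin lexicon with English entries via a hand-written phone-substitution table. The rule table, the CMU-style input pronunciations, and the function names are hypothetical placeholders; the paper's actual lexical rules are not reproduced here.

```python
# Minimal sketch of rule-based lexicon augmentation, assuming a CMU-style
# English pronunciation dictionary. The mapping below is illustrative only.

# Hypothetical English-phone -> Mandarin-phone substitution rules.
EN_TO_MANDARIN = {
    "IY": "i", "UW": "u", "AA": "a", "EH": "ai",   # vowels (illustrative)
    "P": "p", "T": "t", "K": "k",
    "S": "s", "SH": "sh", "M": "m", "N": "n",
    "V": "w",   # Mandarin has no /v/; speakers often substitute /w/
    "TH": "s",  # /th/ commonly realized as /s/ under language transfer
}

def english_to_mandarin_phones(english_phones):
    """Convert an English phone sequence to a Mandarin phonetic representation."""
    mandarin = []
    for phone in english_phones:
        phone = phone.rstrip("012")          # drop lexical stress markers
        mandarin.append(EN_TO_MANDARIN.get(phone, phone))
    return mandarin

def augment_lexicon(mandarin_lexicon, english_lexicon):
    """Add English entries to a Mandarin lexicon without touching the acoustic model."""
    for word, phones in english_lexicon.items():
        mandarin_lexicon[word] = english_to_mandarin_phones(phones)
    return mandarin_lexicon

# Example: augmenting with the English word "seven".
lexicon = augment_lexicon({}, {"SEVEN": ["S", "EH1", "V", "AH0", "N"]})
print(lexicon)
```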
Citations: 3
Automatic Speaker Verification System for Dysarthria Patients
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-375
Shinimol Salim, S. Shahnawazuddin, Waquar Ahmad
Dysarthria is one of the most common speech communication disorders; it is associated with neurological damage that weakens the muscles necessary for speech. In this paper, we present our efforts towards developing an automatic speaker verification (ASV) system based on x-vectors for dysarthric speakers with varying speech intelligibility (low, medium, and high). Because data from dysarthric speakers are severely scarce, a baseline ASV system was first trained on speech data from healthy speakers. To improve performance for dysarthric speakers, this study proposes data augmentation based on duration modification: duration modification with several scaling factors was applied to the healthy training speech, and the ASV system was then trained on the healthy speech augmented with its duration-modified versions. This compensates for the substantial disparities in phone duration between normal speakers and dysarthric speakers of varying speech intelligibility. Experimental evaluations show that the proposed duration-modification-based data augmentation yields a relative improvement of 22% over the baseline, and a relative improvement of 26% for speakers with a high severity level of dysarthria.
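A minimal sketch of duration-modification-based augmentation is shown below, assuming librosa and soundfile are available; the scaling factors and file paths are illustrative assumptions, not the values used in the paper.

```python
import librosa
import soundfile as sf

SCALING_FACTORS = [0.8, 0.9, 1.1, 1.2]   # hypothetical duration scales

def augment_with_duration_modification(wav_path, out_prefix):
    """Write duration-modified copies of one healthy training utterance."""
    y, sr = librosa.load(wav_path, sr=None)
    for factor in SCALING_FACTORS:
        # time_stretch with rate = 1/factor lengthens the utterance by `factor`
        y_mod = librosa.effects.time_stretch(y, rate=1.0 / factor)
        sf.write(f"{out_prefix}_x{factor}.wav", y_mod, sr)

# Usage (paths are placeholders):
# augment_with_duration_modification("healthy_utt.wav", "augmented/healthy_utt")
```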
Citations: 1
Robust Cough Feature Extraction and Classification Method for COVID-19 Cough Detection Based on Vocalization Characteristics
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10401
Xueshuai Zhang, Jiakun Shen, J. Zhou, Pengyuan Zhang, Yonghong Yan, Zhihua Huang, Yanfen Tang, Yu Wang, Fujie Zhang, Shenmin Zhang, Aijun Sun
A fast, efficient, and accurate detection method for COVID-19 remains a critical challenge. Many cough-based COVID-19 detection studies have shown competitive results using artificial intelligence. However, the lack of analysis of the vocalization characteristics of cough sounds limits further improvement of detection performance. In this paper, we propose two novel acoustic features of cough sounds and a convolutional neural network structure for COVID-19 detection. First, a time-frequency differential feature is proposed to characterize the dynamic information of cough sounds in the time and frequency domains. Second, an energy ratio feature is proposed to capture the energy differences caused by the phonation characteristics of different cough phases. Finally, a convolutional neural network with two parallel branches, pre-trained on a large amount of unlabeled cough data, is proposed for classification. Experimental results show that our proposed method achieves state-of-the-art performance on the Coswara dataset for COVID-19 detection. Results on an external clinical dataset, Virufy, also show the good generalization ability of our proposed method.
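The sketch below illustrates, under stated assumptions, what a time-frequency differential feature and a cough-phase energy ratio could look like when computed with librosa; the exact feature definitions and the phase segmentation used in the paper may differ.

```python
import numpy as np
import librosa

def cough_features(y, sr):
    # Log-mel spectrogram of the cough recording
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
    log_mel = librosa.power_to_db(mel)

    # Time-frequency differential: first differences along time and frequency,
    # capturing dynamic information in both domains.
    diff_time = np.diff(log_mel, axis=1)
    diff_freq = np.diff(log_mel, axis=0)

    # Energy ratio between an assumed "burst" phase (first third of frames)
    # and the remaining frames -- a crude stand-in for cough-phase segmentation.
    energy = np.sum(mel, axis=0)
    split = len(energy) // 3
    energy_ratio = np.sum(energy[:split]) / (np.sum(energy[split:]) + 1e-8)

    return diff_time, diff_freq, energy_ratio
```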
Citations: 2
Predicting Speech Intelligibility using the Spike Activity Mutual Information Index
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10488
F. Cardinale, W. Nogueira
The spike activity mutual information index (SAMII) is presented as a new intrusive objective metric for predicting speech intelligibility. A target speech signal and a speech-in-noise signal are processed by a state-of-the-art computational model of the peripheral auditory system, which simulates the neural activity in a population of auditory nerve fibers (ANFs) grouped into critical bands covering the speech frequency range. The mutual information between the neural activity of the two signals is calculated in analysis windows of 20 ms, and the mutual information is then averaged over these windows to obtain SAMII. SAMII is also extended to binaural scenarios by calculating the index for the left ear, the right ear, and both ears, and choosing the best case for predicting intelligibility. SAMII was developed on the training dataset of the first Clarity Prediction Challenge and compared against the modified binaural short-time objective intelligibility (MBSTOI) baseline. Scores are reported as the root mean squared error (RMSE) between measured and predicted data on the Clarity Challenge test dataset. SAMII scored 35.16%, slightly better than MBSTOI at 36.52%. This work leads to the conclusion that SAMII is a reliable objective metric when “low-level” representations of speech, such as spike activity, are used.
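A schematic sketch of the windowed mutual-information computation is given below. It assumes spike-count matrices (fibers × time bins) are already available from a peripheral auditory model and uses scikit-learn's discrete mutual information; the grouping into critical bands and the binaural best-ear selection are omitted for brevity.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def samii_like_index(spikes_clean, spikes_noisy, bins_per_window):
    """Average mutual information between clean and noisy spike activity,
    computed over non-overlapping analysis windows (e.g. 20 ms each)."""
    n_windows = spikes_clean.shape[1] // bins_per_window
    mi_values = []
    for w in range(n_windows):
        sl = slice(w * bins_per_window, (w + 1) * bins_per_window)
        x = spikes_clean[:, sl].ravel()   # discrete spike counts in this window
        y = spikes_noisy[:, sl].ravel()
        mi_values.append(mutual_info_score(x, y))
    return float(np.mean(mi_values))

# Toy usage with random integer spike counts (50 fibers, 1000 time bins):
rng = np.random.default_rng(0)
clean = rng.integers(0, 3, size=(50, 1000))
noisy = np.clip(clean + rng.integers(-1, 2, size=clean.shape), 0, 3)
print(samii_like_index(clean, noisy, bins_per_window=20))
```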
Citations: 1
Cooperative Speech Separation With a Microphone Array and Asynchronous Wearable Devices
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-11025
R. Corey, Manan Mittal, Kanad Sarkar, A. Singer
We consider the problem of separating speech from several talkers in background noise using a fixed microphone array and a set of wearable devices. Wearable devices can provide reliable information about speech from their wearers, but they typically cannot be used directly for multichannel source separation due to network delay, sample rate offsets, and relative motion. Instead, the wearable microphone signals are used to compute the speech presence probability (SPP) for each talker at each time-frequency index. Those parameters, which are robust against small sample rate offsets and relative motion, are used to track the second-order statistics of the speech sources and background noise. The fixed array then separates the speech signals using an adaptive linear time-varying multichannel Wiener filter. The proposed method is demonstrated using real-room recordings from three human talkers with binaural earbud microphones and an eight-microphone tabletop array. The wearable devices are also useful for distinguishing between different sources because of their known positions relative to the talkers. The proposed system uses the wearable devices to estimate SPP values, which are then used to learn the second-order statistics for each source at the microphones of the fixed array. The array separates the sources using an adaptive linear time-varying spatial filter suitable for real-time applications. This work combines the cooperative architecture of [19], the distributed SPP method of [18], and the motion-robust modeling of [15]. The system is implemented adaptively and demonstrated using live human talkers.
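The sketch below shows, for a single frequency bin, how SPP-weighted recursive averaging of second-order statistics can drive a time-varying multichannel Wiener filter. The smoothing constant and reference-channel choice are illustrative assumptions; the actual system described above is more elaborate.

```python
import numpy as np

def mwf_separate(X, spp, ref=0, alpha=0.95):
    """X: (n_frames, n_mics) complex STFT values for one frequency bin;
    spp: (n_frames,) speech presence probabilities in [0, 1]."""
    n_frames, n_mics = X.shape
    R_s = np.eye(n_mics, dtype=complex) * 1e-6   # speech covariance
    R_n = np.eye(n_mics, dtype=complex) * 1e-6   # noise covariance
    out = np.zeros(n_frames, dtype=complex)
    e_ref = np.zeros(n_mics)
    e_ref[ref] = 1.0
    for t in range(n_frames):
        x = X[t][:, None]                         # column vector
        xx = x @ x.conj().T                       # instantaneous covariance
        # SPP-weighted recursive updates of the second-order statistics
        R_s = alpha * R_s + (1 - alpha) * spp[t] * xx
        R_n = alpha * R_n + (1 - alpha) * (1 - spp[t]) * xx
        # Multichannel Wiener filter for the reference microphone
        w = np.linalg.solve(R_s + R_n, R_s @ e_ref)
        out[t] = w.conj().T @ X[t]
    return out
```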
Citations: 0
An Automatic Soundtracking System for Text-to-Speech Audiobooks
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10236
Zikai Chen, Lin Wu, Junjie Pan, Xiang Yin
Background music (BGM) plays an essential role in audiobooks: it can enhance the immersive experience of the audience and help them better understand the story. However, well-designed BGM still requires human effort in text-to-speech (TTS) audiobook production, which is time-consuming and costly. In this paper, we introduce an automatic soundtracking system for TTS-based audiobooks. The proposed system divides the soundtracking process into three tasks: plot partition, plot classification, and music selection. Experiments show that both our plot partition module and our plot classification module outperform the baselines by a large margin. Furthermore, TTS-based audiobooks produced with the proposed automatic soundtracking system achieve performance comparable to those produced with human soundtracking. To the best of our knowledge, this is the first work on an automatic soundtracking system for audiobooks. Demos are available at https://acst1223.github.io/interspeech2022/main.
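A high-level sketch of the three-stage pipeline (plot partition, plot classification, music selection) might look as follows; every function body and the plot-label set are hypothetical placeholders standing in for the paper's learned modules.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    text: str
    plot_label: str = ""
    bgm_track: str = ""

def partition_plots(chapter_text):
    """Plot partition: split the chapter into plot-coherent segments.
    Here we naively split on blank lines as a stand-in for the learned model."""
    return [Segment(text=p) for p in chapter_text.split("\n\n") if p.strip()]

def classify_plot(segment):
    """Plot classification: assign a plot/mood label (placeholder heuristic)."""
    return "tense" if "!" in segment.text else "calm"

def select_music(plot_label, music_library):
    """Music selection: pick a BGM track matching the plot label."""
    return music_library.get(plot_label, "neutral_theme.wav")

def soundtrack_chapter(chapter_text, music_library):
    segments = partition_plots(chapter_text)
    for seg in segments:
        seg.plot_label = classify_plot(seg)
        seg.bgm_track = select_music(seg.plot_label, music_library)
    return segments

# Usage (paths and labels are placeholders):
# soundtrack_chapter(open("chapter1.txt").read(),
#                    {"tense": "storm.wav", "calm": "breeze.wav"})
```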
Citations: 0
Empirical Sampling from Latent Utterance-wise Evidence Model for Missing Data ASR based on Neural Encoder-Decoder Model
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-576
Ryu Takeda, Yui Sudo, K. Nakadai, Kazunori Komatani
Missing data automatic speech recognition (MD-ASR) can utilize the uncertainty of speech enhancement (SE) results without re-training of model parameters. Such uncertainty is represented by a probabilistic evidence model, whose design and expectation calculation are important. Two problems arise in applying the MD approach to utterance-wise ASR based on a neural encoder-decoder model: the high dimensionality of an utterance-wise evidence model, and the discontinuity among frames of the samples generated when approximating the expectation with the Monte-Carlo method. We propose new utterance-wise evidence models that use a latent variable, together with an empirical method for sampling from them. The space of our latent model is restricted by simpler conditional probability density functions (pdfs) given the latent variable, which enables us to generate samples from the low-dimensional space in a deterministic or stochastic way. Because the latent variable also acts as a common smoothing parameter among the simple pdfs, the generated samples are continuous across frames, which improves ASR performance compared with frame-wise models. The uncertainty from a neural SE front end is also used as a component in our mixture pdf models. Experiments showed that the character error rate of the enhanced speech was further improved by 2.5 points on average with our MD-ASR using a transformer model.
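The sketch below conveys the sampling idea in simplified form: one latent draw per utterance perturbs all frames jointly, so the sampled feature trajectories stay continuous in time. The Gaussian form, the broadcasting scheme, and the `asr_decode` placeholder are assumptions for illustration only, not the evidence model defined in the paper.

```python
import numpy as np

def sample_utterances(mu, sigma, n_samples, rng=None):
    """mu, sigma: (n_frames, n_dims) enhanced features and uncertainties
    from an SE front end. Returns (n_samples, n_frames, n_dims).

    A single latent draw z per sample scales the per-frame deviations, so the
    perturbation is shared across frames and the sampled trajectories stay
    continuous in time (unlike independent frame-wise sampling)."""
    rng = np.random.default_rng() if rng is None else rng
    n_frames, n_dims = mu.shape
    samples = np.empty((n_samples, n_frames, n_dims))
    for k in range(n_samples):
        z = rng.standard_normal(n_dims)          # one latent draw per utterance
        samples[k] = mu + sigma * z              # broadcast over frames
    return samples

# Monte-Carlo expectation of an ASR score over the sampled utterances
# (asr_decode is a placeholder for an encoder-decoder scoring function):
# scores = [asr_decode(s) for s in sample_utterances(mu, sigma, 8)]
# expected_score = np.mean(scores)
```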
Citations: 1
Isochronous is beautiful? Syllabic event detection in a neuro-inspired oscillatory model is facilitated by isochrony in speech
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10426
Mamady Nabe, J. Diard, J. Schwartz
Oscillation-based neuro-computational models of speech perception are grounded in the capacity of human brain oscillations to track the speech signal. Consequently, one would expect this tracking to be more efficient for more regular signals. In this paper, we address the question of the contribution of isochrony to event detection by neuro-computational models of speech perception. We consider a simple event-detection model proposed in the literature, based on oscillatory processes driven by the acoustic envelope, which was previously shown to efficiently detect syllabic events in various languages. We first evaluate its performance in detecting syllabic events in French and show that “perceptual centers” associated with vowel onsets are detected more robustly than syllable onsets. We then show that isochrony in natural speech improves the performance of event detection in the oscillatory model. We also evaluate the model's robustness to acoustic noise. Overall, these results show the importance of a bottom-up resonance mechanism for event detection; however, they suggest that bottom-up processing of the acoustic envelope cannot perfectly detect events relevant to speech temporal segmentation, highlighting the potential and complementary role of top-down, predictive knowledge.
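As a toy illustration of an envelope-driven oscillatory detector, the sketch below drives a damped oscillator with the amplitude envelope and picks peaks of its output as candidate syllabic events; the oscillator parameters and peak-picking settings are illustrative assumptions and not taken from the model evaluated in the paper.

```python
import numpy as np
from scipy.signal import hilbert, find_peaks

def detect_syllabic_events(y, sr, f0=5.0, damping=10.0):
    """Return candidate event times (s): peaks of a damped oscillator
    driven by the acoustic amplitude envelope."""
    envelope = np.abs(hilbert(y))                 # acoustic amplitude envelope
    omega = 2 * np.pi * f0                        # ~5 Hz syllabic rhythm
    dt = 1.0 / sr
    x, v = 0.0, 0.0
    out = np.empty_like(envelope)
    for n, e in enumerate(envelope):              # damped, driven oscillator (Euler step)
        a = e - 2 * damping * v - (omega ** 2) * x
        v += a * dt
        x += v * dt
        out[n] = x
    peaks, _ = find_peaks(out, distance=int(0.1 * sr))   # events >= 100 ms apart
    return peaks / sr
```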
Citations: 1
Mandarin Lombard Grid: a Lombard-grid-like corpus of Standard Chinese
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-854
Yuhong Yang, Xufeng Chen, Qingmu Liu, Weiping Tu, Hongyang Chen, Linjun Cai
{"title":"Mandarin Lombard Grid: a Lombard-grid-like corpus of Standard Chinese","authors":"Yuhong Yang, Xufeng Chen, Qingmu Liu, Weiping Tu, Hongyang Chen, Linjun Cai","doi":"10.21437/interspeech.2022-854","DOIUrl":"https://doi.org/10.21437/interspeech.2022-854","url":null,"abstract":"","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3078-3082"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45598645","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3