
Latest publications from the 2021 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)

MARS: the First Romanian Pollen Dataset using a Rapid-E Particle Analyzer
Pub Date : 2021-10-13 DOI: 10.1109/sped53181.2021.9587447
M. Boldeanu, C. Marin, D. Ene, L. Mărmureanu, H. Cucu, C. Burileanu
Pollen allergies are a growing concern for human health, which is why automated pollen monitoring is becoming an important area of research. Machine learning approaches show great promise for tackling this issue, but these algorithms need large training data sets to perform well. This study introduces a new pollen data set, obtained using a Rapid-E particle analyzer, that is representative of the flora of Romania. Pollen from thirteen species present in Romania was used in developing this database, with over 100 thousand samples measured. Using a convolutional neural network, our study shows pollen classification performance on the newly introduced data set that is similar to or above that of humans.
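To make the classification step concrete, here is a minimal sketch of a 1D convolutional classifier over multi-channel particle signals of the kind a Rapid-E analyzer records. The channel count, signal length, and layer sizes are illustrative assumptions rather than the network used in the paper; only the 13-class output reflects the abstract.

```python
import torch
import torch.nn as nn

class PollenCNN(nn.Module):
    """Small 1D CNN over multi-channel scattering/fluorescence signals (illustrative only)."""
    def __init__(self, in_channels=4, num_classes=13):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),           # collapse the time axis to one vector per sample
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):                       # x: (batch, channels, signal_length)
        return self.classifier(self.features(x).squeeze(-1))

model = PollenCNN()
dummy_batch = torch.randn(8, 4, 256)            # hypothetical batch of Rapid-E-style signals
logits = model(dummy_batch)                     # (8, 13) scores over the thirteen pollen species
```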
Citations: 2
Establishing a Baseline of Romanian Speech-to-Text Models
Pub Date : 2021-10-13 DOI: 10.1109/sped53181.2021.9587345
D. Ungureanu, Madalina Badeanu, Gabriela-Catalina Marica, M. Dascalu, D. Tufis
With the increasing usage of Natural Language Processing to facilitate interactions between humans and machines, automatic speech recognition systems have become increasingly popular as a result of their utility in a wide range of applications. In this paper, we explore well-known open-source speech-to-text engines, namely CMUSphinx, DeepSpeech, and Kaldi, to build a baseline of models to transcribe Romanian speech. These engines employ various underlying methods, from hidden Markov models to deep neural networks, and also integrate language models, thus providing a solid baseline for comparison. Unfortunately, Romanian is still a low-resource language, and six datasets of varying quality were merged to obtain 104 hours of speech. To further increase the size of the gathered corpora, our experiments consider data augmentation techniques, specifically SpecAugment, applied to the most promising model. Besides using existing corpora, we publicly release a dataset of 11.5 hours generated from governmental transcripts. The best-performing model, obtained with the Kaldi architecture, uses a hybrid structure with a Deep Neural Network and achieves a WER of 3.10% on the test partition.
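As a concrete illustration of the SpecAugment-style augmentation mentioned in the abstract, the sketch below masks random frequency and time bands of a mel spectrogram using torchaudio. The 16 kHz sample rate, 80 mel bands, and mask widths are assumptions for illustration, not the paper's configuration.

```python
import torch
import torchaudio

# Hypothetical one-second mono waveform; in practice it would come from torchaudio.load(path).
waveform = torch.randn(1, 16000)

mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=80)(waveform)

# SpecAugment-style masking: zero out a random frequency band and a random time band.
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=15)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=35)

augmented = time_mask(freq_mask(mel))   # same shape as `mel`, with masked regions set to zero
```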
Citations: 0
Influence of Silence and Noise Filtering on Speech Quality Monitoring
Pub Date : 2021-10-13 DOI: 10.1109/sped53181.2021.9587364
R. Jaiswal
With the exponential increase of mobile users and internet subscribers, the use of voice over internet protocol (VoIP) applications is increasing dramatically. People rely on different VoIP applications for effective communication, for example Google Meet, Microsoft Skype, and Zoom video conferencing. Single-ended speech quality metrics are employed for measuring and monitoring the quality of speech. However, different types of degradations present in the surroundings distort the quality of speech. In order to meet the desired quality of experience (QoE) level of end users of VoIP applications, it is necessary to reduce these degradations and obtain optimized speech quality. Along that line, this paper investigates combining silence and noise filtering as a pre-processing block with a single-ended speech quality metric under various commonly occurring degradations encountered during VoIP communication. This can help internet service providers understand the potential root causes of decreased speech quality and then apply QoE management services to reach the desired QoE level. Results demonstrate that applying this joint pre-processing to speech samples under various VoIP degradations improves the quality of speech to a great extent.
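A minimal sketch of the silence-filtering idea follows, assuming a simple frame-level RMS energy gate; the frame length and threshold are illustrative parameters, not the pre-processing block evaluated in the paper.

```python
import numpy as np

def trim_silence(signal, sr, frame_ms=20, threshold_db=-35.0):
    """Drop frames whose RMS energy is below a dB threshold relative to the loudest frame."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1) + 1e-12)
    rms_db = 20 * np.log10(rms / (rms.max() + 1e-12))
    return frames[rms_db > threshold_db].reshape(-1)

# Example: one second of silence followed by one second of noise-like "speech".
sr = 16000
sig = np.concatenate([np.zeros(sr), 0.1 * np.random.randn(sr)])
voiced = trim_silence(sig, sr)      # roughly the second half of the signal survives
```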
Citations: 2
[Copyright notice]
Pub Date : 2021-10-13 DOI: 10.1109/sped53181.2021.9587394
{"title":"[Copyright notice]","authors":"","doi":"10.1109/sped53181.2021.9587394","DOIUrl":"https://doi.org/10.1109/sped53181.2021.9587394","url":null,"abstract":"","PeriodicalId":193702,"journal":{"name":"2021 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131846871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Romanian printed language, statistical independence and the type II statistical error
Pub Date : 2021-10-13 DOI: 10.1109/sped53181.2021.9587397
Alexandru Dinu, A. Vlad
The paper revisits the notion of statistical independence for printed Romanian, with a focus on the case in which the language is considered as a chain of words. The analysis is carried out on a literary corpus of approx. 6 million words. We aim to improve the perception of the concept of statistical independence for natural texts and to use this concept to evaluate the numerical properties of the printed language. One main objective is to correlate the statistical independence results with the type II statistical error and, at the same time, to expand and circle back on the authors' previous results. We investigated different scenarios in relation to the different word clusters involved in the process of creating Artificial Words (corroborated with the statistical independence evaluation). The Artificial Words consist of groups of low-probability words (based on previous findings on the type II statistical error in word probability investigation), and the results support the d = 100 minimum statistical independence distance.
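For readers less familiar with independence testing on word chains, the sketch below runs a chi-square test of independence on a 2x2 contingency table of co-occurrences of two words at distance d. It only illustrates the general idea; it is not the authors' procedure, corpus, or Artificial Word construction.

```python
from collections import Counter
from scipy.stats import chi2_contingency

def independence_test(words, w1, w2, d):
    """Chi-square test of independence for observing w1 followed by w2 at distance d."""
    pairs = list(zip(words[:-d], words[d:]))
    counts = Counter((a == w1, b == w2) for a, b in pairs)
    table = [[counts[(True, True)], counts[(True, False)]],
             [counts[(False, True)], counts[(False, False)]]]
    chi2, p_value, dof, _ = chi2_contingency(table)
    return chi2, p_value

toy_corpus = "the cat sat on the mat and the cat slept on the mat".split()
print(independence_test(toy_corpus, "the", "cat", 1))   # (chi-square statistic, p-value) for the toy chain
```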
Citations: 0
Comparison in Suprasegmental Characteristics between Typical and Dysarthric Talkers at Varying Severity Levels
Pub Date : 2021-10-13 DOI: 10.1109/sped53181.2021.9587369
M. Soleymanpour, Michael T. Johnson, J. Berry
Dysarthria is a speech disorder often characterized by slow speech with reduced intelligibility. This preliminary study investigates suprasegmental characteristics of typical and dysarthric speakers at varying severity levels, with the long-term goal of improving methods for dysarthric speech synthesis/augmentation and enhancement. First, we aim to analyze phoneme, speaking rate, and pause characteristics of typical and dysarthric speech using the phoneme- and word-level alignment information extracted by the Montreal Forced Aligner (MFA). Then, pitch and intensity declination trend and range analyses are conducted, with pitch and intensity declination measured by fitting a regression line. These analyses are conducted on dysarthric speech in TORGO, which contains 8 dysarthric speakers with cerebral palsy or amyotrophic lateral sclerosis and 7 age- and gender-matched typical speakers. These results are important for the development of dysarthric speech synthesis and augmentation, to statistically model and evaluate characteristics such as pause, speaking rate, pitch, and intensity.
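The declination measurement can be illustrated by fitting a regression line to an F0 contour; the slope approximates pitch declination in Hz per second. The sketch below synthesizes a signal with a slowly falling fundamental as a stand-in for a real TORGO utterance and uses librosa's pYIN tracker; the tracker settings are assumptions, not the paper's configuration.

```python
import numpy as np
import librosa

sr = 16000
t = np.linspace(0, 2.0, 2 * sr, endpoint=False)
# Synthetic voiced signal whose fundamental drifts from 220 Hz down to 180 Hz,
# standing in for a real utterance (which would instead be loaded from file).
inst_freq = np.linspace(220, 180, t.size)
y = np.sin(2 * np.pi * np.cumsum(inst_freq) / sr).astype(np.float32)

# F0 tracking with probabilistic YIN; unvoiced frames are returned as NaN.
f0, voiced_flag, _ = librosa.pyin(y, fmin=80, fmax=400, sr=sr)

# Fit a regression line to the voiced F0 values: the slope approximates pitch declination.
times = librosa.times_like(f0, sr=sr)
mask = ~np.isnan(f0)
slope, intercept = np.polyfit(times[mask], f0[mask], 1)
print(f"pitch declination ~ {slope:.1f} Hz/s, F0 range ~ {np.ptp(f0[mask]):.1f} Hz")
```

The same regression-line idea applies to frame-level intensity values for intensity declination.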
Citations: 1
Mispronunciation Detection and Diagnosis for Mandarin Accented English Speech
Pub Date : 2021-10-13 DOI: 10.1109/sped53181.2021.9587408
Subash Khanal, Michael T. Johnson, M. Soleymanpour, Narjes Bozorg
This paper presents a Mispronunciation Detection and Diagnosis (MDD) system based on a range of Automatic Speech Recognition (ASR) models and feature types. The goals of this research are to assess the ability of speech recognition systems to detect and diagnose the common pronunciation errors seen in non-native (L2) speakers of English and to assess the contribution of the information offered by Electromagnetic Articulography (EMA) data in improving the performance of such MDD systems. To evaluate the ability of the ASR systems to detect and diagnose pronunciation errors, the recognized sequences of phonemes generated by the ASR models were aligned with human-labeled phonetic transcripts as well as with the original phonetic prompts. This three-way alignment determined the MDD-related metrics of the ASR system. System architectures included GMM-HMM, DNN, and RNN based ASR engines for the MDD system. Articulatory features derived from the Electromagnetic Articulography corpus of Mandarin-Accented English (EMA-MAE) were utilized along with acoustic features to compare the performance of MDD systems. The best-performing system, using a combination of acoustic and articulatory features, had an accuracy of 82.4%, a diagnostic accuracy of 75.8%, and a false rejection rate of 17.2%.
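The alignment that underlies such MDD metrics can be sketched as a Levenshtein alignment between the canonical phoneme prompt and the recognized sequence, with mismatched pairs flagged as candidate mispronunciations. The example phones below are hypothetical, and the paper's full three-way alignment also brings in the human-labeled transcript.

```python
def align(ref, hyp):
    """Levenshtein-align two phone sequences; None marks an insertion or deletion."""
    n, m = len(ref), len(hyp)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    pairs, i, j = [], n, m                      # backtrace the optimal path
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            pairs.append((ref[i - 1], hyp[j - 1])); i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            pairs.append((ref[i - 1], None)); i -= 1
        else:
            pairs.append((None, hyp[j - 1])); j -= 1
    return pairs[::-1]

canonical = ["th", "ih", "s", "ih", "z"]        # hypothetical prompt phones
recognized = ["s", "ih", "s", "iy", "z"]        # hypothetical ASR output
flagged = [(r, h) for r, h in align(canonical, recognized) if r != h]
print(flagged)                                  # substitutions flagged as candidate mispronunciations
```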
Citations: 0
Human-Machine Interaction Speech Corpus from the ROBIN project
Pub Date : 2021-10-13 DOI: 10.1109/SpeD53181.2021.9587355
V. Pais, Radu Ion, Andrei-Marius Avram, Elena Irimia, V. Mititelu, Maria Mitrofan
This paper introduces a new Romanian speech corpus from the ROBIN project, called the ROBIN Technical Acquisition Speech Corpus (ROBINTASC). Its main purpose was to improve the behaviour of a conversational agent, allowing human-machine interaction in the context of purchasing technical equipment. The paper contains a detailed description of the acquisition process and corpus statistics, as well as an evaluation of the corpus's influence on a low-latency ASR system and on a dialogue component.
Citations: 4
An analysis of the data efficiency in Tacotron2 speech synthesis system
Pub Date : 2021-10-13 DOI: 10.1109/sped53181.2021.9587411
G. Săracu, Adriana Stan
This paper introduces an evaluation of the amount of data required by the Tacotron2 speech synthesis model in order to achieve good-quality output synthesis. We evaluate the capability of the model to adapt to new speakers in very limited data scenarios. We use three Romanian speakers for whom we gathered at most 5 minutes of speech, and use this data to fine-tune a large pre-trained model over a few training epochs. We assess the performance of the system by evaluating intelligibility, naturalness, and speaker similarity measures, as well as by analyzing the trade-off between speech quality and overfitting of the network. The results show that the Tacotron2 network can replicate the identity of a speaker from as little as one speech sample. It also inherently learns individual grapheme representations, such that if the training data is carefully selected to cover all the common graphemes in the language, the adaptation data requirements can be significantly lowered.
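A minimal sketch of the few-epoch, low-learning-rate adaptation described above is given below. A tiny stand-in sequence-to-spectrogram model and synthetic tensors replace the pretrained Tacotron2 checkpoint and the 5-minute speaker set; every size and hyperparameter here is an assumption for illustration.

```python
import torch
import torch.nn as nn

class TinyTTS(nn.Module):
    """Toy text-to-mel model standing in for a pretrained Tacotron2 checkpoint."""
    def __init__(self, vocab=40, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab, 128)
        self.rnn = nn.GRU(128, 256, batch_first=True)
        self.proj = nn.Linear(256, n_mels)

    def forward(self, tokens):                   # tokens: (batch, seq_len)
        hidden, _ = self.rnn(self.embed(tokens))
        return self.proj(hidden)                 # (batch, seq_len, n_mels)

model = TinyTTS()                                # would be the pretrained model in practice
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # small LR to limit overfitting
criterion = nn.MSELoss()

# Synthetic "adaptation set" standing in for a few minutes of target-speaker data.
tokens = torch.randint(0, 40, (4, 50))
target_mels = torch.randn(4, 50, 80)

model.train()
for epoch in range(5):                           # only a few fine-tuning epochs
    optimizer.zero_grad()
    loss = criterion(model(tokens), target_mels)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
```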
Citations: 6
Dunstan Baby Language Classification with CNN
Pub Date : 2021-10-13 DOI: 10.1109/sped53181.2021.9587374
Costin Andrei Bratan, Mirela Gheorghe, Ioan Ispas, E. Franti, M. Dascalu, S. Stoicescu, Ioana Rosca, Florentina Gherghiceanu, Doina Dumitrache, L. Nastase
Several methods have been reported in the scientific literature for classifying infant cries in order to automatically detect the need behind the tears and help parents and caretakers. Within the same scope, this paper takes an original approach in which the sounds that precede the cry are used. Such sounds can be considered primitive words and are classified according to the “Dunstan Baby Language”. The paper verifies the universal baby language hypothesis, starting from the research reported in a previous article. A CNN architecture trained with recordings of babies from Australia was used for classifying audio material coming from Romanian babies, as an attempt to see what happens when the participants belong to a different cultural landscape. The database of sounds made by Romanian babies was labelled separately by doctors in the maternity hospitals and by two Dunstan experts. Finally, the results of the CNN automatic classification were compared to those obtained by the Dunstan coaches. The conclusions confirmed that the Dunstan language is universal.
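The kind of pipeline implied by the abstract can be sketched as converting a recording to a log-mel spectrogram and classifying it with a small 2D CNN. The architecture, the sample rate, and the assumption of five Dunstan sound categories are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64)
to_db = torchaudio.transforms.AmplitudeToDB()

classifier = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 5),                         # assumed five Dunstan sound categories
)

waveform = torch.randn(1, 16000)              # hypothetical one-second recording at 16 kHz
spec = to_db(mel(waveform)).unsqueeze(0)      # (batch=1, channel=1, n_mels, frames)
logits = classifier(spec)                     # (1, 5) scores over the sound categories
```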
Citations: 2