
2021 International Conference on Speech Technology and Human-Computer Dialogue (SpeD): Latest Publications

Dysarthric Speech Augmentation Using Prosodic Transformation and Masking for Subword End-to-end ASR
Pub Date: 2021-10-13 DOI: 10.1109/sped53181.2021.9587372
M. Soleymanpour, Michael T. Johnson, J. Berry
End-to-end speech recognition systems are effective, but training an end-to-end model requires a large amount of data. For applications such as dysarthric speech recognition, sufficient data is not available. In this paper, we propose a specialized data augmentation approach to enhance the performance of an end-to-end dysarthric ASR system based on sub-word models. The proposed approach combines two methods: prosodic transformation and time-feature masking. Prosodic transformation modifies the speaking rate and pitch of normal speech to control prosodic characteristics such as loudness, intonation, and rhythm. Using time and feature masking, we apply a mask to the Mel Frequency Cepstral Coefficients (MFCC) for robustness-focused augmentation. Results show that augmenting normal speech with prosodic transformation plus masking decreases CER by 5.4% and WER by 5.6%, and the further addition of dysarthric speech masking decreases CER by 11.3% and WER by 11.4%.
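As an illustration of the masking half of such an approach, a minimal SpecAugment-style sketch over an MFCC matrix might look as follows; the mask counts, widths, and mean-fill choice are assumptions for the example, not the paper's settings.

```python
import numpy as np

def mask_mfcc(mfcc, num_time_masks=2, max_time_width=20,
              num_feat_masks=1, max_feat_width=4, rng=None):
    """Apply time and feature masking to an MFCC matrix of shape
    (num_coeffs, num_frames). Masked regions are set to the utterance mean."""
    rng = rng or np.random.default_rng()
    out = mfcc.copy()
    fill = out.mean()
    n_coeffs, n_frames = out.shape
    for _ in range(num_time_masks):          # mask random spans of frames
        w = int(rng.integers(0, max_time_width + 1))
        t0 = int(rng.integers(0, max(1, n_frames - w)))
        out[:, t0:t0 + w] = fill
    for _ in range(num_feat_masks):          # mask random coefficient bands
        w = int(rng.integers(0, max_feat_width + 1))
        f0 = int(rng.integers(0, max(1, n_coeffs - w)))
        out[f0:f0 + w, :] = fill
    return out
```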
Citations: 0
[SpeD 2021 Front cover]
Pub Date: 2021-10-13 DOI: 10.1109/sped53181.2021.9587348
{"title":"[SpeD 2021 Front cover]","authors":"","doi":"10.1109/sped53181.2021.9587348","DOIUrl":"https://doi.org/10.1109/sped53181.2021.9587348","url":null,"abstract":"","PeriodicalId":193702,"journal":{"name":"2021 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132980884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Word Embeddings for Romanian Language and Their Use for Synonyms Detection
Pub Date: 2021-10-13 DOI: 10.1109/sped53181.2021.9587432
M. Popescu, C. Rusu, L. Grama
The aim of this paper is to present some results on word embeddings for the Romanian language, based on the word2vec method. More concretely, we generate word embeddings of different lengths, using different preprocessing and training techniques. The embeddings are general purpose, and we use the Romanian-language version of Wikipedia as the corpus. We also evaluate the computational resources needed for the task. The embeddings are validated through experiments on synonym detection, using a new dataset created for this purpose. The code and the dataset are made publicly available. The results indicate that these types of embeddings can be used with summarization approaches.
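A minimal sketch of word2vec training and similarity-based synonym lookup of the kind described, using gensim; the toy corpus and all hyperparameters are placeholders, not the paper's configuration or its published dataset.

```python
from gensim.models import Word2Vec

# Toy corpus standing in for a tokenised Romanian Wikipedia dump.
sentences = [["limba", "română", "este", "o", "limbă", "romanică"],
             ["word2vec", "învață", "reprezentări", "de", "cuvinte"]]

# sg=1 selects skip-gram; vector_size controls embedding length.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1,
                 sg=1, epochs=20, workers=2)

# Synonym candidates are ranked by cosine similarity in embedding space.
print(model.wv.most_similar("limba", topn=3))
```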
Citations: 0
Towards Detection of Synthetic Utterances in Romanian Language Speech Forensics
Pub Date: 2021-10-13 DOI: 10.1109/sped53181.2021.9587393
Gheorghe Pop, D. Burileanu
The last decade has seen a huge wave of interest in the synthesis of human image and speech. Besides the enormous impact of synthetic voice on communication between humans and machines, the production of so-called “fake media” has entered the focus of the forensic audio and video communities. A large variety of techniques are now available to produce synthetic speech, from traditional concatenative speech production to multi-million-parameter speech and speaker models. Recent work in the field of artificial intelligence (AI) has shown that some synthetic speech generators are capable of fooling even state-of-the-art automatic speaker verification systems. AI seems to hold the key to successful speaker spoofing attacks, but also to their countermeasures. As a first step, this paper describes a data-centric method to detect the use of synthetically generated spoken digits in the Romanian language.
Citations: 1
Synthetic Speech Detection Using Neural Networks
Pub Date: 2021-10-13 DOI: 10.1109/sped53181.2021.9587406
Ricardo Reimao, Vassilios Tzerpos
Computer-generated speech has improved drastically due to advancements in voice synthesis using deep learning techniques. The latest speech synthesizers achieve such a high level of naturalness that humans have difficulty distinguishing real speech from computer-generated speech. These technologies allow any person to train a synthesizer with a target voice, creating a model that is able to reproduce someone’s voice with high fidelity. This technology can be used in legitimate commercial applications (e.g. call centres) as well as in criminal activities, such as the impersonation of someone’s voice. In this paper, we analyze how synthetic speech is generated and propose deep learning methodologies to detect such synthesized utterances. Using a large dataset containing both synthetic and real speech, we analyzed the performance of the latest deep learning models in the classification of such utterances. Our proposed model achieves up to 92.00% accuracy in detecting unseen synthetic speech, a significant improvement over human performance (65.7%).
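For illustration, a minimal CNN binary classifier over log-mel spectrogram patches, the kind of model such a detector might use; the architecture, input shape, and layer sizes are assumptions for the sketch, not the network evaluated in the paper.

```python
import torch
import torch.nn as nn

class SpoofDetector(nn.Module):
    """Toy CNN: input (batch, 1, n_mels, n_frames), output one logit
    per utterance for the class "synthetic"."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),   # pool to one value per channel
        )
        self.classifier = nn.Linear(32, 1)

    def forward(self, x):
        h = self.features(x).flatten(1)
        return self.classifier(h)

model = SpoofDetector()
logits = model(torch.randn(8, 1, 80, 200))  # 8 random spectrogram patches
```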
Citations: 2
Versatility and Population Diversity of Evolutionary Algorithms in Automated Circuit Sizing Applications
Pub Date: 2021-10-13 DOI: 10.1109/sped53181.2021.9587352
C. Vișan, Octavian Pascu, Marius Stanescu, H. Cucu, C. Diaconu, Andi Buzo, G. Pelz
In modern circuit design, highly specialized engineers use computer tools to increase their chance of finding the best configurations while decreasing development time. However, certain tasks, like circuit sizing, consist of trial-and-error processes that require the designer’s attention for a variable amount of time. The task duration is usually directly proportional to the complexity of the circuit. To minimize the R&D costs of the circuit, relieving the designer of repetitive tasks is essential. Thus, the trend of replacing manual circuit sizing with AI solutions is growing. In this context, we compare the five most promising Evolutionary Algorithms for circuit sizing automation. The focus of this paper is to assess the performance of the algorithms in terms of versatility and population diversity.
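A generic real-valued genetic algorithm sketch of the kind being compared; the operators (blend crossover, Gaussian mutation, elitism) and every parameter are illustrative assumptions and do not correspond to any of the five specific algorithms in the paper.

```python
import numpy as np

def evolve(fitness, bounds, pop_size=40, generations=100,
           mutation_sigma=0.1, elite=4, rng=None):
    """Maximize `fitness` over a box-constrained parameter space.
    bounds: array (dim, 2) of per-parameter lower/upper limits."""
    rng = rng or np.random.default_rng()
    lo, hi = bounds[:, 0], bounds[:, 1]
    pop = rng.uniform(lo, hi, size=(pop_size, len(bounds)))
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        order = np.argsort(scores)[::-1]          # best individuals first
        parents = pop[order[:pop_size // 2]]
        children = []
        while len(children) < pop_size - elite:
            a, b = parents[rng.integers(len(parents), size=2)]
            alpha = rng.uniform(size=len(bounds))
            child = alpha * a + (1 - alpha) * b   # blend crossover
            child += rng.normal(0, mutation_sigma, len(bounds)) * (hi - lo)
            children.append(np.clip(child, lo, hi))
        pop = np.vstack([pop[order[:elite]], children])
    return pop[np.argmax([fitness(ind) for ind in pop])]

# Example: maximize a toy two-parameter "sizing" surrogate.
best = evolve(lambda p: -((p[0] - 1.0)**2 + (p[1] + 2.0)**2),
              bounds=np.array([[-5.0, 5.0], [-5.0, 5.0]]))
```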
Citations: 5
Neural Networks for Automatic Environmental Sound Recognition
Pub Date: 2021-10-13 DOI: 10.1109/sped53181.2021.9587378
Svetlana Segarceanu, G. Suciu, I. Gavat
Environmental sound recognition is currently an important and valuable field for computer science, robotics, security, and environmental protection. The underlying methodology has evolved from methods characteristic of early speech applications to more specific approaches, and with the advent of the deep learning paradigm many attempts using these methods have arisen. The paper resumes our earlier research on the application of Feed-Forward Neural Networks by exploring several configurations, and introduces Convolutional Neural Networks into the investigation. The experiments consider three classes of forest-specific sounds, aiming to detect chainsaw, vehicle, and genuine forest sounds.
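A minimal feed-forward classifier for three sound classes, sketched in PyTorch; the layer sizes and the use of 13-dimensional averaged MFCC inputs are assumptions, not the configurations explored in the paper.

```python
import torch
import torch.nn as nn

# Feed-forward network mapping a per-recording feature vector to
# three logits (chainsaw, vehicle, forest background).
model = nn.Sequential(
    nn.Linear(13, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 3),              # one logit per sound class
)
features = torch.randn(16, 13)     # batch of 13-dim MFCC means
pred = model(features).argmax(dim=1)
```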
Citations: 2
Results on the MFCC extraction for improving audio capabilities of TIAGo service robot
Pub Date: 2021-10-13 DOI: 10.1109/sped53181.2021.9587416
Toma Telembici, L. Grama, Lorena Muscar, C. Rusu
The purpose of this paper is to obtain, through simulations, high correct classification rates for isolated audio event detection. To obtain the audio signals, we used a service robot named TIAGo that simulates scenarios from everyday life. Mel Frequency Cepstral Coefficient (MFCC) features are extracted from each audio signal and then classified with the k-Nearest Neighbors algorithm. To better analyze performance, six additional non-MFCC coefficients are extracted alongside the MFCCs. Both the number of neighbors in the k-Nearest Neighbors algorithm and the percentage of audio signals used for training versus testing are varied. Simulations also cover the distance metric: both the Euclidean and Manhattan distances are implemented. All these scenarios and their combinations are evaluated in this paper. The highest correct classification rate, 99.27%, is obtained with MFCC features using 70% of the input data for training, 5 neighbors, and the Euclidean metric.
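A sketch of the described protocol with scikit-learn's kNN, varying the distance metric; the random placeholder arrays stand in for the MFCC feature vectors and event labels extracted from the TIAGo recordings, and the split/neighbor settings mirror the values quoted above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X = np.random.randn(200, 13)            # placeholder MFCC feature vectors
y = np.random.randint(0, 4, size=200)   # placeholder audio-event labels

# 70% of the data for training, as in the best-performing setup.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7)

for metric in ("euclidean", "manhattan"):
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric).fit(X_tr, y_tr)
    print(metric, knn.score(X_te, y_te))  # correct classification rate
```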
Citations: 1
The MARA corpus: Expressivity in end-to-end TTS systems using synthesised speech data
Pub Date: 2021-10-13 DOI: 10.1109/sped53181.2021.9587438
Adriana Stan, Beáta Lőrincz, Maria Nutu, M. Giurgiu
This paper introduces the MARA corpus, a large expressive Romanian speech corpus containing over 11 hours of high-quality data recorded by a professional female speaker. The data is orthographically transcribed, manually segmented at the utterance level, and semi-automatically aligned at the phone level. The associated text is processed by a complete linguistic feature extractor composed of: text normalisation, phonetic transcription, syllabification, lexical stress assignment, lemma extraction, part-of-speech tagging, chunking, and dependency parsing. Using the MARA corpus, we evaluate the use of synthesised speech as training data in end-to-end speech synthesis systems. The synthesised data copies the original phone duration and F0 patterns of the most expressive utterances from MARA. Five systems with different sets of expressive data are trained. The objective and subjective results show that the low quality of the synthesised speech data is averaged out by the synthesis network, and that no statistically significant differences are found between the systems’ expressivity and naturalness evaluations.
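As an illustration of the prosodic analysis involved in copying F0 patterns, one might extract a contour with librosa's pYIN tracker; the file name and pitch range below are assumptions for the sketch, not details from the paper.

```python
import librosa

# Hypothetical utterance file; sr=None keeps the native sampling rate.
y, sr = librosa.load("mara_utterance.wav", sr=None)

# Frame-level F0 contour with voicing decisions.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr)
times = librosa.times_like(f0, sr=sr)  # timestamps for each F0 frame
```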
Citations: 2
Automatic Segmentation of Texts based on Stylistic Features
Pub Date: 2021-10-13 DOI: 10.1109/sped53181.2021.9587362
H. Teodorescu, Cecilia Bolea
We report on an automatic method and program for text structure discovery and subsequent segmentation of texts. The method, previously presented and herein enhanced, is based on stylistic features. The segmentation was applied to two autobiographical works; the results are compared and conclusions are derived. The method can be used as a tool in text generation, in editorial offices, and in literary analysis.
Citations: 0