Adriana Stan, Beáta Lőrincz, Maria Nutu, M. Giurgiu
2021 International Conference on Speech Technology and Human-Computer Dialogue (SpeD), published 2021-10-13. DOI: https://doi.org/10.1109/sped53181.2021.9587438
The MARA corpus: Expressivity in end-to-end TTS systems using synthesised speech data
This paper introduces the MARA corpus, a large expressive Romanian speech corpus containing over 11 hours of high-quality data recorded by a professional female speaker. The data is orthographically transcribed, manually segmented at utterance level and semi-automatically aligned at phone level. The associated text is processed by a complete linguistic feature extractor comprising text normalisation, phonetic transcription, syllabification, lexical stress assignment, lemma extraction, part-of-speech tagging, chunking and dependency parsing.

Using the MARA corpus, we evaluate the use of synthesised speech as training data in end-to-end speech synthesis systems. The synthesised data copies the original phone duration and F0 patterns of the most expressive utterances from MARA. Five systems with different sets of expressive data are trained. The objective and subjective results show that the low quality of the synthesised speech data is averaged out by the synthesis network, and that no statistically significant differences are found between the systems’ expressivity and naturalness evaluations.
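The abstract does not say which statistical test underlies the "no statistically significant differences" finding. As an illustration only, the sketch below shows one common way to compare two systems rated by the same listeners on the same utterances: a two-sided paired permutation (sign-flip) test. The function name `paired_permutation_test`, its parameters, and the choice of test are assumptions made for this example, not the authors' procedure.

```python
import random
from statistics import mean


def paired_permutation_test(scores_a, scores_b, n_resamples=10000, seed=0):
    """Two-sided sign-flip permutation test on paired listener ratings.

    Hypothetical helper for illustration; the paper does not specify its test.
    scores_a[i] and scores_b[i] must be ratings of the same item by the same
    listener for system A and system B respectively.
    Returns an estimated p-value for the null hypothesis of no difference.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired test needs equally long score lists")

    # Per-pair differences; under the null their signs are exchangeable.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))

    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    hits = 0
    for _ in range(n_resamples):
        # Randomly flip the sign of each difference and recompute the mean.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1

    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)
```

Under these assumptions, the function would be called once per pair of systems and per rating scale (expressivity, naturalness). A returned value above the chosen significance level would be consistent with the absence of a significant difference. With five systems there are ten pairwise comparisons, so a multiple-comparison correction would normally be applied as well.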