D. Ungureanu, Madalina Badeanu, Gabriela-Catalina Marica, M. Dascalu, D. Tufis
2021 International Conference on Speech Technology and Human-Computer Dialogue (SpeD), published 2021-10-13
DOI: https://doi.org/10.1109/sped53181.2021.9587345
Establishing a Baseline of Romanian Speech-to-Text Models
With the increasing use of Natural Language Processing to facilitate interactions between humans and machines, automatic speech recognition systems have become increasingly popular owing to their utility in a wide range of applications. In this paper we explore well-known open-source speech-to-text engines, namely CMUSphinx, DeepSpeech, and Kaldi, to build a baseline of models for transcribing Romanian speech. These engines employ a range of underlying methods, from hidden Markov models to deep neural networks that also integrate language models, thus providing a solid baseline for comparison. Romanian is still a low-resource language, so six datasets of varying quality were merged to obtain 104 hours of speech. To further increase the size of the gathered corpora, our experiments consider data augmentation techniques, specifically SpecAugment, applied to the most promising model. Besides using existing corpora, we publicly release a dataset of 11.5 hours generated from governmental transcripts. The best-performing model uses the Kaldi architecture with a hybrid structure that includes a Deep Neural Network, and achieves a word error rate (WER) of 3.10% on the test partition.
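The abstract reports its headline result as a WER. As a reminder of what that metric measures, here is a minimal sketch of the standard word error rate: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the sample sentences are illustrative assumptions. The paper's own scoring pipeline and text normalization are not described in the abstract and may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper, not taken from the paper. Tokenization is a plain
    whitespace split; real evaluations usually normalize case and
    punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up Romanian sentences, not drawn from the paper's corpora.
    print(word_error_rate("ana are mere", "ana are pere"))
```

Under this definition, a WER of 3.10% corresponds to roughly three word-level errors per hundred reference words.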