Mahta Fetrat Qharabagh, Zahra Dehghanian, Hamid R. Rabiee
arXiv - EE - Audio and Speech Processing, 2024-09-11. DOI: arxiv-2409.07259 (https://doi.org/arxiv-2409.07259)
ManaTTS Persian: a recipe for creating TTS datasets for lower resource languages
In this study, we introduce ManaTTS, the most extensive publicly accessible
single-speaker Persian corpus, and a comprehensive framework for collecting
transcribed speech datasets for the Persian language. ManaTTS, released under
the open CC-0 license, comprises approximately 86 hours of audio with a
sampling rate of 44.1 kHz. Alongside ManaTTS, we also built the
VirgoolInformal dataset, which spans over 5 hours of audio, to evaluate the
Persian speech recognition models used for forced alignment. The datasets are supported
by a fully transparent, MIT-licensed pipeline that includes tools for sentence
tokenization, bounded audio segmentation, and a novel forced alignment method.
This alignment technique is specifically designed for low-resource languages.
With this dataset, we trained a Tacotron2-based TTS model, achieving
a Mean Opinion Score (MOS) of 3.76, which is close to the MOS of 3.86 for
utterances generated by the same vocoder from natural spectrograms and to the
MOS of 4.01 for the natural waveform, demonstrating the quality and
effectiveness of the corpus.
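
The figures quoted in the abstract can be worked through as simple arithmetic.
The Python sketch below is illustrative only: the constant and function names
are hypothetical and not part of the ManaTTS pipeline, and the sample count
assumes exactly 86 hours of audio counted per channel, whereas the abstract
gives the duration only approximately.

```python
# Figures quoted in the abstract.
MOS_TTS = 3.76        # Tacotron2-based model trained on ManaTTS
MOS_VOCODED = 3.86    # same vocoder driven by natural spectrograms
MOS_NATURAL = 4.01    # natural waveform
HOURS = 86            # approximate corpus duration (assumed exact here)
SAMPLE_RATE_HZ = 44_100


def mos_gap(reference: float, system: float) -> float:
    """Absolute MOS difference between a reference and a system, to 2 decimals."""
    return round(reference - system, 2)


def relative_mos(reference: float, system: float) -> float:
    """System MOS as a percentage of the reference MOS, to 1 decimal."""
    return round(100 * system / reference, 1)


def total_samples(hours: float, sample_rate_hz: int) -> int:
    """Number of audio samples per channel for the given duration."""
    return int(hours * 3600 * sample_rate_hz)


if __name__ == "__main__":
    print("gap to vocoded natural spectrogram:", mos_gap(MOS_VOCODED, MOS_TTS))
    print("gap to natural waveform:", mos_gap(MOS_NATURAL, MOS_TTS))
    print("percent of natural-waveform MOS:", relative_mos(MOS_NATURAL, MOS_TTS))
    print("samples per channel:", total_samples(HOURS, SAMPLE_RATE_HZ))
```

Under these assumptions, the trained model sits 0.10 MOS points below the
vocoded natural spectrogram and 0.25 points below the natural waveform.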