An end-to-end Tacotron model versus pre trained Tacotron model for Arabic text-to-speech synthesis

Journal of Engineering Research · IF 2.2 · Region 4 (Engineering Technology) · JCR Q3 (Engineering, Multidisciplinary) · Pub Date: 2025-03-01 · DOI: 10.1016/j.jer.2023.08.016

A.M. Mutawa

Journal of Engineering Research, vol. 13, no. 1, pp. 384-389. Full text: https://www.sciencedirect.com/science/article/pii/S2307187723001943


Abstract

Text-to-Speech (TTS) systems convert ordinary text into spoken language, which is important for accessibility and user interaction. Many of these systems generate speech from phonetic or phonemic transcriptions; an alternative is to concatenate pre-recorded units drawn from a database. Unit size varies from diphones to whole phrases. Although this method offers broad coverage, its output sometimes lacks clarity, and high-quality output can require storing whole words or phrases. Synthesizers can also generate voices by modeling human speech production and the vocal tract. Arabic is a difficult language for TTS development because of its complex morphology, semantic nuances, and many dialects. These dialects often differ substantially from standard Arabic and do not follow formal spelling rules. As a result, unedited traditional Arabic text often contains spelling and grammatical errors. In this study, we present and evaluate a Tacotron model built specifically for end-to-end Arabic TTS synthesis. The model uses the rich acoustic information in audio files, such as frequency and pitch, to produce natural-sounding speech that closely resembles human speech. We also compare its performance with that of a pre-trained Tacotron model applied to Arabic text. The comparison shows how well Arabic TTS systems work and where they could be improved.
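The concatenative approach mentioned in the abstract (joining pre-recorded units from a database) can be illustrated with a minimal Python sketch. This is not the paper's method: the unit labels, the sine-tone stand-ins for recordings, and the names `tone`, `concatenate_units`, `unit_db`, and `synthesize` are all hypothetical, and the linear cross-fade at each joint is just one simple smoothing choice.

```python
import math

def tone(freq_hz, n_samples, sample_rate=16000):
    """Hypothetical stand-in for a recorded unit: a short sine tone."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n_samples)]

def concatenate_units(units, overlap):
    """Join waveforms end to end, linearly cross-fading `overlap`
    samples at each joint. Every unit must be at least `overlap` long."""
    out = list(units[0])
    for unit in units[1:]:
        if overlap == 0:
            out.extend(unit)
            continue
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)  # fade-in weight for the new unit
            out[-overlap + i] = (1 - w) * out[-overlap + i] + w * unit[i]
        out.extend(unit[overlap:])
    return out

# Hypothetical unit database keyed by made-up diphone labels.
unit_db = {
    "m-a": tone(220, 800),
    "a-r": tone(330, 800),
    "r-#": tone(440, 800),
}

def synthesize(labels, db=unit_db, overlap=80):
    """Look up each unit by label and concatenate the waveforms."""
    return concatenate_units([db[label] for label in labels], overlap)

wave = synthesize(["m-a", "a-r", "r-#"])
```

Larger units (words or phrases) would reduce the number of joints, and with them the audible discontinuities, at the cost of a much larger database. This is the storage trade-off the abstract points to.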


Source journal: Journal of Engineering Research (Engineering, Multidisciplinary)

CiteScore: 1.60

Self-citation rate: 10.00%

Articles published: 181

Review time: 20 weeks

About the journal: Journal of Engineering Research (JER) is an international, peer-reviewed journal that publishes full-length original research papers, reviews, and case studies related to all areas of engineering, such as Civil, Mechanical, Industrial, Electrical, Computer, Chemical, Petroleum, Aerospace, Architectural, Biomedical, Coastal, Environmental, Marine & Ocean, Metallurgical & Materials, Software, Surveying, Systems, and Manufacturing Engineering. In particular, JER focuses on innovative approaches and methods that contribute to solving environmental and manufacturing problems that exist primarily in the Arabian Gulf region and the Middle East. Kuwait University previously published the "Kuwait Journal of Science and Engineering" (ISSN: 1024-8684), which had included science and engineering articles since 1974. In 2011, the decision was taken to split KJSE into two independent journals: "Journal of Engineering Research" (JER) and "Kuwait Journal of Science" (KJS).