利用动态深度学习规划文本到语音合成模型和数据集的开发

IF 6.1 2区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Journal of King Saud University-Computer and Information Sciences Pub Date : 2024-09-01 Epub Date: 2024-07-26 DOI:10.1016/j.jksuci.2024.102131

Hawraz A. Ahmad , Tarik A. Rashid

{"title":"利用动态深度学习规划文本到语音合成模型和数据集的开发","authors":"Hawraz A. Ahmad , Tarik A. Rashid","doi":"10.1016/j.jksuci.2024.102131","DOIUrl":null,"url":null,"abstract":"<div><p>Synthesis of Text-to-speech (TTS) is a process that involves translating a natural language text into a speech. Speech synthesisers face a major challenge when recognizing the prosodic elements of written text, such as intonation (the rise and fall of the voice in speaking), and length. In contrast, continuous speech features are influenced by the personality and emotions of the artist. A database is maintained to store the synthesized speech pieces. Its output is determined by how similar the person utters the words and how capable they are of being implied. In the past few years, the field of text-to-speech synthesis has been heavily impacted by the emergence of deep learning, an AI technology that has gained widespread popularity. This review paper presents a taxonomy of models and architectures that are based on deep learning and discusses the various datasets that are utilised in the TTS process. It also covers the evaluation matrices that are commonly used. The paper ends with a look at the future directions of the system and reaches to some Deep learning models that give promising results in this field.</p></div>","PeriodicalId":48547,"journal":{"name":"Journal of King Saud University-Computer and Information Sciences","volume":"36 7","pages":"Article 102131"},"PeriodicalIF":6.1000,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S1319157824002209/pdfft?md5=73c94f11cbc25ec7eb6841c1af93654a&pid=1-s2.0-S1319157824002209-main.pdf","citationCount":"0","resultStr":"{\"title\":\"Planning the development of text-to-speech synthesis models and datasets with dynamic deep learning\",\"authors\":\"Hawraz A. Ahmad , Tarik A. Rashid\",\"doi\":\"10.1016/j.jksuci.2024.102131\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Synthesis of Text-to-speech (TTS) is a process that involves translating a natural language text into a speech. Speech synthesisers face a major challenge when recognizing the prosodic elements of written text, such as intonation (the rise and fall of the voice in speaking), and length. In contrast, continuous speech features are influenced by the personality and emotions of the artist. A database is maintained to store the synthesized speech pieces. Its output is determined by how similar the person utters the words and how capable they are of being implied. In the past few years, the field of text-to-speech synthesis has been heavily impacted by the emergence of deep learning, an AI technology that has gained widespread popularity. This review paper presents a taxonomy of models and architectures that are based on deep learning and discusses the various datasets that are utilised in the TTS process. It also covers the evaluation matrices that are commonly used. The paper ends with a look at the future directions of the system and reaches to some Deep learning models that give promising results in this field.</p></div>\",\"PeriodicalId\":48547,\"journal\":{\"name\":\"Journal of King Saud University-Computer and Information Sciences\",\"volume\":\"36 7\",\"pages\":\"Article 102131\"},\"PeriodicalIF\":6.1000,\"publicationDate\":\"2024-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S1319157824002209/pdfft?md5=73c94f11cbc25ec7eb6841c1af93654a&pid=1-s2.0-S1319157824002209-main.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of King Saud University-Computer and Information Sciences\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1319157824002209\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2024/7/26 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of King Saud University-Computer and Information Sciences","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1319157824002209","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/7/26 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

摘要

文本到语音合成（TTS）是一个将自然语言文本翻译成语音的过程。语音合成人员在识别书面文本的前音要素（如语调（说话时声音的起伏）和长度）时面临着巨大挑战。相比之下，连续语音特征会受到艺术家个性和情感的影响。合成语音片段会保存在一个数据库中。它的输出取决于发音人发音的相似程度以及发音人的暗示能力。在过去几年中，深度学习这一人工智能技术的出现对文本到语音合成领域产生了巨大影响，并得到了广泛普及。本综述论文介绍了基于深度学习的模型和架构分类法，并讨论了在 TTS 过程中使用的各种数据集。本文还介绍了常用的评估矩阵。文章最后展望了系统的未来发展方向，并介绍了一些在该领域取得良好效果的深度学习模型。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Planning the development of text-to-speech synthesis models and datasets with dynamic deep learning

Synthesis of Text-to-speech (TTS) is a process that involves translating a natural language text into a speech. Speech synthesisers face a major challenge when recognizing the prosodic elements of written text, such as intonation (the rise and fall of the voice in speaking), and length. In contrast, continuous speech features are influenced by the personality and emotions of the artist. A database is maintained to store the synthesized speech pieces. Its output is determined by how similar the person utters the words and how capable they are of being implied. In the past few years, the field of text-to-speech synthesis has been heavily impacted by the emergence of deep learning, an AI technology that has gained widespread popularity. This review paper presents a taxonomy of models and architectures that are based on deep learning and discusses the various datasets that are utilised in the TTS process. It also covers the evaluation matrices that are commonly used. The paper ends with a look at the future directions of the system and reaches to some Deep learning models that give promising results in this field.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Journal of King Saud University-Computer and Information Sciences COMPUTER SCIENCE, INFORMATION SYSTEMS-

CiteScore

10.50

自引率

8.70%

发文量

656

审稿时长

29 days

期刊介绍： In 2022 the Journal of King Saud University - Computer and Information Sciences will become an author paid open access journal. Authors who submit their manuscript after October 31st 2021 will be asked to pay an Article Processing Charge (APC) after acceptance of their paper to make their work immediately, permanently, and freely accessible to all. The Journal of King Saud University Computer and Information Sciences is a refereed, international journal that covers all aspects of both foundations of computer and its practical applications.