Deep learning-based expressive speech synthesis: a systematic review of approaches, challenges, and resources

Impact factor: 1.9 | Zone 3 (Computer Science) | JCR Q2 (ACOUSTICS) | Eurasip Journal on Audio Speech and Music Processing | Pub Date: 2024-02-12 | DOI: 10.1186/s13636-024-00329-7

Huda Barakat, Oytun Turk, Cenk Demiroglu



Abstract

Speech synthesis has made significant strides thanks to the transition from machine learning to deep learning models. Contemporary text-to-speech (TTS) models can generate speech of exceptionally high quality, closely mimicking human speech. Nevertheless, given the wide array of applications now employing TTS models, mere high-quality speech generation is no longer sufficient. Present-day TTS models must also excel at producing expressive speech that can convey various speaking styles and emotions, akin to human speech. Consequently, researchers have concentrated their efforts on developing more efficient models for expressive speech synthesis in recent years. This paper presents a systematic review of the literature on expressive speech synthesis models published within the last 5 years, with a particular emphasis on approaches based on deep learning. We offer a comprehensive classification scheme for these models and provide concise descriptions of models falling into each category. Additionally, we summarize the principal challenges encountered in this research domain and outline the strategies employed to tackle these challenges as documented in the literature. In Section 8, we pinpoint some research gaps in this field that necessitate further exploration. Our objective with this work is to give an all-encompassing overview of this active research area and to offer guidance to interested researchers and to future work in this field.


Source journal

Eurasip Journal on Audio Speech and Music Processing ACOUSTICS-ENGINEERING, ELECTRICAL & ELECTRONIC

CiteScore

4.10

Self-citation rate

4.20%

Articles published

Review time

12 months

Journal description: The aim of “EURASIP Journal on Audio, Speech, and Music Processing” is to bring together researchers, scientists and engineers working on the theory and applications of the processing of various audio signals, with a specific focus on speech and music. EURASIP Journal on Audio, Speech, and Music Processing will be an interdisciplinary journal for the dissemination of all basic and applied aspects of speech communication and audio processes.

Latest articles from this journal

Diffraction perception in L-shaped rooms using virtual reality
Singing to speech conversion with generative flow
Robust and early howling detection based on a sparsity measure
Compression of room impulse responses for compact storage and fast low-latency convolution
Guest editorial: AI for computational audition—sound and music processing