E$^{3}$TTS: End-to-End Text-Based Speech Editing TTS System and Its Applications

IEEE/ACM Transactions on Audio, Speech, and Language Processing | Pub Date: 2024-10-23 | DOI: 10.1109/TASLP.2024.3485466 | IF 4.1 | Zone 2 (Computer Science) | JCR Q1 (Acoustics)
Zheng Liang;Ziyang Ma;Chenpeng Du;Kai Yu;Xie Chen
Volume 32, pp. 4810-4821 | https://ieeexplore.ieee.org/document/10731477/
Citations: 0

Abstract

Text-based speech editing aims to manipulate part of a real recording by modifying the corresponding transcribed text, such that the edit is not discernible to the human auditory system. With the enhanced capability of neural text-to-speech (TTS), researchers have tried to tackle speech editing problems with TTS methods. In this paper, we propose E$^{3}$TTS, an end-to-end text-based speech editing TTS system, which combines a text encoder, a speech encoder, and a joint net for speech synthesis and speech editing. E$^{3}$TTS can insert, replace, and delete speech content at will by manipulating the given text. Experiments show that our speech editing outperforms strong baselines on the HiFiTTS and LibriTTS datasets, whose speakers are seen and unseen, respectively. Further, we introduce E$^{3}$TTS into data augmentation for automatic speech recognition (ASR) to mitigate the data insufficiency problem in code-switching and named entity recognition scenarios. Compared to past data augmentation methods, E$^{3}$TTS retains the coherence and realism of the recorded audio. The experimental results show significant performance improvements over baseline systems with traditional TTS-based data augmentation. The code and samples of the proposed speech editing model are available at this repository.
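The abstract describes editing as driven purely by changes to the transcript (insert, replace, delete). As a rough illustration of that front-end step only, the sketch below compares an original and an edited transcript at the word level and lists the spans that would need new audio. It is not the authors' implementation: the function name `edit_operations` is hypothetical, whitespace tokenization is an assumption, and the standard-library `difflib` alignment stands in for whatever text alignment the paper actually uses.

```python
from difflib import SequenceMatcher


def edit_operations(original: str, edited: str):
    """Return word-level insert/replace/delete operations between two transcripts.

    Each operation is a tuple (kind, original_words, edited_words).
    Unchanged spans are skipped, on the assumption that their recorded
    audio would be kept as-is by a speech editing system.
    """
    src = original.split()  # assumption: simple whitespace tokenization
    dst = edited.split()
    matcher = SequenceMatcher(a=src, b=dst, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # nothing to regenerate for matching words
        ops.append((tag, src[i1:i2], dst[j1:j2]))
    return ops
```

For example, `edit_operations("the cat sat on the mat", "the dog sat on the mat")` yields a single `replace` operation for `cat` to `dog`. In a full system the affected word spans would still have to be mapped to time regions in the recording, which this sketch does not attempt.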
Source journal
IEEE/ACM Transactions on Audio, Speech, and Language Processing (Acoustics; Engineering, Electrical & Electronic)
CiteScore: 11.30
Self-citation rate: 11.10%
Articles published: 217
Journal description: The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.