SpeechX: Neural Codec Language Model as a Versatile Speech Transformer

IF 4.1 Zone 2 (Computer Science) Q1 ACOUSTICS IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date: 2024-06-28 DOI: 10.1109/TASLP.2024.3419418

Xiaofei Wang;Manthan Thakker;Zhuo Chen;Naoyuki Kanda;Sefik Emre Eskimez;Sanyuan Chen;Min Tang;Shujie Liu;Jinyu Li;Takuya Yoshioka

IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 3355-3364. https://ieeexplore.ieee.org/document/10577150/

Citation count: 0

Abstract

Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations like high-quality zero-shot text-to-speech. However, existing models still face limitations in handling diverse audio-text speech generation tasks involving transforming input speech and processing audio captured in adverse acoustic conditions. This paper introduces SpeechX, a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks, dealing with both clean and noisy signals. SpeechX combines neural codec language modeling with multi-task learning using task-dependent prompting, enabling unified and extensible modeling and providing a consistent way for leveraging textual input in speech enhancement and transformation tasks. Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise, achieving comparable or superior performance to specialized models across tasks.
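As a rough illustration of what "task-dependent prompting" could look like, the sketch below assembles one input sequence for a single shared model by prepending a task token to the text and audio-codec tokens. This is only a minimal sketch under stated assumptions: the task names, the `<...>` token spelling, the `<sep>` separator, and the `build_prompt` helper are all hypothetical and are not taken from the paper, which the abstract does not detail.

```python
from typing import List, Sequence, Union

# Hypothetical task inventory, named after the tasks listed in the abstract.
# The actual token set used by SpeechX is not given in the abstract.
TASKS = (
    "zero_shot_tts",
    "noise_suppression",
    "target_speaker_extraction",
    "speech_removal",
    "speech_editing",
)

Token = Union[str, int]


def build_prompt(
    task: str,
    text_tokens: Sequence[str],
    audio_tokens: Sequence[int],
) -> List[Token]:
    """Assemble a prompt: task token, text tokens, separator, codec tokens.

    The ordering and the "<sep>" marker are assumptions for illustration.
    """
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    return [f"<{task}>", *text_tokens, "<sep>", *audio_tokens]


if __name__ == "__main__":
    # Same text and audio tokens, different task token: one model, many tasks.
    print(build_prompt("noise_suppression", ["h", "i"], [3, 7]))
    print(build_prompt("speech_removal", ["h", "i"], [3, 7]))
```

The point of the sketch is the design idea stated in the abstract: the task is selected by the prompt rather than by a separate model, so adding a task would only mean adding a new task token and training data.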




