VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers
Sanyuan Chen, Shujie Liu, Long Zhou, Yanqing Liu, Xu Tan, Jinyu Li, Sheng Zhao, Yao Qian, Furu Wei
arXiv:2406.05370 (arXiv - CS - Sound, published 2024-06-08)
Abstract
This paper introduces VALL-E 2, the latest advancement in neural codec language models and a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity for the first time. Building on its predecessor, VALL-E, this iteration introduces two significant enhancements. Repetition Aware Sampling refines the original nucleus sampling process by accounting for token repetition in the decoding history; it not only stabilizes decoding but also circumvents the infinite-loop issue. Grouped Code Modeling organizes codec codes into groups to shorten the sequence length, which both boosts inference speed and addresses the challenges of long-sequence modeling (both mechanisms are sketched below). Our experiments on the LibriSpeech and VCTK datasets show that VALL-E 2 surpasses previous systems in speech robustness, naturalness, and speaker similarity, and is the first of its kind to reach human parity on these benchmarks. Moreover, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases. These capabilities could support valuable applications, such as generating speech for individuals with aphasia or amyotrophic lateral sclerosis. Demos of VALL-E 2 will be posted at https://aka.ms/valle2.
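To make the Repetition Aware Sampling idea concrete, below is a minimal Python sketch based only on the high-level description in the abstract: a token is first drawn with nucleus (top-p) sampling, and if that token already repeats heavily in the recent decoding history, the step falls back to sampling from the full, un-truncated distribution. The parameter names (top_p, window, ratio_threshold) and the exact fallback rule are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def nucleus_sample(probs, top_p, rng):
    """Standard nucleus (top-p) sampling from a 1-D probability vector."""
    order = np.argsort(probs)[::-1]                 # token ids sorted by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    kept = order[:cutoff]                           # smallest set whose mass reaches top_p
    kept_probs = probs[kept] / probs[kept].sum()    # renormalize over the kept set
    return int(rng.choice(kept, p=kept_probs))

def repetition_aware_sample(probs, history, top_p=0.8, window=10,
                            ratio_threshold=0.1, rng=None):
    """Sample one codec token; fall back to the full distribution when the
    nucleus-sampled token is repeating heavily in the recent history.

    `history` is a plain Python list of previously emitted token ids."""
    rng = rng or np.random.default_rng()
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    repetition_ratio = recent.count(token) / max(len(recent), 1)
    if repetition_ratio > ratio_threshold:
        # The token is looping: resample from the original (un-truncated)
        # distribution, which is what breaks the infinite-loop behaviour.
        token = int(rng.choice(len(probs), p=probs))
    return token
```

The window size and ratio threshold would be hyperparameters in practice; the sketch only illustrates the control flow that stabilizes decoding.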
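Similarly, here is a rough PyTorch sketch of how Grouped Code Modeling shortens the modeled sequence, assuming an illustrative design in which each group of codec codes is embedded and projected to a single model position: with group size G, a sequence of T codec tokens is modeled as roughly T/G positions. The names (group_codec_codes, GroupedCodeEmbedding, group_size, d_model) are hypothetical; the paper's actual grouping and prediction heads may differ.

```python
import torch

def group_codec_codes(codes: torch.Tensor, group_size: int, pad_id: int = 0) -> torch.Tensor:
    """Reshape a flat codec-token sequence of shape (T,) into (T/G, G) groups,
    padding the tail so the length divides evenly. The language model then
    attends over T/G group positions instead of T token positions."""
    pad = (-codes.shape[0]) % group_size            # tokens needed to complete the last group
    if pad:
        codes = torch.cat([codes, codes.new_full((pad,), pad_id)])
    return codes.reshape(-1, group_size)

class GroupedCodeEmbedding(torch.nn.Module):
    """Embed each group of codec codes into one model-dimension vector by
    concatenating the per-code embeddings and projecting them down."""
    def __init__(self, vocab_size: int, group_size: int, d_model: int):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, d_model)
        self.project = torch.nn.Linear(group_size * d_model, d_model)

    def forward(self, grouped_codes: torch.Tensor) -> torch.Tensor:
        # grouped_codes: (num_groups, group_size) -> (num_groups, d_model)
        per_code = self.embed(grouped_codes)        # (num_groups, group_size, d_model)
        return self.project(per_code.flatten(start_dim=1))
```

For example, with group_size = 2 a 1,500-token codec sequence is modeled as 750 positions, which is the source of the inference-speed and long-sequence benefits the abstract describes.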