{"title":"InstructSing: High-Fidelity Singing Voice Generation via Instructing Yourself","authors":"Chang Zeng, Chunhui Wang, Xiaoxiao Miao, Jian Zhao, Zhonglin Jiang, Yong Chen","doi":"arxiv-2409.06330","DOIUrl":null,"url":null,"abstract":"It is challenging to accelerate the training process while ensuring both\nhigh-quality generated voices and acceptable inference speed. In this paper, we\npropose a novel neural vocoder called InstructSing, which can converge much\nfaster compared with other neural vocoders while maintaining good performance\nby integrating differentiable digital signal processing and adversarial\ntraining. It includes one generator and two discriminators. Specifically, the\ngenerator incorporates a harmonic-plus-noise (HN) module to produce 8kHz audio\nas an instructive signal. Subsequently, the HN module is connected with an\nextended WaveNet by an UNet-based module, which transforms the output of the HN\nmodule to a latent variable sequence containing essential periodic and\naperiodic information. In addition to the latent sequence, the extended WaveNet\nalso takes the mel-spectrogram as input to generate 48kHz high-fidelity singing\nvoices. In terms of discriminators, we combine a multi-period discriminator, as\noriginally proposed in HiFiGAN, with a multi-resolution multi-band STFT\ndiscriminator. Notably, InstructSing achieves comparable voice quality to other\nneural vocoders but with only one-tenth of the training steps on a 4 NVIDIA\nV100 GPU machine\\footnote{{Demo page:\n\\href{https://wavelandspeech.github.io/instructsing/}{\\texttt{https://wavelandspeech.github.io/inst\\\\ructsing/}}}}.\nWe plan to open-source our code and pretrained model once the paper get\naccepted.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"12 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.06330","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
It is challenging to accelerate the training process of a neural vocoder while ensuring both high-quality generated voices and acceptable inference speed. In this paper, we propose a novel neural vocoder called InstructSing, which converges much faster than other neural vocoders while maintaining good performance, by integrating differentiable digital signal processing with adversarial training. It consists of one generator and two discriminators.
Specifically, the generator incorporates a harmonic-plus-noise (HN) module that produces 8 kHz audio as an instructive signal. A UNet-based module then connects the HN module to an extended WaveNet, transforming the HN module's output into a latent variable sequence that carries the essential periodic and aperiodic information. In addition to this latent sequence, the extended WaveNet takes the mel-spectrogram as input and generates 48 kHz high-fidelity singing voices.
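To make this three-stage layout (HN module, UNet-based connector, extended WaveNet) concrete, the PyTorch sketch below wires up toy versions of the three modules. All class names, channel counts, layer depths, upsampling factors, and the per-sample F0 input are assumptions made for illustration, not the paper's actual configuration.

```python
# Minimal sketch of the generator pipeline described in the abstract.
# Every architectural detail here is an illustrative assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HNModule(nn.Module):
    """Harmonic-plus-noise stage: renders an 8 kHz instructive waveform
    from a per-sample F0 contour (an assumed input representation)."""
    def __init__(self, sample_rate: int = 8000, n_harmonics: int = 8):
        super().__init__()
        self.sample_rate, self.n_harmonics = sample_rate, n_harmonics

    def forward(self, f0: torch.Tensor) -> torch.Tensor:
        # f0: (B, T) in Hz; integrate to phase, sum a few harmonics,
        # and add a small noise floor as the aperiodic component.
        phase = 2 * torch.pi * torch.cumsum(f0, dim=-1) / self.sample_rate
        k = torch.arange(1, self.n_harmonics + 1, dtype=f0.dtype, device=f0.device)
        harm = torch.sin(phase.unsqueeze(1) * k.view(1, -1, 1)).mean(dim=1)
        return harm + 0.03 * torch.randn_like(harm)

class UNetConnector(nn.Module):
    """Tiny 1-D encoder/decoder standing in for the UNet-based module:
    maps the 8 kHz waveform to a latent sequence carrying periodic and
    aperiodic cues for the next stage (skip connections omitted)."""
    def __init__(self, latent_dim: int = 64):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv1d(1, 32, 15, stride=4, padding=7), nn.GELU(),
            nn.Conv1d(32, 64, 15, stride=4, padding=7), nn.GELU(),
        )
        self.up = nn.Sequential(
            nn.ConvTranspose1d(64, 64, 16, stride=4, padding=6), nn.GELU(),
            nn.ConvTranspose1d(64, latent_dim, 16, stride=4, padding=6),
        )

    def forward(self, wav_8k: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(wav_8k.unsqueeze(1)))  # (B, latent_dim, T)

class ExtendedWaveNetStage(nn.Module):
    """Stand-in for the extended WaveNet: a residual dilated-conv stack
    conditioned on the latent sequence plus the mel-spectrogram, then a
    6x upsampling from the 8 kHz latent rate to the 48 kHz output rate."""
    def __init__(self, latent_dim: int = 64, n_mels: int = 80,
                 channels: int = 64, n_layers: int = 6):
        super().__init__()
        self.pre = nn.Conv1d(latent_dim + n_mels, channels, 1)
        self.dilated = nn.ModuleList(
            nn.Conv1d(channels, channels, 3, padding=2 ** i, dilation=2 ** i)
            for i in range(n_layers)
        )
        self.post = nn.ConvTranspose1d(channels, 1, 12, stride=6, padding=3)

    def forward(self, latent: torch.Tensor, mel: torch.Tensor) -> torch.Tensor:
        # Stretch mel frames to the latent's temporal resolution (assumption).
        mel_up = F.interpolate(mel, size=latent.shape[-1], mode="nearest")
        x = self.pre(torch.cat([latent, mel_up], dim=1))
        for conv in self.dilated:
            x = x + torch.tanh(conv(x))
        return self.post(x).squeeze(1)  # (B, 6 * T) waveform at 48 kHz

# Toy end-to-end pass: 1 s of a 220 Hz contour plus dummy mel frames.
hn, connect, wavenet = HNModule(), UNetConnector(), ExtendedWaveNetStage()
f0 = torch.full((1, 8000), 220.0)
mel = torch.randn(1, 80, 100)
wav_48k = wavenet(connect(hn(f0)), mel)  # shape: (1, 48000)
```

The point of the layout is that the cheap DDSP-style HN output supplies an explicit periodic/aperiodic "teacher" signal early in training, so the WaveNet stage does not have to discover pitch structure from scratch.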
For the discriminators, we combine the multi-period discriminator originally proposed in HiFi-GAN with a multi-resolution, multi-band STFT discriminator.
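As a sketch of the second discriminator, the snippet below scores a waveform at several STFT resolutions, splitting each spectrogram's frequency axis into bands judged by separate small conv stacks; HiFi-GAN's multi-period discriminator would be trained alongside it. The resolutions, the four-way band split, and the layer sizes are illustrative assumptions rather than the paper's settings.

```python
# Minimal multi-resolution, multi-band STFT discriminator sketch (PyTorch).
import torch
import torch.nn as nn

class STFTBandDiscriminator(nn.Module):
    """Judges one STFT resolution, with the frequency axis split into
    bands scored by separate small conv stacks (the multi-band part)."""
    def __init__(self, n_fft: int, hop: int, n_bands: int = 4):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        self.band_bins = (n_fft // 2 + 1) // n_bands
        self.bands = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(16, 1, 3, padding=1),
            )
            for _ in range(n_bands)
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (B, T) -> magnitude spectrogram (B, F, frames).
        window = torch.hann_window(self.n_fft, device=wav.device)
        spec = torch.stft(wav, self.n_fft, self.hop, window=window,
                          return_complex=True).abs()
        scores = []
        for i, disc in enumerate(self.bands):
            band = spec[:, i * self.band_bins:(i + 1) * self.band_bins]
            scores.append(disc(band.unsqueeze(1)).mean(dim=(1, 2, 3)))
        return torch.stack(scores, dim=1)  # (B, n_bands) realness scores

class MultiResMultiBandSTFT(nn.Module):
    """The multi-resolution part: one band-split discriminator per STFT size."""
    def __init__(self, resolutions=((512, 128), (1024, 256), (2048, 512))):
        super().__init__()
        self.subs = nn.ModuleList(STFTBandDiscriminator(n, h) for n, h in resolutions)

    def forward(self, wav: torch.Tensor) -> list[torch.Tensor]:
        return [d(wav) for d in self.subs]

# Each resolution yields per-band scores for real/fake waveform batches.
scores = MultiResMultiBandSTFT()(torch.randn(2, 48000))  # 3 tensors of (2, 4)
```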
Notably, InstructSing achieves voice quality comparable to that of other neural vocoders with only one-tenth of their training steps on a machine with four NVIDIA V100 GPUs (demo page: https://wavelandspeech.github.io/instructsing/).
We plan to open-source our code and pretrained model once the paper is accepted.