InstructSing: High-Fidelity Singing Voice Generation via Instructing Yourself

Chang Zeng, Chunhui Wang, Xiaoxiao Miao, Jian Zhao, Zhonglin Jiang, Yong Chen
arXiv:2409.06330 · arXiv - EE - Audio and Speech Processing · Published 2024-09-10

Abstract

It is challenging to accelerate the training process of a neural vocoder while ensuring both high-quality generated voices and acceptable inference speed. In this paper, we propose a novel neural vocoder called InstructSing, which converges much faster than other neural vocoders while maintaining good performance, by integrating differentiable digital signal processing and adversarial training. It includes one generator and two discriminators. Specifically, the generator incorporates a harmonic-plus-noise (HN) module to produce 8 kHz audio as an instructive signal. Subsequently, the HN module is connected with an extended WaveNet by a UNet-based module, which transforms the output of the HN module into a latent variable sequence containing essential periodic and aperiodic information. In addition to the latent sequence, the extended WaveNet also takes the mel-spectrogram as input to generate 48 kHz high-fidelity singing voices. For the discriminators, we combine a multi-period discriminator, as originally proposed in HiFiGAN, with a multi-resolution multi-band STFT discriminator. Notably, InstructSing achieves voice quality comparable to other neural vocoders with only one-tenth of the training steps on a machine with 4 NVIDIA V100 GPUs (demo page: https://wavelandspeech.github.io/instructsing/). We plan to open-source our code and pretrained model once the paper is accepted.
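The harmonic-plus-noise idea behind the generator's instructive signal can be illustrated with a minimal sketch. The function below is a hypothetical toy implementation, not the authors' code: it sums sinusoidal harmonics derived from a per-sample F0 contour (the periodic part) and adds low-level Gaussian noise (the aperiodic part), which is the classical HN source model that DDSP-style vocoders build on.

```python
import numpy as np

def harmonic_plus_noise_excitation(f0, sr=8000, n_harmonics=8, noise_scale=0.003):
    """Toy harmonic-plus-noise source (illustrative only).

    f0: per-sample fundamental frequency in Hz (0 where unvoiced).
    Returns a waveform of the same length as f0.
    """
    f0 = np.asarray(f0, dtype=np.float64)
    voiced = f0 > 0
    # Instantaneous phase: cumulative sum of angular frequency per sample.
    phase = 2.0 * np.pi * np.cumsum(f0 / sr)
    signal = np.zeros_like(f0)
    for k in range(1, n_harmonics + 1):
        # Keep a harmonic only where voiced and below the Nyquist frequency.
        keep = voiced & (k * f0 < sr / 2)
        signal += np.where(keep, np.sin(k * phase), 0.0) / n_harmonics
    # Gaussian noise models the aperiodic residual (breath, consonants).
    signal += noise_scale * np.random.randn(len(f0))
    return signal
```

In InstructSing this kind of 8 kHz excitation serves only as a coarse instructive signal; a UNet-based module and an extended WaveNet then refine it into 48 kHz audio, with all parameters learned rather than fixed as in this sketch.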