Singing to speech conversion with generative flow.

IF 1.9 | Zone 3 (Computer Science) | Q2 Acoustics | EURASIP Journal on Audio, Speech, and Music Processing | Pub Date: 2025-01-01 | Epub Date: 2025-03-10 | DOI: 10.1186/s13636-025-00400-x

Jiawen Huang, Emmanouil Benetos

Citations: 0

Abstract

This paper introduces singing-to-speech conversion (S2S), a cross-domain voice conversion task, and presents the first deep learning-based S2S system. S2S aims to transform singing into speech while retaining the phonetic information and reducing variations in pitch, rhythm, and timbre. Inspired by the Glow-TTS architecture, the proposed model is built using generative flow, with an adjusted alignment module between the latent features. We adapt the original monotonic alignment search (MAS) to the S2S scenario and use a duration predictor to handle the duration differences between the two modalities. Subjective evaluations show that the proposed model outperforms signal processing baselines in naturalness and outperforms a transcribe-and-synthesize baseline in phonetic similarity to the original singing. We further demonstrate that singing-to-speech conversion could be an effective augmentation method for low-resource lyrics transcription.
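For readers unfamiliar with monotonic alignment search, the following is a minimal sketch of the original MAS dynamic program as described for Glow-TTS, not the adapted S2S variant, whose details the abstract does not give. The function names, the plain nested-list input, and the toy scores are assumptions made for illustration. Each row of `log_lik` is a source unit, each column is a target frame, and the search finds the monotonic alignment that covers every source unit at least once and maximizes the total log-likelihood. The per-unit frame counts it yields are the kind of target a duration predictor is typically trained on.

```python
import math


def monotonic_alignment_search(log_lik):
    """Return, for each frame j, the index of the source unit aligned to it.

    log_lik[i][j] is the log-likelihood of frame j under source unit i.
    The alignment starts at unit 0, ends at the last unit, never moves
    backwards, and never skips a unit.
    """
    n_units = len(log_lik)
    n_frames = len(log_lik[0])
    if n_units > n_frames:
        raise ValueError("each source unit needs at least one frame")

    neg_inf = -math.inf
    # q[i][j]: best total score of any valid path ending at (unit i, frame j).
    q = [[neg_inf] * n_frames for _ in range(n_units)]
    q[0][0] = log_lik[0][0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_units)):
            stay = q[i][j - 1]  # same unit as the previous frame
            advance = q[i - 1][j - 1] if i > 0 else neg_inf  # next unit
            q[i][j] = max(stay, advance) + log_lik[i][j]

    # Backtrack from the last unit at the last frame.
    alignment = [0] * n_frames
    i = n_units - 1
    for j in range(n_frames - 1, -1, -1):
        alignment[j] = i
        # Step back one unit if forced (i == j) or if advancing scored better.
        if i > 0 and (i == j or q[i - 1][j - 1] >= q[i][j - 1]):
            i -= 1
    return alignment


def durations_from_alignment(alignment, n_units):
    """Count how many frames were assigned to each source unit."""
    durations = [0] * n_units
    for unit in alignment:
        durations[unit] += 1
    return durations


# Toy scores (hypothetical): unit 0 fits frames 0-1, unit 1 fits frames 2-3.
toy_log_lik = [
    [0.0, 0.0, -5.0, -5.0],
    [-5.0, -5.0, 0.0, 0.0],
]
toy_alignment = monotonic_alignment_search(toy_log_lik)
toy_durations = durations_from_alignment(toy_alignment, len(toy_log_lik))
```

On the toy input, the first two frames are assigned to unit 0 and the last two to unit 1, so each unit receives a duration of two frames.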

Source journal

EURASIP Journal on Audio, Speech, and Music Processing (categories: Acoustics; Engineering, Electrical & Electronic)

CiteScore

4.10

Self-citation rate

4.20%

Articles published

Review time

12 months

About the journal: The aim of “EURASIP Journal on Audio, Speech, and Music Processing” is to bring together researchers, scientists and engineers working on the theory and applications of the processing of various audio signals, with a specific focus on speech and music. EURASIP Journal on Audio, Speech, and Music Processing will be an interdisciplinary journal for the dissemination of all basic and applied aspects of speech communication and audio processes.
