Low-Latency Neural Speech Phase Prediction based on Parallel Estimation Architecture and Anti-Wrapping Losses for Speech Generation Tasks

Yang Ai, Zhen-Hua Ling
{"title":"Low-Latency Neural Speech Phase Prediction based on Parallel Estimation Architecture and Anti-Wrapping Losses for Speech Generation Tasks","authors":"Yang Ai, Zhen-Hua Ling","doi":"arxiv-2403.17378","DOIUrl":null,"url":null,"abstract":"This paper presents a novel neural speech phase prediction model which\npredicts wrapped phase spectra directly from amplitude spectra. The proposed\nmodel is a cascade of a residual convolutional network and a parallel\nestimation architecture. The parallel estimation architecture is a core module\nfor direct wrapped phase prediction. This architecture consists of two parallel\nlinear convolutional layers and a phase calculation formula, imitating the\nprocess of calculating the phase spectra from the real and imaginary parts of\ncomplex spectra and strictly restricting the predicted phase values to the\nprincipal value interval. To avoid the error expansion issue caused by phase\nwrapping, we design anti-wrapping training losses defined between the predicted\nwrapped phase spectra and natural ones by activating the instantaneous phase\nerror, group delay error and instantaneous angular frequency error using an\nanti-wrapping function. We mathematically demonstrate that the anti-wrapping\nfunction should possess three properties, namely parity, periodicity and\nmonotonicity. We also achieve low-latency streamable phase prediction by\ncombining causal convolutions and knowledge distillation training strategies.\nFor both analysis-synthesis and specific speech generation tasks, experimental\nresults show that our proposed neural speech phase prediction model outperforms\nthe iterative phase estimation algorithms and neural network-based phase\nprediction methods in terms of phase prediction precision, efficiency and\nrobustness. Compared with HiFi-GAN-based waveform reconstruction method, our\nproposed model also shows outstanding efficiency advantages while ensuring the\nquality of synthesized speech. To the best of our knowledge, we are the first\nto directly predict speech phase spectra from amplitude spectra only via neural\nnetworks.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Sound","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2403.17378","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

This paper presents a novel neural speech phase prediction model that predicts wrapped phase spectra directly from amplitude spectra. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture is the core module for direct wrapped phase prediction. It consists of two parallel linear convolutional layers and a phase calculation formula, imitating the process of calculating phase spectra from the real and imaginary parts of complex spectra and strictly restricting the predicted phase values to the principal value interval. To avoid the error expansion issue caused by phase wrapping, we design anti-wrapping training losses between the predicted wrapped phase spectra and the natural ones by activating the instantaneous phase error, group delay error, and instantaneous angular frequency error with an anti-wrapping function. We mathematically demonstrate that the anti-wrapping function should possess three properties, namely parity, periodicity, and monotonicity. We also achieve low-latency streamable phase prediction by combining causal convolutions with a knowledge distillation training strategy. For both analysis-synthesis and specific speech generation tasks, experimental results show that our proposed neural speech phase prediction model outperforms iterative phase estimation algorithms and neural-network-based phase prediction methods in terms of phase prediction precision, efficiency, and robustness. Compared with the HiFi-GAN-based waveform reconstruction method, our proposed model also shows outstanding efficiency advantages while ensuring the quality of the synthesized speech. To the best of our knowledge, this is the first work to predict speech phase spectra directly from amplitude spectra using only neural networks.
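To make the two core ideas in the abstract concrete, below is a minimal NumPy sketch of (a) a phase calculation formula of the kind the parallel estimation architecture uses to confine predictions to the principal value interval, and (b) anti-wrapping losses built from an anti-wrapping function. The function names, array shapes, and the specific choice f(x) = |x − 2π·round(x/2π)| are illustrative assumptions based only on the abstract, not the authors' released implementation.

```python
import numpy as np

def phase_formula(R, I, eps=1e-12):
    """Sketch of the phase calculation formula: combine the pseudo real part R
    and pseudo imaginary part I (outputs of the two parallel linear
    convolutional layers) into a wrapped phase strictly confined to the
    principal value interval (-pi, pi]. It is arctan(I/R) plus a
    quadrant-correction term, and numerically coincides with atan2(I, R)."""
    R_safe = np.where(R == 0.0, eps, R)        # avoid division by zero
    sgn_I = np.where(I >= 0.0, 1.0, -1.0)      # sign variant that maps 0 to +1
    return np.arctan(I / R_safe) - (np.pi / 2.0) * sgn_I * (np.sign(R_safe) - 1.0)

def anti_wrap(x):
    """Assumed anti-wrapping function f(x) = |x - 2*pi*round(x / (2*pi))|.
    It has the three properties the paper requires: parity (f(-x) = f(x)),
    2*pi-periodicity, and monotonicity on [0, pi], where f(x) = x."""
    return np.abs(x - 2.0 * np.pi * np.round(x / (2.0 * np.pi)))

def anti_wrapping_losses(P_pred, P_true):
    """Sum of the three anti-wrapping losses between predicted and natural
    wrapped phase spectra, each of shape (n_frames, n_bins)."""
    ip = anti_wrap(P_pred - P_true).mean()  # instantaneous phase error
    # group delay error: finite difference along the frequency axis
    gd = anti_wrap(np.diff(P_pred, axis=1) - np.diff(P_true, axis=1)).mean()
    # instantaneous angular frequency error: finite difference along the time axis
    iaf = anti_wrap(np.diff(P_pred, axis=0) - np.diff(P_true, axis=0)).mean()
    return ip + gd + iaf

# Toy check: the formula returns values in (-pi, pi], and the loss stays small
# for a lightly perturbed phase even when the perturbation crosses +/- pi.
rng = np.random.default_rng(0)
P_true = rng.uniform(-np.pi, np.pi, size=(100, 513))
noisy = P_true + 0.1 * rng.standard_normal((100, 513))
P_pred = phase_formula(np.cos(noisy), np.sin(noisy))   # re-wrapped prediction
print(anti_wrapping_losses(P_pred, P_true))
```

The point of activating each error with anti_wrap rather than a plain L1/L2 distance is that a prediction differing from the target by a multiple of 2π is the same wrapped phase and should incur zero penalty; the parity, periodicity, and monotonicity properties guarantee exactly that behavior.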