{"title":"Low-Latency Neural Speech Phase Prediction based on Parallel Estimation Architecture and Anti-Wrapping Losses for Speech Generation Tasks","authors":"Yang Ai, Zhen-Hua Ling","doi":"arxiv-2403.17378","DOIUrl":null,"url":null,"abstract":"This paper presents a novel neural speech phase prediction model which\npredicts wrapped phase spectra directly from amplitude spectra. The proposed\nmodel is a cascade of a residual convolutional network and a parallel\nestimation architecture. The parallel estimation architecture is a core module\nfor direct wrapped phase prediction. This architecture consists of two parallel\nlinear convolutional layers and a phase calculation formula, imitating the\nprocess of calculating the phase spectra from the real and imaginary parts of\ncomplex spectra and strictly restricting the predicted phase values to the\nprincipal value interval. To avoid the error expansion issue caused by phase\nwrapping, we design anti-wrapping training losses defined between the predicted\nwrapped phase spectra and natural ones by activating the instantaneous phase\nerror, group delay error and instantaneous angular frequency error using an\nanti-wrapping function. We mathematically demonstrate that the anti-wrapping\nfunction should possess three properties, namely parity, periodicity and\nmonotonicity. We also achieve low-latency streamable phase prediction by\ncombining causal convolutions and knowledge distillation training strategies.\nFor both analysis-synthesis and specific speech generation tasks, experimental\nresults show that our proposed neural speech phase prediction model outperforms\nthe iterative phase estimation algorithms and neural network-based phase\nprediction methods in terms of phase prediction precision, efficiency and\nrobustness. Compared with HiFi-GAN-based waveform reconstruction method, our\nproposed model also shows outstanding efficiency advantages while ensuring the\nquality of synthesized speech. To the best of our knowledge, we are the first\nto directly predict speech phase spectra from amplitude spectra only via neural\nnetworks.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Sound","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2403.17378","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
This paper presents a novel neural speech phase prediction model that predicts wrapped phase spectra directly from amplitude spectra. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture is the core module for direct wrapped phase prediction. It consists of two parallel linear convolutional layers and a phase calculation formula, imitating the process of calculating phase spectra from the real and imaginary parts of complex spectra and strictly restricting the predicted phase values to the principal value interval.
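As a rough illustration of this idea, here is a minimal PyTorch sketch of a parallel estimation head. The module name, layer sizes, and kernel width are assumptions for illustration, not the authors' implementation; atan2 stands in for their phase calculation formula, since it likewise maps a real/imaginary pair to an angle in the principal value interval.

```python
import torch
import torch.nn as nn

class ParallelEstimationHead(nn.Module):
    """Illustrative sketch (not the authors' code): two parallel linear
    convolutional layers produce pseudo real/imaginary parts, and atan2
    maps them to wrapped phase restricted to (-pi, pi]."""

    def __init__(self, hidden_channels: int, freq_bins: int):
        super().__init__()
        # Two parallel linear (activation-free) convolutional layers.
        self.real_conv = nn.Conv1d(hidden_channels, freq_bins, kernel_size=5, padding=2)
        self.imag_conv = nn.Conv1d(hidden_channels, freq_bins, kernel_size=5, padding=2)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden_channels, frames) from the residual conv network.
        r = self.real_conv(h)  # pseudo real part
        i = self.imag_conv(h)  # pseudo imaginary part
        # The angle of the pseudo complex spectrum r + j*i is, by
        # construction, confined to the principal value interval.
        return torch.atan2(i, r)
```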
To avoid the error expansion issue caused by phase wrapping, we design anti-wrapping training losses, defined between the predicted wrapped phase spectra and the natural ones, by activating the instantaneous phase error, group delay error, and instantaneous angular frequency error using an anti-wrapping function. We mathematically demonstrate that the anti-wrapping function should possess three properties, namely parity, periodicity, and monotonicity.
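As a hedged sketch of how such losses could look (not the paper's exact definitions), one function satisfying all three properties is f(x) = |x - 2*pi*round(x / (2*pi))|; the three losses can then be formed by applying it to the raw phase error and to its differences along the frequency axis (group delay) and time axis (instantaneous angular frequency):

```python
import torch

def anti_wrap(x: torch.Tensor) -> torch.Tensor:
    """One candidate anti-wrapping function with the three required
    properties (even parity, 2*pi periodicity, monotonic on [0, pi]):
    f(x) = |x - 2*pi * round(x / (2*pi))|."""
    return torch.abs(x - 2 * torch.pi * torch.round(x / (2 * torch.pi)))

def anti_wrapping_losses(pred_phase: torch.Tensor, true_phase: torch.Tensor) -> torch.Tensor:
    # pred_phase, true_phase: (batch, freq_bins, frames), wrapped phase.
    err = pred_phase - true_phase
    # Instantaneous phase loss: activated raw phase error.
    ip = anti_wrap(err).mean()
    # Group delay loss: activated error differences along the frequency axis.
    gd = anti_wrap(err[:, 1:, :] - err[:, :-1, :]).mean()
    # Instantaneous angular frequency loss: differences along the time axis.
    iaf = anti_wrap(err[:, :, 1:] - err[:, :, :-1]).mean()
    return ip + gd + iaf
```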
We also achieve low-latency streamable phase prediction by combining causal convolutions and knowledge distillation training strategies.
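On the streaming side, the standard way to make a 1-D convolution causal is left-only padding, so each output frame depends on no future context. The sketch below shows this generic construction (hyper-parameters assumed, not the authors' configuration); in a distillation setup, a non-causal teacher's phase predictions would supervise such a causal student, though the abstract does not spell out that pairing.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """Left-padded 1-D convolution: the output at frame t depends only
    on frames <= t, enabling streaming inference with zero lookahead."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int, dilation: int = 1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # pad past frames only
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames); pad on the left (past) side only.
        return self.conv(F.pad(x, (self.pad, 0)))
```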
For both analysis-synthesis and specific speech generation tasks, experimental results show that our proposed neural speech phase prediction model outperforms iterative phase estimation algorithms and neural-network-based phase prediction methods in terms of phase prediction precision, efficiency, and robustness. Compared with the HiFi-GAN-based waveform reconstruction method, our proposed model also shows outstanding efficiency advantages while ensuring the quality of the synthesized speech. To the best of our knowledge, we are the first to directly predict speech phase spectra from only amplitude spectra via neural networks.