Non-filter waveform generation from cepstrum using spectral phase reconstruction

Speech Synthesis Workshop. Pub Date: 2016-09-13. DOI: 10.21437/SSW.2016-5

Yasuhiro Hamada, Nobutaka Ono, S. Sagayama


Citations: 1

Abstract

This paper discusses non-filter waveform generation from cepstral features using spectral phase reconstruction, as an alternative to the conventional source-filter model in text-to-speech (TTS) systems. Because the primary purpose of the filter is to produce a waveform with a desired spectral shape, one possible alternative to the source-filter framework is to convert the designed spectrum directly into a waveform using recently developed "phase reconstruction" from the power spectrogram. Given cepstral features and fundamental frequency (F0) from a TTS system as the desired spectrum, the spectrum to be heard by the listener is calculated by converting the cepstral features into a linear-scale power spectrum and multiplying it by the pitch structure of F0. The signal waveform is then generated from the power spectrogram by spectral phase reconstruction. An advantage of the proposed method is that it is free from the undesired amplitude and long time decay often caused by sharp resonances in recursive filters. In preliminary experiments, we compared the temporal and gain characteristics of speech synthesized with the proposed method and with the mel-log spectrum approximation (MLSA) filter. The results show that the proposed method performed better than the MLSA filter in both characteristics, suggesting desirable properties of the proposed method for speech synthesis.
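
The pipeline described in the abstract (cepstrum to linear-scale power spectrum, multiplication by an F0 pitch structure, then phase reconstruction from the power spectrogram) can be illustrated with a minimal NumPy sketch. Several choices here are assumptions, not details taken from the paper: the cepstrum is treated as a plain, unwarped cepstrum with log|H(ω)| = Σ c_m cos(mω); the pitch structure is a hypothetical comb of Gaussians at the harmonics of F0; and the Griffin-Lim iteration stands in for "spectral phase reconstruction", since the abstract does not name a specific algorithm. All function and parameter names are illustrative.

```python
import numpy as np

def cepstrum_to_power_envelope(cep, n_fft):
    """Linear-scale power envelope from cepstral coefficients.
    Assumed convention: log|H(w)| = sum_m cep[m] * cos(m * w), no mel warping."""
    omega = 2.0 * np.pi * np.arange(n_fft // 2 + 1) / n_fft
    log_mag = np.cos(np.outer(omega, np.arange(len(cep)))) @ cep
    return np.exp(2.0 * log_mag)  # power = |H|^2

def pitch_structure(f0, sr, n_fft, width_hz=20.0):
    """Hypothetical harmonic comb: Gaussians centred on multiples of f0."""
    freqs = np.arange(n_fft // 2 + 1) * sr / n_fft
    harmonics = np.arange(1, int((sr / 2) // f0) + 1) * f0
    return np.exp(-0.5 * ((freqs[:, None] - harmonics[None, :]) / width_hz) ** 2).sum(axis=1)

def stft(x, n_fft, hop, win):
    """Windowed frames -> one-sided spectra, shape (n_fft // 2 + 1, n_frames)."""
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[t * hop:t * hop + n_fft] * win for t in range(n_frames)], axis=1)
    return np.fft.rfft(frames, axis=0)

def istft(S, n_fft, hop, win):
    """Least-squares inverse STFT: windowed overlap-add divided by summed win^2."""
    n_frames = S.shape[1]
    x = np.zeros(n_fft + hop * (n_frames - 1))
    norm = np.zeros_like(x)
    frames = np.fft.irfft(S, n=n_fft, axis=0)
    for t in range(n_frames):
        x[t * hop:t * hop + n_fft] += frames[:, t] * win
        norm[t * hop:t * hop + n_fft] += win ** 2
    return x / np.maximum(norm, 1e-12)

def griffin_lim(magnitude, n_fft, hop, n_iter=50, seed=0):
    """Griffin-Lim phase reconstruction (assumed stand-in for the paper's method)."""
    win = np.hanning(n_fft)
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))  # random initial phase
    for _ in range(n_iter):
        x = istft(magnitude * phase, n_fft, hop, win)
        S = stft(x, n_fft, hop, win)
        phase = S / np.maximum(np.abs(S), 1e-12)  # keep the phase, discard the magnitude
    return istft(magnitude * phase, n_fft, hop, win)

# Toy stationary example: constant cepstrum and F0 over 40 frames (made-up values).
sr, n_fft, hop, n_frames = 16000, 512, 128, 40
envelope = cepstrum_to_power_envelope(np.array([0.0, 0.5, -0.2]), n_fft)
comb = pitch_structure(200.0, sr, n_fft)
target_power = np.tile((envelope * comb)[:, None], (1, n_frames))

# Phase reconstruction works on magnitudes, so take the square root of the power.
# The first and last n_fft samples are dropped because they are only partly overlapped.
wave = griffin_lim(np.sqrt(target_power), n_fft, hop)[n_fft:-n_fft]
```

No recursive filter appears anywhere in this sketch: the waveform is obtained only by alternating between the target magnitudes and the phases of a consistent STFT, which is the property the abstract contrasts with MLSA filtering. A real system would use time-varying, typically mel-warped, cepstra and an F0 contour rather than the constant values used here.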
