{"title":"Investigating Spectral Amplitude Modulation Phase Hierarchy Features in Speech Synthesis","authors":"Alexandros Lazaridis, M. Cernak, Pierre-Edouard Honnet, Philip N. Garner","doi":"10.21437/SSW.2016-6","DOIUrl":"https://doi.org/10.21437/SSW.2016-6","url":null,"abstract":"Keywords: deep neural networks ; probabilistic amplitude demodulation ; spectral amplitude modulation phase hierarchy ; speech prosody ; speech synthesis Reference EPFL-CONF-222449 Related documents: http://publications.idiap.ch/index.php/publications/showcite/Lazaridis_Idiap-RR-22-2016 Record created on 2016-10-19, modified on 2017-05-10","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121977885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Novel Pre-processing using Outlier Removal in Voice Conversion","authors":"S. Rao, Nirmesh J. Shah, H. Patil","doi":"10.21437/SSW.2016-22","DOIUrl":"https://doi.org/10.21437/SSW.2016-22","url":null,"abstract":"Voice conversion (VC) technique modifies the speech utter-ance spoken by a source speaker to make it sound like a target speaker is speaking. Gaussian Mixture Model (GMM)-based VC is a state-of-the-art method. It finds the mapping function by modeling the joint density of source and target speakers using GMM to convert spectral features framewise. As with any real dataset, the spectral parameters contain a few points that are inconsistent with the rest of the data, called outliers . Until now, there has been very few literature regarding the effect of outliers in voice conversion. In this paper, we have explored the effect of outliers in voice conversion, as a pre-processing step. In order to remove these outliers, we have used the score distance, which uses the scores estimated using Robust Principal Component Analysis (ROBPCA). The outliers are determined by using a cut-off value based on the degrees of freedom in a chi-squared distribution. They are then removed from the training dataset and a GMM is trained based on the least outlying points. This pre-processing step can be applied to various methods. Experimental results indicate that there is a clear improvement in both, the objective ( 8 %) as well as the subjective ( 4 % for MOS and 5 % for XAB) results.","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128506870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Automatic Voice Conversion Evaluation Strategy Based on Perceptual Background Noise Distortion and Speaker Similarity","authors":"Dong-Yan Huang, Lei Xie, Yvonne Siu Wa Lee, Jie Wu, Huaiping Ming, Xiaohai Tian, Shaofei Zhang, Chuang Ding, Mei Li, Nguyen Quy Hy, M. Dong, Haizhou Li","doi":"10.21437/SSW.2016-8","DOIUrl":"https://doi.org/10.21437/SSW.2016-8","url":null,"abstract":"Voice conversion aims to modify the characteristics of one speaker to make it sound like spoken by another speaker without changing the language content. This task has attracted con-siderable attention and various approaches have been proposed since two decades ago. The evaluation of voice conversion approaches, usually through time-intensive subject listening tests, requires a huge amount of human labor. This paper proposes an automatic voice conversion evaluation strategy based on perceptual background noise distortion and speaker similarity. Ex-perimental results show that our automatic evaluation results match the subjective listening results quite well. We further use our strategy to select best converted samples from multiple voice conversion systems and our submission achieves promising results in the voice conversion challenge (VCC2016).","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132950328","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Pulse Model in Log-domain for a Uniform Synthesizer","authors":"G. Degottex, P. Lanchantin, M. Gales","doi":"10.21437/SSW.2016-35","DOIUrl":"https://doi.org/10.21437/SSW.2016-35","url":null,"abstract":"The quality of the vocoder plays a crucial role in the performance of parametric speech synthesis systems. In order to improve the vocoder quality, it is necessary to reconstruct as much of the perceived components of the speech signal as possible. In this paper, we first show that the noise component is currently not accurately modelled in the widely used STRAIGHT vocoder, thus, limiting the voice range that can be covered and also limiting the overall quality. In order to motivate a new, alternative, approach to this issue, we present a new synthesizer, which uses a uniform representation for voiced and unvoiced segments. This synthesizer has also the advantage of using a simple signal model compared to other approaches, thus offering a convenient and controlled alternative for future developments. Experiments analysing the synthesis quality of the noise component shows improved speech reconstruction using the suggested synthesizer compared to STRAIGHT. Additionally an experiment about analysis/resynthesis shows that the suggested synthesizer solves some of the issues of another uniform vocoder, Harmonic Model plus Phase Distortion (HMPD). In text-to-speech synthesis, it outperforms HMPD and exhibits a similar, or only slightly worse, quality to STRAIGHT’s quality, which is encouraging for a new vocoding approach.","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124009355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using instantaneous frequency and aperiodicity detection to estimate F0 for high-quality speech synthesis","authors":"Hideki Kawahara, Yannis Agiomyrgiannakis, H. Zen","doi":"10.21437/SSW.2016-36","DOIUrl":"https://doi.org/10.21437/SSW.2016-36","url":null,"abstract":"This paper introduces a general and flexible framework for F0 and aperiodicity (additive non periodic component) analysis, specifically intended for high-quality speech synthesis and modification applications. The proposed framework consists of three subsystems: instantaneous frequency estimator and initial aperiodicity detector, F0 trajectory tracker, and F0 refinement and aperiodicity extractor. A preliminary implementation of the proposed framework substantially outperformed (by a factor of 10 in terms of RMS F0 estimation error) existing F0 extractors in tracking ability of temporally varying F0 trajectories. The front end aperiodicity detector consists of a complex-valued wavelet analysis filter with a highly selective temporal and spectral envelope. This front end aperiodicity detector uses a new measure that quantifies the deviation from periodicity. The measure is less sensitive to slow FM and AM and closely correlates with the signal to noise ratio.","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"125 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116826941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Colledge: a vision of collaborative knowledge networks","authors":"S. Metzger, K. Hose, Ralf Schenkel","doi":"10.1145/2494068.2494069","DOIUrl":"https://doi.org/10.1145/2494068.2494069","url":null,"abstract":"More and more semantic information has become available as RDF data recently, with the linked open data cloud as a prominent example. However, participating in the Semantic Web is cumbersome. Typically several steps are involved in using semantic knowledge. Information is first acquired, e.g. by information extraction, crowd sourcing or human experts. Then ontologies are published and distributed. Users may apply reasoning and otherwise modify their local ontology instances. However, currently these steps are treated separately and although each involves human effort, nearly no synergy effect is used and it is also mostly a one way process, e.g. user feedback hardly flows back into the main ontology version. Similarly, user cooperation is low.\u0000 While there are approaches alleviating some of these limitations, e.g. extracting information at query time, personalizing queries, and integration of user feedback, this work combines all the pieces envisioning a social knowledge network that enables collaborative knowledge generation and exchange. Each aforementioned step is seen as a particular implementation of a network node responding to knowledge queries in its own way, e.g. by extracting it, applying reasoning or asking users, and learning from knowledge exchanged with neighbours. Original knowledge as well as user feedback is distributed over the network based on similar trust and provenance mechanisms. The extended query language we call for also allows for personalization.","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130173780","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A classification of web API selection solutions over the linked web","authors":"D. Bianchini","doi":"10.1145/2494068.2494072","DOIUrl":"https://doi.org/10.1145/2494068.2494072","url":null,"abstract":"Effective support to web designers for fast development of web applications starting from third-party components or Web APIs requires to take into account different aspects. Among them, functional and non functional Web API features and suggestions coming from other web designers who faced similar problems and can share the solutions they adopted. In this paper, we propose a new model that brings together all these aspects to support Web API selection for building web mashups. We exploited the model to provide a map of existing Web API recommendation strategies, as well as to design new solutions based on the combined modeling of different Web API descriptive aspects. Since these aspects are extracted from different sources (such as Web API public repositories and social networks of web mashup developers), our model is built by relying on the Linked Data principles.","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128171023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards a similarity-based web service discovery through soft constraint satisfaction problems","authors":"F. Arbab, Francesco Santini, Stefano Bistarelli, Daniele Pirolandi","doi":"10.1145/2494068.2494070","DOIUrl":"https://doi.org/10.1145/2494068.2494070","url":null,"abstract":"In this paper, we focus on the discovery process of Web Services (WSs) by basing the search on the similarities among the service requirements and candidate search results, in order to cope with over-constrained queries or to find satisfactory alternatives for user requirements. This discovery process needs to involve the so-called Soft Constraint Satisfaction Problems (SCSPs). First we represent both WSs and the search query of the user as Rooted Trees, i.e., a particular form of Conceptual Graphs. Then, we find a homomorphism between these two trees as a solution of an SCSP. The main contribution of this paper is the enhanced expressiveness offered by this \"softness\": in over-constrained scenarios, when a user query cannot be satisfied, classical crisp constraints (i.e., CSP) are not expressive enough to find \"close\" solutions to meet the users' needs.","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116873886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Searching semantic data warehouses: models, issues, architectures","authors":"A. Cuzzocrea, A. Simitsis","doi":"10.1145/2494068.2494074","DOIUrl":"https://doi.org/10.1145/2494068.2494074","url":null,"abstract":"This paper focuses the attention on the issues of searching Semantic Data Warehouses, by providing an overview on state-of-the-art approaches, along with a critical discussion on open issues and future research directions in the investigated scientific field.","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121240638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing the famous people ontology by mining a social network","authors":"Soon Ae Chun, Tian Tian, J. Geller","doi":"10.1145/2494068.2494073","DOIUrl":"https://doi.org/10.1145/2494068.2494073","url":null,"abstract":"The search terms that a user passes to a search engine are often ambiguous, referring to homonyms. The results in these cases are a mixture of links to documents that contain different meanings of the search terms. To improve the search for homonyms, we previously designed an Ontology-Supported Web Search System (OSWS) for \"famous people.\" To serve this system, we built an ontology of famous people based on mining the suggested completions of a search engine and on data from DBpedia. In this paper, we present an approach to improve the OSWS ontology by mining data from Facebook \"people public pages.\" Facebook attributes are cleaned up and mapped to the OSWS ontology.","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126463009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}