Towards a Neuro-Inspired No-Reference Instrumental Quality Measure for Text-to-Speech Systems

Authors: Rishabh Gupta, Anderson R. Avila, T. Falk
Venue: 2018 Tenth International Conference on Quality of Multimedia Experience (QoMEX), pp. 1-6
Publication date: 2018-05-01
DOI: 10.1109/QoMEX.2018.8463392
Citations: 1
Abstract
Subjective evaluation of synthesized speech is not an easy task, as various quality dimensions can be affected, including naturalness, prosody, pronunciation, and continuity, to name a few. Evaluations typically rely on naive listeners, thus more closely representing the consumers of commercial products. As such, while the results of these costly and time-consuming tests may provide text-to-speech (TTS) system developers with feedback on the perceived quality and acceptability of their devices, they provide little information about the sources of the problems or what can be done about them. In this paper, we propose the use of neuroimaging to probe the unconscious cognitive processing of naive listeners as they listen to synthesized speech generated by different systems of varying quality. The obtained neural insights have allowed us to extract a small subset of highly relevant features from the speech signals and to use these features to build a simple, no-reference instrumental quality metric specifically tailored to TTS speech. The metric is tested on an unseen dataset and shown to significantly outperform a benchmark algorithm.
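The abstract does not disclose which signal features or regression model the authors used. As a purely generic illustration of the idea of a no-reference metric (predicting a quality score from features of the degraded signal alone, with no clean reference), the sketch below fits a least-squares mapping from a small, placeholder feature vector to simulated MOS-like scores; all names and data here are hypothetical, not the paper's method:

```python
import numpy as np

# Hypothetical sketch of a no-reference quality metric: a regression from a
# small set of per-utterance signal features to subjective quality scores.
# The three features are placeholders, NOT the paper's actual feature set.

rng = np.random.default_rng(0)

# Toy "training" data: 50 utterances x 3 features.
X = rng.normal(size=(50, 3))
true_w = np.array([0.8, -0.5, 0.3])
y = X @ true_w + 0.05 * rng.normal(size=50)  # simulated MOS-like targets

# Fit a linear mapping (with bias) from features to quality score.
X1 = np.hstack([X, np.ones((50, 1))])
w, *_ = np.linalg.lstsq(X1, y, rcond=None)

def predict_quality(features):
    """No-reference prediction: score computed from the signal's own features,
    without comparing against a clean reference recording."""
    return float(np.append(features, 1.0) @ w)

score = predict_quality(np.array([0.2, -0.1, 0.4]))
```

In a real system, the features would be extracted from the synthesized waveform itself (the paper reports selecting a small subset guided by the neuroimaging results), and the model would be trained on listener ratings rather than simulated targets.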