Towards a Neuro-Inspired No-Reference Instrumental Quality Measure for Text-to-Speech Systems

Authors: Rishabh Gupta, Anderson R. Avila, T. Falk
Venue: 2018 Tenth International Conference on Quality of Multimedia Experience (QoMEX), pp. 1-6
Publication date: 2018-05-01
DOI: 10.1109/QoMEX.2018.8463392
Citations: 1
Abstract
Subjective evaluation of synthesized speech is not an easy task, as various quality dimensions can be affected, including naturalness, prosody, pronunciation, and continuity, to name a few. Evaluations typically rely on naive listeners, thus more closely representing the consumers of commercial products. As such, while the results of these costly and time-consuming tests may provide text-to-speech (TTS) system developers with feedback on the perceived quality and acceptability of their devices, they provide little information about the sources of the problems or what can be done about them. In this paper, we propose the use of neuroimaging to probe the unconscious cognitive processing of naive listeners as they listen to synthesized speech generated by different systems of varying quality. The obtained neural insights have allowed us to extract a small subset of highly relevant features from the speech signals and to use these features to build a simple, no-reference instrumental quality metric specifically tailored to TTS speech. The metric is tested on an unseen dataset and shown to significantly outperform a benchmark algorithm.
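The abstract does not disclose which signal features or regression model the authors used. As a purely generic illustration of the idea of a no-reference metric (predicting a quality score from features of the degraded signal alone, with no clean reference), the sketch below fits a least-squares mapping from a small, placeholder feature vector to simulated MOS-like scores; all names and data here are hypothetical, not the paper's method:

```python
import numpy as np

# Hypothetical sketch of a no-reference quality metric: a regression from a
# small set of per-utterance signal features to subjective quality scores.
# The three features are placeholders, NOT the paper's actual feature set.

rng = np.random.default_rng(0)

# Toy "training" data: 50 utterances x 3 features.
X = rng.normal(size=(50, 3))
true_w = np.array([0.8, -0.5, 0.3])
y = X @ true_w + 0.05 * rng.normal(size=50)  # simulated MOS-like targets

# Fit a linear mapping (with bias) from features to quality score.
X1 = np.hstack([X, np.ones((50, 1))])
w, *_ = np.linalg.lstsq(X1, y, rcond=None)

def predict_quality(features):
    """No-reference prediction: score computed from the signal's own features,
    without comparing against a clean reference recording."""
    return float(np.append(features, 1.0) @ w)

score = predict_quality(np.array([0.2, -0.1, 0.4]))
```

In a real system, the features would be extracted from the synthesized waveform itself (the paper reports selecting a small subset guided by the neuroimaging results), and the model would be trained on listener ratings rather than simulated targets.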