12th ISCA Speech Synthesis Workshop (SSW2023): Latest Publications

Audiobook synthesis with long-form neural text-to-speech
Pub Date: 2023-08-26 | DOI: 10.21437/ssw.2023-22
Weicheng Zhang, Cheng-chieh Yeh, Will Beckman, T. Raitio, Ramya Rasipuram, L. Golipour, David Winarsky
Despite recent advances in text-to-speech (TTS) technology, auto-narration of long-form content such as books remains a challenge. The goal of this work is to enhance neural TTS to be suitable for long-form content such as audiobooks. In addition to high quality, we aim to provide a compelling and engaging listening experience with expressivity that spans beyond a single sentence to a paragraph level so that the user can not only follow the story but also enjoy listening to it. Towards that goal, we made four enhancements to our baseline TTS system: incorporation of BERT embeddings, explicit prosody prediction from text, long-context modeling over multiple sentences, and pre-training on long-form data. We propose an evaluation framework tailored to long-form content that evaluates the synthesis on segments spanning multiple paragraphs and focuses on elements such as comprehension, ease of listening, ability to keep attention, and enjoyment. The evaluation results show that the proposed approach outperforms the baseline on all evaluated metrics, with an absolute 0.47 MOS gain in overall quality. Ablation studies further confirm the effectiveness of the proposed enhancements.
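The abstract does not include implementation details, so as a rough illustration of what "long-context modeling over multiple sentences" with BERT embeddings can look like in practice, the sketch below pools sentence-level BERT vectors over a window of neighbouring sentences; the model choice, pooling strategy, and window size are assumptions for illustration, not the paper's design.

```python
# A minimal sketch (not the paper's architecture) of one of the listed ideas:
# conditioning an utterance on BERT embeddings pooled over the surrounding
# sentences, so the acoustic model sees paragraph-level context.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def sentence_embedding(sentence: str) -> torch.Tensor:
    """Mean-pool BERT token states into a single sentence vector."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state  # (1, T, 768)
    return hidden.mean(dim=1).squeeze(0)           # (768,)

def paragraph_context(sentences: list[str], index: int, window: int = 2) -> torch.Tensor:
    """Context vector for sentence `index`: average of embeddings in a +/- window."""
    lo, hi = max(0, index - window), min(len(sentences), index + window + 1)
    embs = torch.stack([sentence_embedding(s) for s in sentences[lo:hi]])
    return embs.mean(dim=0)  # would be concatenated to the TTS encoder input

paragraph = ["It was a dark night.", "The road was empty.", "She kept walking."]
context_vec = paragraph_context(paragraph, index=1)
print(context_vec.shape)  # torch.Size([768])
```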
Citations: 0
Adaptive Duration Modification of Speech using Masked Convolutional Networks and Open-Loop Time Warping
Pub Date: 2023-08-26 | DOI: 10.21437/ssw.2023-28
Ravi Shankar, Archana Venkataraman
We propose a new method to adaptively modify the rhythm of a given speech signal. We train a masked convolutional encoder-decoder network to generate an attention map between the input and target speech via a stochastic version of the mean absolute error loss function. Our model also predicts the length of the target speech signal using the encoder embeddings, which determines the number of time steps for the decoding operation. During testing, we use the learned attention map as a proxy for the frame-wise similarity matrix between the given input speech and an unknown target speech signal. From this matrix, we compute a warping path for rhythm modification in an open-loop fashion. Our experiments demonstrate that this adaptive framework achieves performance similar to the fully supervised dynamic time warping algorithm on both voice conversion and emotion conversion tasks. We also show that the modified speech utterances achieve high user quality ratings, highlighting the practical utility of our method.
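As an illustration of the open-loop idea described above (an assumption about its details, not the authors' code), the sketch below walks monotonically through a frame-wise similarity matrix and picks each alignment step greedily, rather than solving the full dynamic-time-warping problem.

```python
# A minimal numpy sketch of turning a frame-wise similarity matrix into a
# monotonic warping path in an open-loop fashion: each step is chosen greedily
# instead of by global DTW optimisation.
import numpy as np

def open_loop_warp(similarity: np.ndarray) -> list[tuple[int, int]]:
    """similarity[i, j]: similarity between input frame i and target frame j.
    Returns a monotonic (input, target) path covering all target frames."""
    n_in, n_out = similarity.shape
    path, i = [], 0
    for j in range(n_out):
        # Greedily decide whether to hold the current input frame or advance,
        # based on which alignment looks locally more similar.
        candidates = [i] + ([i + 1] if i + 1 < n_in else [])
        i = max(candidates, key=lambda k: similarity[k, j])
        path.append((i, j))
    return path

rng = np.random.default_rng(0)
sim = rng.random((50, 80))   # stand-in for a learned attention map
path = open_loop_warp(sim)
print(path[:5], path[-1])    # monotonic (input, target) frame pairs
```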
Citations: 0
Better Replacement for TTS Naturalness Evaluation
Pub Date: 2023-08-26 | DOI: 10.21437/ssw.2023-31
S. Shirali-Shahreza, Gerald Penn
Text-To-Speech (TTS) systems are commonly evaluated along two main dimensions: intelligibility and naturalness. While there are clear proxies for intelligibility measurements, such as transcription Word-Error-Rate (WER), naturalness is not nearly so well defined. In this paper, we present the results of our attempt to learn what aspects human listeners consider when they are asked to evaluate the “naturalness” of TTS systems. We conducted a user study similar to common TTS evaluations and, at the end, asked the subjects to define the sense of naturalness that they had used. We then coded their answers and statistically analysed the distribution of codes to create a list of aspects that users consider as part of naturalness. We can now provide a list of suggested replacement questions to use instead of a single oblique notion of naturalness.
Citations: 0
Advocating for text input in multi-speaker text-to-speech systems
Pub Date: 2023-08-26 | DOI: 10.21437/ssw.2023-1
G. Bailly, Martin Lenglet, O. Perrotin, E. Klabbers
{"title":"Advocating for text input in multi-speaker text-to-speech systems","authors":"G. Bailly, Martin Lenglet, O. Perrotin, E. Klabbers","doi":"10.21437/ssw.2023-1","DOIUrl":"https://doi.org/10.21437/ssw.2023-1","url":null,"abstract":"","PeriodicalId":346639,"journal":{"name":"12th ISCA Speech Synthesis Workshop (SSW2023)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127568201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
SPTK4: An Open-Source Software Toolkit for Speech Signal Processing
Pub Date: 2023-08-26 | DOI: 10.21437/ssw.2023-33
Takenori Yoshimura, Takato Fujimoto, Keiichiro Oura, K. Tokuda
The Speech Signal Processing ToolKit (SPTK) is an open-source suite of speech signal processing tools that has been developed and maintained by the SPTK working group and has contributed widely to the speech signal processing community since 1998. Although SPTK has reached over a hundred thousand downloads, its concepts and features have not yet been widely disseminated. This paper gives an overview of SPTK with demonstrations to provide a better understanding of the toolkit. We have recently developed its differentiable PyTorch version, diffsptk, to adapt to advancements in the deep learning field. The details of diffsptk are also presented in this paper. We hope that the toolkit will help developers and researchers working in the field of speech signal processing.
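For readers unfamiliar with the toolkit, the sketch below loosely follows the kind of example shown in the diffsptk documentation, computing mel-cepstral coefficients differentiably from a waveform. The module and argument names are taken on trust from that documentation and may differ between versions; treat this as a hedged sketch rather than a definitive usage guide.

```python
# A hedged sketch of the differentiable diffsptk interface: STFT followed by
# mel-cepstral analysis, both usable inside a PyTorch training graph.
# Module/argument names are assumptions based on the project's documented example.
import torch
import diffsptk

x = torch.sin(2 * torch.pi * 220 * torch.arange(16000) / 16000)  # 1 s test tone

stft = diffsptk.STFT(frame_length=400, frame_period=80, fft_length=512)
mcep = diffsptk.MelCepstralAnalysis(cep_order=24, fft_length=512, alpha=0.42)

X = stft(x)    # spectrogram, differentiable w.r.t. the waveform
mc = mcep(X)   # mel-cepstral coefficients, usable inside a training loss
print(mc.shape)
```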
Citations: 0
Spell4TTS: Acoustically-informed spellings for improving text-to-speech pronunciations
Pub Date: 2023-08-26 | DOI: 10.21437/ssw.2023-2
Jason Fong, Hao Tang, Simon King
Ensuring accurate pronunciation is critical for high-quality text-to-speech (TTS). This typically requires a phoneme-based pronunciation dictionary, which is labour-intensive and costly to create. Previous work has suggested using graphemes instead of phonemes, but the inevitable pronunciation errors that occur cannot be fixed, since there is no longer a pronunciation dictionary. As an alternative, speech-based self-supervised learning (SSL) models have been proposed for pronunciation control, but these models are computationally expensive to train, produce representations that are not easily interpretable, and capture unwanted non-phonemic information. To address these limitations, we propose Spell4TTS, a novel method that generates acoustically-informed word spellings. Spellings are both interpretable and easily edited. The method could be applied to any existing pre-built TTS system. Our experiments show that the method creates word spellings that lead to fewer TTS pronunciation errors than the original spellings, or an Automatic Speech Recognition baseline. Additionally, we observe that pronunciation can be further enhanced by ranking candidates in the space of SSL speech representations, and by incorporating Human-in-the-Loop screening over the top-ranked spellings devised by our method. By working with spellings of words (composed of characters), the method lowers the entry barrier for TTS system development for languages with limited pronunciation resources. It should reduce the time and cost involved in creating and maintaining pronunciation dictionaries.
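The candidate-ranking step mentioned above can be pictured with the following sketch, which scores hypothetical candidate spellings by cosine similarity to a reference pronunciation in an SSL embedding space. All function names, the similarity measure, and the toy embeddings are assumptions for illustration, not the paper's code.

```python
# A minimal sketch of ranking candidate spellings by how close their synthesised
# pronunciations land to a reference utterance in an SSL speech-embedding space.
import torch
import torch.nn.functional as F

def rank_spellings(candidate_embs: torch.Tensor, reference_emb: torch.Tensor) -> list[int]:
    """candidate_embs: (N, D) SSL embeddings of TTS renderings of N candidate spellings.
    reference_emb: (D,) SSL embedding of a reference pronunciation.
    Returns candidate indices sorted from closest to farthest."""
    sims = F.cosine_similarity(candidate_embs, reference_emb.unsqueeze(0), dim=-1)
    return torch.argsort(sims, descending=True).tolist()

# Toy stand-ins for embeddings that would come from an SSL speech model.
candidates = torch.randn(5, 768)
reference = torch.randn(768)
print(rank_spellings(candidates, reference))  # e.g. [3, 0, 4, 1, 2]
```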
Citations: 0
PRVAE-VC: Non-Parallel Many-to-Many Voice Conversion with Perturbation-Resistant Variational Autoencoder
Pub Date: 2023-08-26 | DOI: 10.21437/ssw.2023-14
Kou Tanaka, H. Kameoka, Takuhiro Kaneko
This paper describes a novel approach to non-parallel many-to-many voice conversion (VC) that utilizes a variant of the conditional variational autoencoder (VAE) called a perturbation-resistant VAE (PRVAE). In VAE-based VC, it is commonly assumed that the encoder extracts content from the input speech while removing source speaker information. Following this extraction, the decoder generates output from the extracted content and target speaker information. However, in practice, the encoded features may still retain source speaker information, which can lead to a degradation of speech quality during speaker conversion tasks. To address this issue, we propose a perturbation-resistant encoder trained to match the encoded features of the input speech with those of a pseudo-speech generated through a content-preserving transformation of the input speech’s fundamental frequency and spectral envelope using a combination of pure signal processing techniques. Our experimental results demonstrate that this straightforward constraint significantly enhances the performance in non-parallel many-to-many speaker conversion tasks. Audio samples can be accessed on our webpage.
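Below is a minimal sketch of the perturbation-resistance constraint described above, assuming a toy mel-spectrogram encoder and a stand-in perturbation; the actual system relies on signal-processing-based F0 and spectral-envelope transforms, and none of these module names come from the paper.

```python
# A minimal sketch of the perturbation-resistance idea: the encoder should give
# the same content code for the original speech and for a content-preserving
# perturbation of it.
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    def __init__(self, n_mels: int = 80, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Conv1d(n_mels, dim, 5, padding=2), nn.ReLU(),
                                 nn.Conv1d(dim, dim, 5, padding=2))

    def forward(self, mel: torch.Tensor) -> torch.Tensor:  # (B, n_mels, T) -> (B, dim, T)
        return self.net(mel)

def perturbation_resistance_loss(encoder: nn.Module,
                                 mel: torch.Tensor,
                                 mel_perturbed: torch.Tensor) -> torch.Tensor:
    """mel_perturbed is assumed to come from an F0 / spectral-envelope
    perturbation that keeps the linguistic content intact."""
    return torch.mean(torch.abs(encoder(mel) - encoder(mel_perturbed)))

enc = ToyEncoder()
mel = torch.randn(2, 80, 100)
mel_p = mel + 0.1 * torch.randn_like(mel)  # stand-in for a real perturbation
print(perturbation_resistance_loss(enc, mel, mel_p).item())
```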
Citations: 0
Stuck in the MOS pit: A critical analysis of MOS test methodology in TTS evaluation
Pub Date: 2023-08-26 | DOI: 10.21437/ssw.2023-7
Ambika Kirkland, Shivam Mehta, Harm Lameris, G. Henter, Éva Székely, Joakim Gustafson
The Mean Opinion Score (MOS) is a prevalent metric in TTS evaluation. Although standards for collecting and reporting MOS exist, researchers seem to use the term inconsistently, and underreport the details of their testing methodologies. A survey of Interspeech and SSW papers from 2021-2022 shows that most authors do not report scale labels, increments, or instructions to participants, and those who do diverge in terms of their implementation. It is also unclear in many cases whether listeners were asked to rate naturalness, or overall quality. MOS obtained for natural speech using different testing methodologies vary in the surveyed papers: specifically, quality MOS is on average higher than naturalness MOS. We carried out several listening tests using the same stimuli but with differences in the scale increment and instructions about what participants should rate, and found that both of these variables affected MOS for some systems.
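For context, the statistic under discussion is simply a mean over listener ratings; the sketch below computes a MOS with a 95% confidence interval on toy data. The ratings, the 1-5 scale, and the normal-approximation interval are illustrative assumptions, not the paper's data or protocol.

```python
# A minimal sketch of the per-system statistic being compared above: MOS with a
# 95% confidence interval over listener ratings (illustrative data only).
import numpy as np

def mos_with_ci(ratings: np.ndarray, z: float = 1.96) -> tuple[float, float]:
    """ratings: 1-D array of listener scores on a 1-5 scale."""
    mean = ratings.mean()
    half_width = z * ratings.std(ddof=1) / np.sqrt(len(ratings))
    return mean, half_width

natural = np.array([5, 4, 5, 4, 5, 3, 4, 5, 4, 4], dtype=float)
system_a = np.array([4, 3, 4, 4, 3, 4, 3, 4, 4, 3], dtype=float)

for name, r in [("natural", natural), ("system A", system_a)]:
    m, ci = mos_with_ci(r)
    print(f"{name}: MOS = {m:.2f} +/- {ci:.2f}")
```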
Citations: 3
Synthesising turn-taking cues using natural conversational data
Pub Date: 2023-08-26 | DOI: 10.21437/ssw.2023-12
Johannah O'Mahony, Catherine Lai, Simon King
As speech synthesis quality reaches high levels of naturalness for isolated utterances, more work is focusing on the synthesis of context-dependent conversational speech. The role of context in conversation is still poorly understood, and many contextual factors can affect an utterance's prosodic realisation. Most studies incorporating context use rich acoustic or textual embeddings of the previous context, then demonstrate improvements in overall naturalness. Such studies are not informative about what the context embedding represents, or how it affects an utterance's realisation. So instead, we narrow the focus to a single, explicit contextual factor. In the current work, this is turn-taking. We condition a speech synthesis model on whether an utterance is turn-final. Objective measures and targeted subjective evaluation are used to demonstrate that the model can synthesise turn-taking cues which are perceived by listeners, with results being speaker-dependent.
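One simple way to realise the conditioning described above is to add a learned embedding of the binary turn-final flag to the text-encoder states; the sketch below is an assumption for illustration, not the paper's architecture, and all names are hypothetical.

```python
# A minimal sketch of conditioning an utterance encoding on a single explicit
# binary factor: whether the utterance is turn-final.
import torch
import torch.nn as nn

class TurnFinalConditioner(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.turn_embedding = nn.Embedding(2, dim)  # 0 = turn-internal, 1 = turn-final

    def forward(self, encoder_states: torch.Tensor, turn_final: torch.Tensor) -> torch.Tensor:
        """encoder_states: (B, T, dim) text-encoder outputs.
        turn_final: (B,) tensor of 0/1 flags, broadcast over all frames."""
        cond = self.turn_embedding(turn_final).unsqueeze(1)  # (B, 1, dim)
        return encoder_states + cond

cond = TurnFinalConditioner()
states = torch.randn(4, 50, 256)
flags = torch.tensor([1, 0, 1, 0])
print(cond(states, flags).shape)  # torch.Size([4, 50, 256])
```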
Citations: 0
Re-examining the quality dimensions of synthetic speech
Pub Date: 2023-08-26 | DOI: 10.21437/ssw.2023-6
Fritz Seebauer, Michael Kuhlmann, Reinhold Haeb-Umbach, P. Wagner
The aim of this paper is to generate a more comprehensive framework for evaluating synthetic speech. To this end, a series of tests culminating in an exploratory factor analysis (EFA) has been carried out. The proposed dimensions that encapsulate the construct of “synthetic speech quality” are: “human-likeness”, “audio quality”, “negative emotion”, “dominance”, “positive emotion”, “calmness”, “seniority” and “gender”, with item-to-total correlations pointing towards “gender” being an orthogonal construct. A subsequent analysis of common acoustic features, found in forensic and phonetic literature, reveals very weak correlations with the proposed scales. Inter-rater and inter-item agreement measures additionally reveal low consistency within the scales. We also make the case that a more fine-grained approach is needed when investigating the quality of synthetic speech systems, and propose a method that attempts to capture individual quality dimensions in the time domain.
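As an illustration of the EFA step named above, scikit-learn's FactorAnalysis can be fit to a listeners-by-items rating matrix and its loadings inspected. The sketch uses toy random data rather than the study's ratings, and the varimax rotation argument is assumed to be available in recent scikit-learn versions.

```python
# A minimal sketch of an exploratory factor analysis over listener ratings,
# grouping rating items into latent dimensions via the fitted loadings.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
# 200 synthetic listeners x 8 rating items (stand-ins for the questionnaire items).
items = ["human-likeness", "audio quality", "negative emotion", "dominance",
         "positive emotion", "calmness", "seniority", "gender"]
ratings = rng.normal(size=(200, len(items)))

fa = FactorAnalysis(n_components=3, rotation="varimax", random_state=0)
fa.fit(ratings)

for k, loadings in enumerate(fa.components_):
    top = np.argsort(-np.abs(loadings))[:3]
    print(f"factor {k}:", [items[i] for i in top])
```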
Citations: 0