The T05 System for The VoiceMOS Challenge 2024: Transfer Learning from Deep Image Classifier to Naturalness MOS Prediction of High-Quality Synthetic Speech

Kaito Baba, Wataru Nakata, Yuki Saito, Hiroshi Saruwatari
arXiv - CS - Sound, published 2024-09-14. DOI: arxiv-2409.09305 (https://doi.org/arxiv-2409.09305)

Abstract

We present our system (denoted as T05) for the VoiceMOS Challenge (VMC) 2024. Our system was designed for VMC 2024 Track 1, which focused on accurately predicting the naturalness mean opinion score (MOS) of high-quality synthetic speech. In addition to a pretrained self-supervised learning (SSL)-based speech feature extractor, our system incorporates a pretrained image feature extractor to capture differences in synthetic speech that are observable in speech spectrograms. We first separately train two MOS predictors, one using the SSL-based feature and the other using the spectrogram-based feature. Then, we fine-tune the two predictors for better MOS prediction using the fusion of the two extracted features. In VMC 2024 Track 1, our T05 system achieved first place in 7 out of 16 evaluation metrics and second place in the remaining 9 metrics, with a significant difference compared to the systems ranked third and below. We also report the results of our ablation study to investigate essential factors of our system.
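The two-stage training described in the abstract (separate predictors first, then fine-tuning on fused features) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not the T05 implementation: the random 2-dimensional vectors stand in for the outputs of the pretrained SSL and image feature extractors, the predictors are plain linear heads, fusion is done by concatenation, and the fused head is initialized as the average of the two stage-1 predictors. The function names (`fit`, `run_two_stage_demo`) are hypothetical.

```python
import random


def predict(w, b, x):
    """Linear MOS head: dot(w, x) + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def mse(xs, ys, w, b):
    """Mean squared error of the linear head on a dataset."""
    return sum((predict(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit(xs, ys, w, b, lr=0.05, epochs=300):
    """Full-batch gradient descent on MSE, starting from (w, b)."""
    n = len(xs)
    w = list(w)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(xs, ys):
            err = predict(w, b, x) - y
            for i, xi in enumerate(x):
                grad_w[i] += 2.0 * err * xi / n
            grad_b += 2.0 * err / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def run_two_stage_demo(seed=0, n=200):
    rng = random.Random(seed)
    # Hypothetical stand-ins for frozen extractor outputs (2 dims each).
    ssl_feats = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(n)]
    spec_feats = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(n)]
    # Synthetic MOS labels that depend on both feature streams (assumption).
    mos = [3.0 + 0.8 * s[0] + 0.5 * p[0] + rng.gauss(0, 0.1)
           for s, p in zip(ssl_feats, spec_feats)]

    # Stage 1: train each predictor separately on its own feature.
    w_ssl, b_ssl = fit(ssl_feats, mos, [0.0, 0.0], 0.0)
    w_spec, b_spec = fit(spec_feats, mos, [0.0, 0.0], 0.0)

    # Stage 2: fuse features by concatenation, start from the average of
    # the two stage-1 predictors, then fine-tune on the fused input.
    fused = [s + p for s, p in zip(ssl_feats, spec_feats)]
    w0 = [0.5 * v for v in w_ssl + w_spec]
    b0 = 0.5 * (b_ssl + b_spec)
    w_f, b_f = fit(fused, mos, w0, b0)

    return {
        "ssl_only": mse(ssl_feats, mos, w_ssl, b_ssl),
        "spec_only": mse(spec_feats, mos, w_spec, b_spec),
        "fused_init": mse(fused, mos, w0, b0),
        "fused_finetuned": mse(fused, mos, w_f, b_f),
    }


if __name__ == "__main__":
    for name, value in run_two_stage_demo().items():
        print(f"{name}: {value:.4f}")
```

On this synthetic data the fine-tuned fused head should reach a lower error than either single-feature predictor, because the labels were constructed to depend on both streams. That says nothing about the size of the gain in the real system, which the abstract does not quantify.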