The T05 System for The VoiceMOS Challenge 2024: Transfer Learning from Deep Image Classifier to Naturalness MOS Prediction of High-Quality Synthetic Speech
Kaito Baba, Wataru Nakata, Yuki Saito, Hiroshi Saruwatari
arXiv - CS - Sound, 2024-09-14, https://doi.org/arxiv-2409.09305
We present our system (denoted as T05) for the VoiceMOS Challenge (VMC) 2024.
Our system was designed for the VMC 2024 Track 1, which focused on the accurate
prediction of naturalness mean opinion score (MOS) for high-quality synthetic
speech. In addition to a pretrained self-supervised learning (SSL)-based speech
feature extractor, our system incorporates a pretrained image feature extractor
to capture differences in synthetic speech that are visible in speech spectrograms.
We first train two MOS predictors separately: one uses SSL-based features and the
other uses spectrogram-based features. We then fine-tune the two predictors for better
MOS prediction using the fusion of the two extracted features. In the VMC 2024 Track 1,
our T05 system achieved first place in 7 out of 16 evaluation metrics and
second place in the remaining 9 metrics, with a significant margin over the
systems ranked third and below. We also report the results of an ablation
study conducted to identify the essential factors of our system.
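The two-stage recipe described above (train two predictors separately, then fine-tune on fused features) can be illustrated with a toy sketch. This is not the T05 implementation: the real system uses pretrained neural feature extractors, whereas here the "SSL" and "spectrogram" features are hypothetical random vectors, the predictors are plain linear regressors, fusion is assumed to be concatenation, and the fused predictor is assumed to start from the average of the two stage-1 predictors. All names (`LinearHead`, `fuse`, `train`) are made up for illustration.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

class LinearHead:
    """Toy MOS regressor: score = w . x + b (stand-in for a neural predictor)."""
    def __init__(self, weights, bias=0.0):
        self.w = list(weights)
        self.b = bias

    def predict(self, x):
        return dot(self.w, x) + self.b

    def sgd_step(self, x, target, lr):
        # Gradient step on the squared error 0.5 * (pred - target)^2.
        err = self.predict(x) - target
        self.w = [wi - lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= lr * err

def mse(head, feats, targets):
    return sum((head.predict(x) - t) ** 2 for x, t in zip(feats, targets)) / len(targets)

def train(head, feats, targets, epochs, lr):
    for _ in range(epochs):
        for x, t in zip(feats, targets):
            head.sgd_step(x, t, lr)

def fuse(ssl_feat, spec_feat):
    # Fusion by concatenation (an assumption; the abstract does not say how).
    return ssl_feat + spec_feat

# Hypothetical data: 4-dim "SSL" and 3-dim "spectrogram" features, with
# synthetic MOS targets that depend on both feature sets.
rng = random.Random(0)
ssl_feats = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(200)]
spec_feats = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
mos = [3.0 + 0.5 * s[0] - 0.3 * s[1] + 0.4 * p[2]
       for s, p in zip(ssl_feats, spec_feats)]

# Stage 1: train the two predictors separately, each on its own feature type.
ssl_head = LinearHead([0.0] * 4)
spec_head = LinearHead([0.0] * 3)
train(ssl_head, ssl_feats, mos, epochs=50, lr=0.05)
train(spec_head, spec_feats, mos, epochs=50, lr=0.05)

# Stage 2: start from the average of the two stage-1 predictors
# (an assumed initialisation), then fine-tune on the fused features.
fused_feats = [fuse(s, p) for s, p in zip(ssl_feats, spec_feats)]
fused_head = LinearHead([0.5 * w for w in ssl_head.w + spec_head.w],
                        bias=0.5 * (ssl_head.b + spec_head.b))
mse_before = mse(fused_head, fused_feats, mos)
train(fused_head, fused_feats, mos, epochs=50, lr=0.05)
mse_after = mse(fused_head, fused_feats, mos)

print(f"fused MSE before fine-tuning: {mse_before:.4f}")
print(f"fused MSE after fine-tuning:  {mse_after:.6f}")
```

Because the synthetic targets depend on both feature sets, neither stage-1 predictor can fit them alone, while the fused predictor can. That is the intuition behind combining the two views, under the toy assumptions stated above.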