MacST: Multi-Accent Speech Synthesis via Text Transliteration for Accent Conversion
Sho Inoue, Shuai Wang, Wanxing Wang, Pengcheng Zhu, Mengxiao Bi, Haizhou Li
arXiv - CS - Sound, 2024-09-14
DOI: arxiv-2409.09352
Abstract
In accented voice conversion, or accent conversion, we seek to convert the
accent of speech from one to another while preserving speaker identity and
semantic content. In this study, we formulate a novel method for creating
multi-accented speech samples, that is, pairs of accented speech samples by the
same speaker, through text transliteration, for training accent conversion
systems. We begin by generating transliterated text with Large Language Models
(LLMs), which is then fed into multilingual TTS models to synthesize accented
English speech. As a reference system, we built a sequence-to-sequence model on
the synthetic parallel corpus for accent conversion. We validated the proposed
method for both native and non-native English speakers. Subjective and
objective evaluations further validate our dataset's effectiveness in accent
conversion studies.
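
The data-creation pipeline described above (transliterate the text, synthesize it with a multilingual TTS model, pair the outputs by speaker) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `build_parallel_corpus`, `AccentPair`, and the `transliterate` and `synthesize` callables are hypothetical names, the LLM and TTS calls are replaced by stand-in functions, and pairing every ordered combination of accents is an assumption that the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class AccentPair:
    """Two renderings of the same sentence by the same speaker."""
    text: str
    source_accent: str
    target_accent: str
    source_audio: bytes
    target_audio: bytes


def build_parallel_corpus(
    sentences: Sequence[str],
    accents: Sequence[str],
    transliterate: Callable[[str, str], str],
    synthesize: Callable[[str, str, str], bytes],
    speaker: str,
) -> List[AccentPair]:
    """Create accent pairs for one speaker (hypothetical helper).

    transliterate(text, accent) stands in for the LLM step;
    synthesize(script, accent, speaker) stands in for the multilingual TTS step.
    """
    pairs: List[AccentPair] = []
    for text in sentences:
        # Render the sentence once per accent, always with the same speaker.
        audio = {}
        for accent in accents:
            script = transliterate(text, accent)
            audio[accent] = synthesize(script, accent, speaker)
        # Assumption: every ordered (source, target) combination is a pair.
        for src in accents:
            for tgt in accents:
                if src != tgt:
                    pairs.append(
                        AccentPair(text, src, tgt, audio[src], audio[tgt])
                    )
    return pairs


# Stand-ins so the sketch runs without any LLM or TTS dependency.
def fake_transliterate(text: str, accent: str) -> str:
    return f"[{accent}] {text}"


def fake_synthesize(script: str, accent: str, speaker: str) -> bytes:
    return f"{speaker}|{accent}|{script}".encode("utf-8")


if __name__ == "__main__":
    corpus = build_parallel_corpus(
        ["hello world"], ["en", "hi", "ja"],
        fake_transliterate, fake_synthesize, speaker="spk1",
    )
    print(len(corpus), "pairs")
```

Passing the two model steps in as callables keeps the pairing logic independent of any particular LLM or TTS system, so real models could be substituted without changing the loop.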