{"title":"Discrete Unit based Masking for Improving Disentanglement in Voice Conversion","authors":"Philip H. Lee, Ismail Rasim Ulgen, Berrak Sisman","doi":"arxiv-2409.11560","DOIUrl":null,"url":null,"abstract":"Voice conversion (VC) aims to modify the speaker's identity while preserving\nthe linguistic content. Commonly, VC methods use an encoder-decoder\narchitecture, where disentangling the speaker's identity from linguistic\ninformation is crucial. However, the disentanglement approaches used in these\nmethods are limited as the speaker features depend on the phonetic content of\nthe utterance, compromising disentanglement. This dependency is amplified with\nattention-based methods. To address this, we introduce a novel masking\nmechanism in the input before speaker encoding, masking certain discrete speech\nunits that correspond highly with phoneme classes. Our work aims to reduce the\nphonetic dependency of speaker features by restricting access to some phonetic\ninformation. Furthermore, since our approach is at the input level, it is\napplicable to any encoder-decoder based VC framework. Our approach improves\ndisentanglement and conversion performance across multiple VC methods, showing\nsignificant effectiveness, particularly in attention-based method, with 44%\nrelative improvement in objective intelligibility.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"9 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11560","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Voice conversion (VC) aims to modify the speaker's identity while preserving the linguistic content. VC methods commonly use an encoder-decoder architecture, in which disentangling the speaker's identity from the linguistic information is crucial. However, the disentanglement approaches used in these methods are limited, as the speaker features depend on the phonetic content of the utterance, compromising disentanglement. This dependency is amplified in attention-based methods. To address this, we introduce a novel masking mechanism applied to the input before speaker encoding, masking certain discrete speech units that correspond strongly to phoneme classes. Our work aims to reduce the phonetic dependency of speaker features by restricting access to some phonetic information. Furthermore, since our approach operates at the input level, it is applicable to any encoder-decoder based VC framework. Our approach improves disentanglement and conversion performance across multiple VC methods, proving especially effective in the attention-based method, with a 44% relative improvement in objective intelligibility.
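
The abstract does not specify how unit-phoneme correspondence is measured or how the mask is applied. Below is a minimal sketch of the general idea, assuming frame-aligned discrete units (e.g., k-means codes over self-supervised features) and phoneme labels, and using a purity-style correspondence score; the function names, the threshold, and the mask token are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def unit_phoneme_correspondence(units, phonemes, n_units, n_phonemes):
    # Co-occurrence counts between discrete units and phoneme classes,
    # estimated from frame-aligned unit and phoneme sequences.
    counts = np.zeros((n_units, n_phonemes))
    for u, p in zip(units, phonemes):
        counts[u, p] += 1
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # guard against unseen units
    probs = counts / totals
    # Purity-style score: 1.0 means the unit always co-occurs with a
    # single phoneme class, i.e., it is highly phonetic.
    return probs.max(axis=1)

def mask_phonetic_units(units, correspondence, threshold=0.5, mask_id=-1):
    # Replace highly phonetic units with a mask token before the
    # sequence reaches the speaker encoder; other units pass through.
    units = np.asarray(units)
    masked = units.copy()
    masked[correspondence[units] > threshold] = mask_id
    return masked

# Toy usage with random frame-aligned units and phoneme labels.
rng = np.random.default_rng(0)
units = rng.integers(0, 8, size=200)      # 8 discrete units
phonemes = rng.integers(0, 4, size=200)   # 4 phoneme classes
score = unit_phoneme_correspondence(units, phonemes, n_units=8, n_phonemes=4)
masked_units = mask_phonetic_units(units, score, threshold=0.4)
```

Because the masking happens on the input token sequence rather than inside any particular encoder, a sketch like this slots in front of the speaker encoder of any encoder-decoder VC framework, which is the portability the abstract claims.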