{"title":"Exploiting Information From Native Data for Non-Native Automatic Pronunciation Assessment","authors":"Binghuai Lin, Liyuan Wang","doi":"10.1109/SLT54892.2023.10022486","DOIUrl":null,"url":null,"abstract":"This paper proposes an end-to-end pronunciation assessment method to exploit the adequate native data and reduce the need for non-native data costly to label. To obtain discriminative acoustic representations at the phoneme level, the pretrained wav2vec 2.0 is re-trained with connectionist temporal classification (CTC) loss for phoneme recognition using native data. These acoustic representations are fused with phoneme representations derived from a phoneme encoder to obtain final pronunciation scores. An efficient fusion mechanism aligns each phoneme with acoustic frames based on attention, where all blank frames recognized by the CTC-based phoneme recognition are masked. Finally, the whole network is optimized by a multi-task learning framework combining CTC loss and mean square error loss between predicted and human scores. Extensive experiments demonstrate that it outperforms previous baselines in the Pearson correlation coefficient even with much fewer labeled non-native data.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"79 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE Spoken Language Technology Workshop (SLT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SLT54892.2023.10022486","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Cited by: 6
Abstract
This paper proposes an end-to-end pronunciation assessment method that exploits abundant native data to reduce the need for non-native data, which is costly to label. To obtain discriminative acoustic representations at the phoneme level, a pretrained wav2vec 2.0 model is re-trained on native data with a connectionist temporal classification (CTC) loss for phoneme recognition. These acoustic representations are fused with phoneme representations derived from a phoneme encoder to produce the final pronunciation scores. An efficient attention-based fusion mechanism aligns each phoneme with the acoustic frames, masking all frames that the CTC-based phoneme recognizer identifies as blank. Finally, the whole network is optimized in a multi-task learning framework that combines the CTC loss with a mean squared error (MSE) loss between predicted and human scores. Extensive experiments demonstrate that the method outperforms previous baselines in the Pearson correlation coefficient even with far less labeled non-native data.
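To make the fusion mechanism and the multi-task objective concrete, below is a minimal PyTorch sketch of the idea as described in the abstract. It is an illustration, not the authors' implementation: the dimensions, module names (`FusionScorer`, `multitask_loss`), the choice of a plain embedding as the phoneme encoder, the blank index, and the loss weight `alpha` are all assumptions for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionScorer(nn.Module):
    """Sketch: phoneme queries attend over wav2vec 2.0 frames; frames the CTC
    head labels as blank are masked out of the attention (assumed design)."""
    def __init__(self, acoustic_dim=768, phone_dim=256, num_phones=42, blank_id=0):
        super().__init__()
        self.blank_id = blank_id
        self.phone_emb = nn.Embedding(num_phones + 1, phone_dim)   # stand-in phoneme encoder
        self.ctc_head = nn.Linear(acoustic_dim, num_phones + 1)    # phoneme classes + CTC blank
        self.to_kv = nn.Linear(acoustic_dim, phone_dim)            # project frames to phoneme space
        self.attn = nn.MultiheadAttention(phone_dim, num_heads=4, batch_first=True)
        self.scorer = nn.Linear(phone_dim, 1)                      # per-phoneme score head

    def forward(self, frames, phone_ids):
        # frames: (B, T, acoustic_dim) wav2vec 2.0 features
        # phone_ids: (B, N) canonical phoneme sequence to be scored
        ctc_logits = self.ctc_head(frames)                         # (B, T, num_phones+1)
        blank_mask = ctc_logits.argmax(-1) == self.blank_id        # True where CTC predicts blank
        q = self.phone_emb(phone_ids)                              # (B, N, phone_dim) queries
        kv = self.to_kv(frames)                                    # (B, T, phone_dim) keys/values
        # key_padding_mask drops blank frames from every phoneme's attention
        fused, _ = self.attn(q, kv, kv, key_padding_mask=blank_mask)
        scores = torch.sigmoid(self.scorer(fused)).squeeze(-1)     # (B, N) scores in [0, 1]
        return ctc_logits, scores

def multitask_loss(ctc_logits, scores, targets, target_lens, frame_lens,
                   human_scores, alpha=0.5):
    """Combine CTC loss (phoneme recognition) with MSE against human scores;
    alpha is an illustrative weighting, not a value from the paper."""
    log_probs = F.log_softmax(ctc_logits, dim=-1).transpose(0, 1)  # (T, B, C) for ctc_loss
    ctc = F.ctc_loss(log_probs, targets, frame_lens, target_lens, blank=0)
    mse = F.mse_loss(scores, human_scores)
    return alpha * ctc + (1 - alpha) * mse
```

One practical note on the masking design: using `key_padding_mask` concentrates each phoneme's attention on frames the CTC head considers phonetic content, which matches the abstract's description of masking all recognized blank frames; in a real system one would guard against utterances where every frame is predicted blank, since an all-True mask makes the attention undefined.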