Relevant Phonetic-aware Neural Acoustic Models using Native English and Japanese Speech for Japanese-English Automatic Speech Recognition

Ryo Masumura, Suguru Kabashima, Takafumi Moriya, Satoshi Kobashikawa, Y. Yamaguchi, Y. Aono

2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), November 2018
DOI: 10.23919/APSIPA.2018.8659784
Citations: 1
Abstract
This paper proposes relevant phonetic-aware neural acoustic models that leverage native Japanese and native English speech to improve automatic speech recognition (ASR) of Japanese-English speech. Accurately transcribing Japanese-English speech requires acoustic models specific to it, since Japanese-English speech exhibits pronunciations that differ from those of native English speech. The major problem is that it is difficult to collect large amounts of Japanese-English speech for constructing acoustic models. Our motivation is therefore to efficiently leverage the significant amounts of native English and native Japanese speech material available, since Japanese-English is strongly influenced by both. Our idea is to utilize them indirectly to enhance the phonetic awareness of Japanese-English acoustic models: native English speech can be expected to improve the classification of English-like phonemes, while native Japanese speech can be expected to improve the classification of Japanese-like phonemes. In the proposed relevant phonetic-aware neural acoustic models, this idea is implemented by utilizing the bottleneck features of native English and native Japanese neural acoustic models. Our experiments construct the relevant phonetic-aware neural acoustic models using 300 hours of Japanese-English speech, 1,500 hours of native Japanese speech, and 900 hours of native English speech. We demonstrate the effectiveness of our proposal using evaluation data sets that involve four levels of Japanese-English.
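The core mechanism described above — augmenting the target model's input with bottleneck features extracted by frozen native-language acoustic models — can be sketched as follows. This is a minimal toy illustration with random weights and made-up dimensions, not the paper's actual DNN architecture or feature configuration; in practice the extractors would be trained on the native English and native Japanese corpora and the augmented features fed to the Japanese-English acoustic model.

```python
import numpy as np

rng = np.random.default_rng(0)

def bottleneck(x, w_hidden, w_bn):
    """Extract low-dimensional bottleneck features from one frozen network."""
    h = np.tanh(x @ w_hidden)   # hidden layer
    return np.tanh(h @ w_bn)    # narrow bottleneck layer

feat_dim, hid_dim, bn_dim, n_frames = 40, 64, 8, 5
x = rng.standard_normal((n_frames, feat_dim))  # toy acoustic feature frames

# Stand-ins for the frozen native English / native Japanese extractors
# (random weights here; trained on native corpora in the paper's setting).
w_en = (rng.standard_normal((feat_dim, hid_dim)), rng.standard_normal((hid_dim, bn_dim)))
w_ja = (rng.standard_normal((feat_dim, hid_dim)), rng.standard_normal((hid_dim, bn_dim)))

bn_en = bottleneck(x, *w_en)
bn_ja = bottleneck(x, *w_ja)

# Phonetic-aware input: original features plus both bottleneck streams,
# giving the target model English-like and Japanese-like phonetic cues.
augmented = np.concatenate([x, bn_en, bn_ja], axis=1)
print(augmented.shape)  # (5, 56): 40 base + 8 English BN + 8 Japanese BN
```

The concatenated features would then train the Japanese-English acoustic model on the (scarce) 300-hour target corpus, so the large native corpora contribute only indirectly, through the two bottleneck streams.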