Yu Tsao, Shigeki Matsuda, Satoshi Nakamura, Chin-Hui Lee
MAP estimation of online mapping parameters in ensemble speaker and speaking environment modeling
2009 IEEE Workshop on Automatic Speech Recognition & Understanding, 2009-12-01
DOI: 10.1109/ASRU.2009.5373236 (https://doi.org/10.1109/ASRU.2009.5373236)
Citations: 3
Abstract
Recently, an ensemble speaker and speaking environment modeling (ESSEM) framework was proposed to enhance automatic speech recognition performance under adverse conditions. In the online phase of ESSEM, the environment structure prepared in the offline stage is transformed by a mapping function into a set of acoustic models for the target testing environment. In the original ESSEM framework, the mapping function parameters are estimated based on a maximum likelihood (ML) criterion. In this study, we propose to use a maximum a posteriori (MAP) criterion to estimate the mapping function parameters, in order to avoid a possible over-fitting problem that can degrade the accuracy of environment characterization. For the MAP estimation, we also study two types of prior density, namely the clustered prior and the hierarchical prior. On the Aurora-2 task, with either type of prior density, MAP-based ESSEM achieves better performance than ML-based ESSEM, especially under low signal-to-noise ratio (SNR) conditions. Compared with our best baseline results, MAP-based ESSEM achieves a 14.97% relative word error rate reduction (from 5.41% to 4.60%) on average over SNRs from 0 dB to 20 dB across the three testing sets.
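The abstract does not give the ESSEM mapping function or the form of its priors, so the following Python sketch is only a scalar illustration of why a MAP criterion can curb over-fitting relative to ML, not the authors' estimator. It assumes a Gaussian mean with known noise variance and a Gaussian prior; the function names and all numeric inputs other than the reported word error rates are hypothetical. The same block also checks the arithmetic behind the reported relative word error rate reduction.

```python
def relative_reduction(baseline, improved):
    """Relative reduction from baseline to improved, in percent."""
    return 100.0 * (baseline - improved) / baseline


def ml_mean(samples):
    """ML estimate of a Gaussian mean: the plain sample average."""
    return sum(samples) / len(samples)


def map_mean(samples, prior_mean, prior_var, noise_var):
    """MAP estimate of a Gaussian mean under a Gaussian prior.

    Assumes the observation noise variance is known. The result is a
    precision-weighted average of the prior mean and the sample average,
    so with few samples it is pulled toward the prior.
    """
    prior_precision = 1.0 / prior_var
    data_precision = len(samples) / noise_var
    weighted_sum = prior_precision * prior_mean + data_precision * ml_mean(samples)
    return weighted_sum / (prior_precision + data_precision)


if __name__ == "__main__":
    # Word error rates reported in the abstract: 5.41% baseline, 4.60% MAP-based ESSEM.
    print(f"relative WER reduction: {relative_reduction(5.41, 4.60):.2f}%")

    # Hypothetical toy data: one observation versus one hundred.
    few, many = [2.0], [2.0] * 100
    print("ML  (1 sample):   ", ml_mean(few))
    print("MAP (1 sample):   ", map_mean(few, 0.0, 1.0, 1.0))
    print("MAP (100 samples):", map_mean(many, 0.0, 1.0, 1.0))
```

With a single observation the MAP estimate sits halfway between the prior mean and the observation, while with many observations it approaches the ML estimate. This mirrors the motivation stated in the abstract: the prior matters most when adaptation data are scarce or unreliable.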