Multiresolution CNN for reverberant speech recognition

Sunchan Park, Yongwon Jeong, H. S. Kim

2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), November 2017
DOI: 10.1109/ICSDA.2017.8384470
Citations: 18
Abstract
The performance of automatic speech recognition (ASR) has been greatly improved by deep neural network (DNN) acoustic models. However, DNN-based systems still perform poorly in reverberant environments. Convolutional neural network (CNN) acoustic models have shown lower word error rates (WERs) in distant speech recognition than fully connected DNN acoustic models. To improve reverberant speech recognition with CNN acoustic models, we propose a multiresolution CNN with two separate streams: one processes wideband features with a wide context window, and the other processes narrowband features with a narrow context window. Experiments on the ASR task of the REVERB Challenge 2014 showed that the proposed multiresolution CNN-based approach reduced the WER by 8.79% on the simulated test data and 8.83% on the real-condition test data, compared with the conventional CNN-based method.
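The two-stream idea above can be sketched at the input-preparation stage: each stream receives the same frame sequence spliced with a different amount of temporal context before entering its CNN. The following is a minimal numpy sketch; the feature dimension (40 log-mel bins) and the context half-widths (15 and 5 frames) are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def splice_context(frames, left, right):
    """Stack each frame with `left`/`right` neighboring frames
    (edge-padded), yielding one CNN input patch per frame."""
    num_frames, feat_dim = frames.shape
    padded = np.pad(frames, ((left, right), (0, 0)), mode="edge")
    return np.stack(
        [padded[t:t + left + right + 1] for t in range(num_frames)]
    )

# Assumed dimensions: 100 frames of 40-dim filterbank features.
feats = np.random.randn(100, 40)

# Wide-context input for one stream, narrow-context for the other;
# each (num_frames, context_width, feat_dim) tensor would feed a
# separate convolutional stream before the streams are combined.
wide = splice_context(feats, 15, 15)    # shape (100, 31, 40)
narrow = splice_context(feats, 5, 5)    # shape (100, 11, 40)
```

The per-frame patches from the two calls differ only in temporal span, which is what lets the two CNN streams see the same utterance at different resolutions.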