Effects of acoustic mismatches on speech recognition accuracies due to playback-recorded speech corpus
A. Suchato, S. Chanjaradwichai, N. Kertkeidkachorn, S. Vorapatratorn, P. Hirankan, T. Suri, K. Likitsupin, S. Chuetanapinyo, P. Punyabukkana
2012 9th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, pp. 1-4, 2012. DOI: 10.1109/ECTICON.2012.6254211
Abstract
Modern speech recognition techniques rely on large amounts of speech data whose acoustic characteristics match the operating environment to train their acoustic models. Gathering training data from loudspeakers playing back recorded speech utterances is far more practical than collecting it from human speakers. This paper presents results from speech recognition experiments that provide practical insights into the effects of utterances re-recorded from loudspeakers. A clean-speech corpus of sixty human speakers was built using two different microphones, and its playbacks were re-recorded. Results show that, with minimal lexical constraints, accuracy degraded for the playback-trained system even when there were no mismatches between training and test data. However, mismatches did not affect cases with tighter high-level constraints, such as number recognition and limited-vocabulary word recognition. A procedure to reduce the mismatches introduced by constructing a corpus from playbacks is presented; it was shown to bring the accuracy of a playback-trained system 48% closer to that of a system trained with speech from a matched environment.
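One plausible reading of the "48% closer" claim is as a gap-closure ratio: the fraction of the accuracy gap between the playback-trained system and the matched-environment system that the procedure eliminates. The sketch below illustrates that calculation; the accuracy values are hypothetical placeholders, not figures reported in the paper.

```python
# Minimal sketch of a gap-closure metric, assuming "48% closer" means the
# procedure removes 48% of the accuracy gap between the playback-trained
# system and the matched-environment baseline. All numbers are hypothetical.

def gap_closure(acc_playback: float, acc_adapted: float, acc_matched: float) -> float:
    """Fraction of the playback-vs-matched accuracy gap closed by the
    mismatch-reduction procedure."""
    gap_before = acc_matched - acc_playback   # gap without the procedure
    gap_after = acc_matched - acc_adapted     # gap after applying it
    return (gap_before - gap_after) / gap_before

# Hypothetical example: playback-trained 80.0%, after the procedure 85.3%,
# matched-environment baseline 91.0%.
print(f"{gap_closure(0.80, 0.853, 0.91):.0%}")  # -> 48%
```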