Leveraging large amounts of loosely transcribed corporate videos for acoustic model training
Authors: M. Paulik, P. Panchapagesan
Venue: 2011 IEEE Workshop on Automatic Speech Recognition & Understanding
Publication date: 2011-12-01
DOI: 10.1109/ASRU.2011.6163912 (https://doi.org/10.1109/ASRU.2011.6163912)
Citations: 5
Abstract
Lightly supervised acoustic model (AM) training has seen a tremendous amount of interest over the past decade. It promises significant cost savings by relying on only small amounts of accurately transcribed speech and large amounts of imperfectly (loosely) transcribed speech. The latter can often be acquired from existing sources at no additional cost. We identify corporate videos as one such source. After reviewing the state of the art in lightly supervised AM training, we describe our efforts to exploit 977 hours of loosely transcribed corporate videos for AM training. We report strong reductions in word error rate of up to 19.4% over our baseline. We also report initial results for a simple yet effective scheme to identify a subset of lightly supervised training labels that are more important to the training process.