Adding filled pauses and disfluent events into language models for speech recognition
J. Staš, D. Hládek, J. Juhár
2016 7th IEEE International Conference on Cognitive Infocommunications (CogInfoCom), 2016-10-01
DOI: 10.1109/COGINFOCOM.2016.7804538
Citations: 5
Abstract
Spontaneous speech varies far more than planned speech because of speech disruptions and the many ambiguities that arise in conversation. Speech recognition systems cannot properly evaluate these events during search and decoding, so various errors appear in the output hypotheses. One possible solution is to include filled pauses and disfluent events in the training data for statistical language modeling. This paper describes experimental results of modeling the most frequent filled pauses and disfluent events in annotated Slovak speech databases. By correctly representing selected prosodic events and non-speech sounds in the speech recognition dictionary and language models, we significantly improved robustness and speech recognition performance, by up to 10.88% on average, in the transcription of parliamentary speech.
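The idea of keeping filled pauses in the language model training data can be illustrated with a minimal sketch. It is not the authors' pipeline: the annotation tags (`[eh]`, `[hm]`, `[breath]`), the canonical tokens they map to, the toy English corpus, and the unsmoothed bigram model are all assumptions made for illustration. The sketch only shows how the same transcripts yield different n-gram statistics depending on whether annotated events are kept as tokens or dropped.

```python
from collections import Counter

# Hypothetical annotation tags mapped to canonical tokens (assumed, not from the paper).
FILLED_PAUSE_MAP = {"[eh]": "<fp_eh>", "[hm]": "<fp_hm>", "[breath]": "<breath>"}


def normalize(transcript, keep_events=True):
    """Lowercase a transcript and map annotated events to canonical tokens, or drop them."""
    tokens = []
    for raw in transcript.lower().split():
        if raw in FILLED_PAUSE_MAP:
            if keep_events:
                tokens.append(FILLED_PAUSE_MAP[raw])
        else:
            tokens.append(raw)
    return tokens


def bigram_counts(sentences):
    """Count bigrams over tokenized sentences padded with <s> and </s>."""
    counts = Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        counts.update(zip(padded, padded[1:]))
    return counts


def bigram_prob(counts, prev, word):
    """Maximum-likelihood estimate of P(word | prev), without smoothing."""
    total = sum(c for (p, _), c in counts.items() if p == prev)
    return counts[(prev, word)] / total if total else 0.0


# Toy corpus, invented for illustration only.
corpus = ["I think [eh] we should vote", "we should [eh] vote now"]
with_events = bigram_counts([normalize(s) for s in corpus])
without_events = bigram_counts([normalize(s, keep_events=False) for s in corpus])

print(bigram_prob(with_events, "should", "<fp_eh>"))
print(bigram_prob(without_events, "should", "vote"))
```

When events are kept, the model assigns probability mass to a filled pause after "should", so a decoder whose dictionary also contains `<fp_eh>` can hypothesize the pause instead of forcing it onto a similar-sounding word. A practical system would use a smoothed higher-order model rather than raw bigram counts.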