Improving sentence-level alignment of speech with imperfect transcripts using utterance concatenation and VAD
Alexandru Moldovan, Adriana Stan, M. Giurgiu
2016 IEEE 12th International Conference on Intelligent Computer Communication and Processing (ICCP), 2016-09-01
DOI: 10.1109/ICCP.2016.7737141 (https://doi.org/10.1109/ICCP.2016.7737141)
Citation count: 2
Abstract
Preparing data for speech processing applications generally requires expert knowledge and a large amount of time. Automating as much of this process as possible can therefore have a significant impact on expanding the number of languages for which spoken interaction with machines is available. In this paper we build upon a previously developed tool, ALISA, which aligns speech with imperfect transcripts using only 10 minutes of manually labelled data, in any alphabetic language. Although its word-level error rate is around 0.6%, we noticed that sentence-level accuracy is drastically reduced by a large number of sentence-initial word deletions. To overcome this problem, we propose two methods: one based on utterance concatenation, and one based on voice activity detection (VAD). The results show that these simple methods achieve around 10% relative improvement over the baseline results.
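The abstract does not say how VAD is applied, so the following is only a generic illustration of the idea, not the authors' method or part of ALISA. It sketches a simple energy-threshold detector that finds where speech begins in a signal, which is the kind of cue that could help locate a sentence-initial word. The function names, frame length, threshold, and the synthetic test signal are all assumptions chosen for the example.

```python
import math

def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def speech_onset(samples, frame_len=160, threshold=1e-3, min_frames=3):
    """Return the sample index where speech first starts, or None.

    A frame counts as speech when its energy exceeds `threshold` (an assumed
    value); the onset is the first frame that begins a run of `min_frames`
    consecutive speech frames, which filters out isolated clicks.
    """
    flags = [e > threshold for e in frame_energies(samples, frame_len)]
    run = 0
    for idx, is_speech in enumerate(flags):
        run = run + 1 if is_speech else 0
        if run == min_frames:
            return (idx - min_frames + 1) * frame_len
    return None

# Hypothetical signal: 0.1 s of silence, then 0.1 s of a 200 Hz tone at 16 kHz.
rate = 16000
silence = [0.0] * 1600
tone = [0.5 * math.sin(2 * math.pi * 200 * n / rate) for n in range(1600)]
onset = speech_onset(silence + tone)
```

On this synthetic input the detected onset falls at the boundary between the silence and the tone. Real recordings would need a threshold tuned to the noise floor, and practical VAD systems typically use more robust features than raw energy.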