Building Test Speech Dataset on Russian Language for Spoken Document Retrieval Task
A. Tatarinova, D. Prozorov
2018 IEEE East-West Design & Test Symposium (EWDTS), published 2018-09-01
DOI: 10.1109/EWDTS.2018.8524598 (https://doi.org/10.1109/EWDTS.2018.8524598)
Citations: 1
Abstract
The article presents a technique for creating a speech dataset intended for testing spoken document retrieval methods. The dataset includes radio news audio files with speech in Russian, text files with the spoken words, text files with the words recognized by CMU Pocketsphinx, and a set of queries with the relevant documents indicated. Query words in the set are labeled with the type of recognition error, which is one of word replacement, word distortion, word split, or word deletion. The dataset also contains an expert's indication of which documents are relevant to each query.
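As a rough illustration of how such a test set could be used, the sketch below models one query with its per-word error labels and expert relevance judgments, then scores a retrieval result against them. The class and field names, the sample words, and the document identifiers are all hypothetical, and set-based precision and recall are assumed here as the evaluation measure; the abstract does not specify the dataset's file layout or the metrics used.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional, Set, Tuple


class RecognitionError(Enum):
    """The four error types named in the abstract."""
    REPLACEMENT = "word replacement"
    DISTORTION = "word distortion"
    SPLIT = "word split"
    DELETION = "word deletion"


@dataclass
class Query:
    # Hypothetical layout: each query word maps to the recognition error
    # labeled for it, or None if no error is labeled.
    words: Dict[str, Optional[RecognitionError]]
    # Documents an expert marked as relevant to this query.
    relevant_docs: Set[str] = field(default_factory=set)


def precision_recall(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Set-based precision and recall for a single query."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up placeholder query and document IDs, for illustration only.
query = Query(
    words={"выборы": RecognitionError.DISTORTION, "президент": None},
    relevant_docs={"news_001", "news_007"},
)

# Suppose a retrieval method returned these two documents for the query.
p, r = precision_recall({"news_001", "news_003"}, query.relevant_docs)
print(f"precision={p:.2f} recall={r:.2f}")
```

Keeping the error label per query word makes it possible to group results by error type, for example to compare how a retrieval method copes with split words versus deleted words.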