A. Tatarinova, D. Prozorov. 2018 IEEE East-West Design & Test Symposium (EWDTS), 2018-09-01. DOI: 10.1109/EWDTS.2018.8524598 (https://doi.org/10.1109/EWDTS.2018.8524598)
Building Test Speech Dataset on Russian Language for Spoken Document Retrieval Task
The article presents a technique for creating a speech dataset intended for testing spoken document retrieval methods. The dataset includes radio news audio files with Russian speech, text files with the spoken words, text files with the words recognized by CMU Pocketsphinx, and a set of queries with their relevant documents indicated. Each query word in the set is labeled with the type of recognition error it exhibits: word replacement, word distortion, word split, or word deletion. The relevant documents for each query were indicated by an expert.
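The dataset components listed in the abstract can be pictured with a minimal Python sketch. Everything here is an assumption made for illustration: the abstract does not specify a file format, so the `Query` layout, the document identifiers, the sample query words, and the use of `None` for a correctly recognized word are hypothetical. Set-based precision and recall are a standard way to score a retrieval method against relevance judgments and are not necessarily the measure the authors used.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Set, Tuple


class RecognitionError(Enum):
    """The four recognition error types named in the abstract."""
    REPLACEMENT = "word replacement"
    DISTORTION = "word distortion"
    SPLIT = "word split"
    DELETION = "word deletion"


@dataclass
class Query:
    # Hypothetical layout; the real dataset format is not described here.
    words: List[str]
    # One label per query word; None marks a word without an error (assumption).
    errors: List[Optional[RecognitionError]]
    # Identifiers of documents an expert marked as relevant (hypothetical IDs).
    relevant_docs: Set[str] = field(default_factory=set)


def precision_recall(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Set-based precision and recall against the expert judgments."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Illustrative query with made-up words, labels, and document IDs.
query = Query(
    words=["новости", "погода"],
    errors=[None, RecognitionError.DISTORTION],
    relevant_docs={"news_001", "news_007"},
)

# Pretend output of some retrieval method under test.
retrieved = {"news_001", "news_003"}
p, r = precision_recall(retrieved, query.relevant_docs)
print(f"precision={p:.2f} recall={r:.2f}")
```

Keeping the error label next to each query word would make it possible to break retrieval scores down by error type, for example to compare queries affected by word splits against those affected by deletions.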