基于Albayzin 2018搜索的语音评价gts - ehu系统

IberSPEECH Conference Pub Date : 2018-11-21 DOI:10.21437/IberSPEECH.2018-52

Luis Javier Rodriguez-Fuentes, M. Peñagarikano, A. Varona, Germán Bordel

{"title":"基于Albayzin 2018搜索的语音评价gts - ehu系统","authors":"Luis Javier Rodriguez-Fuentes, M. Peñagarikano, A. Varona, Germán Bordel","doi":"10.21437/IberSPEECH.2018-52","DOIUrl":null,"url":null,"abstract":"This paper describes the systems developed by GTTS-EHU for the QbE-STD and STD tasks of the Albayzin 2018 Search on Speech Evaluation. Stacked bottleneck features (sBNF) are used as frame-level acoustic representation for both audio documents and spoken queries. In QbE-STD, a flavour of segmental DTW (originally developed for MediaEval 2013) is used to perform the search, which iteratively finds the match that minimizes the average distance between two test-normalized sBNF vectors, until either a maximum number of hits is obtained or the score does not attain a given threshold. The STD task is performed by synthesizing spoken queries (using publicly available TTS APIs), then averaging their sBNF representations and using the average query for QbE-STD. A publicly available toolkit (developed by BUT/Phonexia) has been used to extract three sBNF sets, trained for English monophone and triphone state posteriors (contrastive systems 3 and 4) and for multilingual triphone posteriors (contrastive system 2), respectively. The concatenation of the three sBNF sets has been also tested (contrastive system 1). The primary system consists of a discriminative fusion of the four contrastive systems. Detection scores are normalized on a query-by-query basis (qnorm), calibrated and, if two or more systems are considered, fused with other scores. Calibration and fusion parameters are discriminatively estimated using the ground truth of development data. Finally, due to a lack of robustness in calibration, Yes/No decisions are made by applying the MTWV thresholds obtained for the development sets, except for the COREMAH test set. In this case, calibration is based on the MAVIR corpus, and the 15% highest scores are taken as positive (Yes) detections.","PeriodicalId":115963,"journal":{"name":"IberSPEECH Conference","volume":"70 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"GTTS-EHU Systems for the Albayzin 2018 Search on Speech Evaluation\",\"authors\":\"Luis Javier Rodriguez-Fuentes, M. Peñagarikano, A. Varona, Germán Bordel\",\"doi\":\"10.21437/IberSPEECH.2018-52\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper describes the systems developed by GTTS-EHU for the QbE-STD and STD tasks of the Albayzin 2018 Search on Speech Evaluation. Stacked bottleneck features (sBNF) are used as frame-level acoustic representation for both audio documents and spoken queries. In QbE-STD, a flavour of segmental DTW (originally developed for MediaEval 2013) is used to perform the search, which iteratively finds the match that minimizes the average distance between two test-normalized sBNF vectors, until either a maximum number of hits is obtained or the score does not attain a given threshold. The STD task is performed by synthesizing spoken queries (using publicly available TTS APIs), then averaging their sBNF representations and using the average query for QbE-STD. A publicly available toolkit (developed by BUT/Phonexia) has been used to extract three sBNF sets, trained for English monophone and triphone state posteriors (contrastive systems 3 and 4) and for multilingual triphone posteriors (contrastive system 2), respectively. The concatenation of the three sBNF sets has been also tested (contrastive system 1). The primary system consists of a discriminative fusion of the four contrastive systems. Detection scores are normalized on a query-by-query basis (qnorm), calibrated and, if two or more systems are considered, fused with other scores. Calibration and fusion parameters are discriminatively estimated using the ground truth of development data. Finally, due to a lack of robustness in calibration, Yes/No decisions are made by applying the MTWV thresholds obtained for the development sets, except for the COREMAH test set. In this case, calibration is based on the MAVIR corpus, and the 15% highest scores are taken as positive (Yes) detections.\",\"PeriodicalId\":115963,\"journal\":{\"name\":\"IberSPEECH Conference\",\"volume\":\"70 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-11-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IberSPEECH Conference\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.21437/IberSPEECH.2018-52\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IberSPEECH Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21437/IberSPEECH.2018-52","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 4

摘要

本文介绍了gts - ehu为Albayzin 2018语音评价检索的QbE-STD和STD任务开发的系统。堆叠瓶颈特征(sBNF)被用作音频文档和语音查询的帧级声学表示。在QbE-STD中，使用分段DTW(最初是为MediaEval 2013开发的)来执行搜索，迭代地找到最小化两个测试标准化sBNF向量之间平均距离的匹配，直到获得最大命中数或得分未达到给定阈值。STD任务是通过合成语音查询(使用公开可用的TTS api)来执行的，然后对它们的sBNF表示取平均值，并使用QbE-STD的平均查询。一个公开可用的工具包(由BUT/Phonexia开发)被用来提取三个sBNF集，分别训练为英语单音和三音状态后置(对比系统3和4)和多语言三音后置(对比系统2)。我们还测试了三个sBNF集的连接(对比系统1)。主系统由四个对比系统的判别融合组成。检测分数在逐个查询的基础上进行标准化(qnorm)，进行校准，如果考虑两个或多个系统，则与其他分数融合。利用开发数据的真实值判别估计校准和融合参数。最后，由于校准缺乏鲁棒性，除了COREMAH测试集之外，Yes/No决策是通过应用开发集获得的MTWV阈值来做出的。在这种情况下，校准是基于MAVIR语料库，15%的最高分被视为阳性(Yes)检测。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

GTTS-EHU Systems for the Albayzin 2018 Search on Speech Evaluation

This paper describes the systems developed by GTTS-EHU for the QbE-STD and STD tasks of the Albayzin 2018 Search on Speech Evaluation. Stacked bottleneck features (sBNF) are used as frame-level acoustic representation for both audio documents and spoken queries. In QbE-STD, a flavour of segmental DTW (originally developed for MediaEval 2013) is used to perform the search, which iteratively finds the match that minimizes the average distance between two test-normalized sBNF vectors, until either a maximum number of hits is obtained or the score does not attain a given threshold. The STD task is performed by synthesizing spoken queries (using publicly available TTS APIs), then averaging their sBNF representations and using the average query for QbE-STD. A publicly available toolkit (developed by BUT/Phonexia) has been used to extract three sBNF sets, trained for English monophone and triphone state posteriors (contrastive systems 3 and 4) and for multilingual triphone posteriors (contrastive system 2), respectively. The concatenation of the three sBNF sets has been also tested (contrastive system 1). The primary system consists of a discriminative fusion of the four contrastive systems. Detection scores are normalized on a query-by-query basis (qnorm), calibrated and, if two or more systems are considered, fused with other scores. Calibration and fusion parameters are discriminatively estimated using the ground truth of development data. Finally, due to a lack of robustness in calibration, Yes/No decisions are made by applying the MTWV thresholds obtained for the development sets, except for the COREMAH test set. In this case, calibration is based on the MAVIR corpus, and the 15% highest scores are taken as positive (Yes) detections.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IberSPEECH Conference

自引率

0.00%

发文量