DNN adaptation for recognition of children speech through automatic utterance selection

M. Matassoni, D. Falavigna, D. Giuliani
2016 IEEE Spoken Language Technology Workshop (SLT), published 2016-12-01
DOI: 10.1109/SLT.2016.7846331 (https://doi.org/10.1109/SLT.2016.7846331)
Citations: 6

Abstract

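This paper describes an approach for adapting a DNN trained on adult speech to children's voices. The method extends a previous one, based on the Kullback-Leibler divergence between the original (adult) DNN output distribution and the target one, by accounting for the quality of the supervision of the adaptation utterances. In addition, starting from the observation that gradually removing the sentences with higher WERs from the adaptation set yields significant performance improvements, we also investigate the automatic selection of adaptation utterances. To determine transcription quality, we investigate the use of confidence estimates of the recognized hypotheses. We present experiments and results on an Italian data set of children's speech. We show that the proposed DNN adaptation approach significantly reduces the WER on a given test set from 14.2% (obtained with the non-adapted DNN, trained on adult speech) to 10.6%. Notably, the latter result was achieved without using any training data specific to children's speech.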
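The abstract names two mechanisms without giving formulas: a Kullback-Leibler-based adaptation target and a confidence-based selection of adaptation utterances. The sketch below is only an illustration under stated assumptions, not the authors' implementation. It assumes the commonly used form of KL-divergence regularization, in which the training target is a mix of the one-hot label and the adult DNN's posteriors, and it assumes selection is a simple threshold on a per-utterance confidence score. The names `interpolated_target`, `select_utterances`, `relative_wer_reduction`, the weight `rho`, and the threshold are all hypothetical.

```python
from typing import List, Sequence, Tuple


def interpolated_target(label: int,
                        adult_posteriors: Sequence[float],
                        rho: float) -> List[float]:
    """Mix a one-hot label with the adult DNN's posteriors.

    Assumed form of KL-regularized adaptation: rho = 0 trusts the
    (possibly errorful) label fully, rho = 1 keeps the adult model's output.
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    return [
        (1.0 - rho) * (1.0 if k == label else 0.0) + rho * p
        for k, p in enumerate(adult_posteriors)
    ]


def select_utterances(utterances: Sequence[Tuple[str, float]],
                      threshold: float) -> List[str]:
    """Keep the ids of utterances whose confidence reaches the threshold.

    The thresholding rule is an assumption; the abstract only says that
    confidence estimates are used to judge transcription quality.
    """
    return [utt_id for utt_id, confidence in utterances
            if confidence >= threshold]


def relative_wer_reduction(baseline: float, adapted: float) -> float:
    """Relative WER reduction, in percent of the baseline WER."""
    return 100.0 * (baseline - adapted) / baseline


if __name__ == "__main__":
    # Toy values, not data from the paper.
    print(interpolated_target(1, [0.7, 0.2, 0.1], rho=0.5))
    print(select_utterances([("utt1", 0.92), ("utt2", 0.41), ("utt3", 0.78)], 0.75))
    # WER figures reported in the abstract: 14.2% -> 10.6%.
    print(round(relative_wer_reduction(14.2, 10.6), 1))
```

With the WER figures reported in the abstract, the last call corresponds to a relative reduction of about 25.4%.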