{"title":"Noisy Student Teacher Training with Self Supervised Learning for Children ASR","authors":"Shreya S. Chaturvedi, Hardik B. Sailor, H. Patil","doi":"10.1109/SPCOM55316.2022.9840763","DOIUrl":null,"url":null,"abstract":"Automatic Speech Recognition (ASR) is a fast-growing field, where reliable systems are made for high resource languages and for adult’s speech. However, performance of such ASR system is inefficient for children speech, due to numerous acoustic variability in children speech and scarcity of resources. In this paper, we propose to use the unlabeled data extensively to develop ASR system for low resourced children speech. State-of-the-art wav2vec 2.0 is the baseline ASR technique used here. The baseline’s performance is further enhanced with the intuition of Noisy Student Teacher (NST) learning. The proposed technique is not only limited to introducing the use of soft labels (i.e., word-level transcription) of unlabeled data, but also adapts the learning of teacher model or preceding student model, which results in reduction of the redundant training significantly. To that effect, a detailed analysis is reported in this paper, as there is a difference in teacher and student learning. In ASR experiments, character-level tokenization was used and hence, Connectionist Temporal Classification (CTC) loss was used for fine-tuning. Due to computational limitations, experiments are performed with approximately 12 hours of training, and 5 hours of development and test data was used from standard My Science Tutor (MyST) corpus. The baseline wav2vec 2.0 achieves 34% WER, while relatively 10% of performance was improved using the proposed approach. Further, the analysis of performance loss and effect of language model is discussed in details.","PeriodicalId":246982,"journal":{"name":"2022 IEEE International Conference on Signal Processing and Communications (SPCOM)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE International Conference on Signal Processing and Communications (SPCOM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SPCOM55316.2022.9840763","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Automatic Speech Recognition (ASR) is a fast-growing field in which reliable systems are built for high-resource languages and for adults' speech. However, the performance of such ASR systems degrades on children's speech, owing to the large acoustic variability of children's speech and the scarcity of resources. In this paper, we propose to use unlabeled data extensively to develop an ASR system for low-resource children's speech. The state-of-the-art wav2vec 2.0 is used as the baseline ASR technique. The baseline's performance is further enhanced with the intuition of Noisy Student Teacher (NST) learning. The proposed technique is not limited to introducing soft labels (i.e., word-level transcriptions) for the unlabeled data; it also adapts the learning of the teacher model (or the preceding student model), which significantly reduces redundant training. To that end, a detailed analysis is reported in this paper, since teacher and student learning differ. In the ASR experiments, character-level tokenization was used, and hence the Connectionist Temporal Classification (CTC) loss was used for fine-tuning. Due to computational limitations, experiments were performed with approximately 12 hours of training data and 5 hours of development and test data, drawn from the standard My Science Tutor (MyST) corpus. The baseline wav2vec 2.0 achieves 34% WER, which the proposed approach improves by a relative 10%. Further, the analysis of performance losses and the effect of a language model are discussed in detail.
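The abstract describes two mechanisms: CTC fine-tuning of wav2vec 2.0 on character-level tokens, and an NST loop in which a teacher transcribes unlabeled audio into pseudo-labels for a noised student initialized from the teacher's weights. The sketch below illustrates that loop using the HuggingFace transformers implementation of wav2vec 2.0; the checkpoint name, SpecAugment settings, and single-utterance training step are illustrative assumptions, since the paper's exact models, noising, and data pipeline are not given in the abstract.

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Assumed checkpoint; the paper's actual pre-trained model is not named in the abstract.
CKPT = "facebook/wav2vec2-base-960h"

processor = Wav2Vec2Processor.from_pretrained(CKPT)
teacher = Wav2Vec2ForCTC.from_pretrained(CKPT).eval()

@torch.no_grad()
def pseudo_label(waveform: torch.Tensor, sr: int = 16000) -> str:
    """Teacher transcribes an unlabeled utterance into a word-level pseudo-label."""
    inputs = processor(waveform, sampling_rate=sr, return_tensors="pt")
    logits = teacher(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)  # greedy CTC decoding
    return processor.batch_decode(ids)[0]

# Student starts from the teacher's weights ("adapts the learning of the
# teacher / preceding student"), with SpecAugment as the input noise.
student = Wav2Vec2ForCTC.from_pretrained(
    CKPT, apply_spec_augment=True, mask_time_prob=0.05
).train()
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

def student_step(waveform: torch.Tensor, transcript: str, sr: int = 16000) -> float:
    """One fine-tuning step on a (pseudo-)labeled utterance; CTC loss over characters."""
    inputs = processor(waveform, sampling_rate=sr, return_tensors="pt")
    labels = processor(text=transcript, return_tensors="pt").input_ids
    out = student(inputs.input_values, labels=labels)  # .loss is the CTC loss
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

# NST outer loop: pseudo-label the unlabeled pool, train the student, promote it.
# `unlabeled_pool` / `labeled_pairs` are placeholders for the MyST audio splits.
# for wav in unlabeled_pool:
#     labeled_pairs.append((wav, pseudo_label(wav)))
# for wav, text in labeled_pairs:
#     student_step(wav, text)
# teacher = student.eval()  # next NST generation
```

A full NST recipe would typically mix labeled and pseudo-labeled data in each student generation and may filter low-confidence pseudo-labels; the abstract does not specify those details, so they are omitted here.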