Improving the Performance of Zero-Resource Children’s ASR System through Formant and Duration Modification based Data Augmentation

2022 IEEE International Conference on Signal Processing and Communications (SPCOM), Pub Date: 2022-07-11, DOI: 10.1109/SPCOM55316.2022.9840767

S. Shahnawazuddin, Vinit Kumar, Avinash Kumar, Waquar Ahmad

Abstract

Developing an automatic speech recognition (ASR) system for children’s speech is extremely challenging because child-domain data are unavailable for the majority of languages. Consequently, in such zero-resource scenarios, an ASR system trained on adults’ speech must be used to transcribe data from child speakers. However, differences in formant frequencies and speaking rate between the two groups of speakers degrade recognition performance. To reduce this mismatch, out-of-domain data augmentation approaches based on formant and duration modification are proposed in this work. First, the formant frequencies of the adults’ speech training data are up-scaled by warping the linear predictive coding coefficients. Next, the speaking rate of the adults’ data is increased through time-scale modification. Altering both the formant frequencies and the duration of adults’ speech, and then pooling the modified data into training, reduces the acoustic mismatch caused by these two factors. This, in turn, enhances recognition performance significantly. Additional improvement is obtained by combining the recently reported voice-conversion-based data augmentation technique with the proposed ones: the combination yields a relative reduction of nearly 32.3% in word error rate over the baseline.
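
The abstract reports its result as a relative reduction in word error rate (WER). As a reading aid, the following is a minimal sketch of how that metric is commonly computed, assuming the standard word-level edit-distance definition of WER. The function names and the example numbers are hypothetical illustrations and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Relative reduction in percent: 100 * (baseline - improved) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - improved_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical transcripts: one substitution (sat -> sit) and one deletion (the).
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # Hypothetical WER values in percent, chosen only to illustrate the formula.
    print(relative_wer_reduction(20.0, 15.0))
```

Under this definition, a relative reduction is measured against the baseline WER rather than as an absolute difference in percentage points, which is how the figure quoted in the abstract should be read.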
