自动语音识别的性别域自适应

2021 IEEE 19th World Symposium on Applied Machine Intelligence and Informatics (SAMI) Pub Date : 2021-01-21 DOI:10.1109/SAMI50585.2021.9378626

Artem Sokolov, A. Savchenko

{"title":"自动语音识别的性别域自适应","authors":"Artem Sokolov, A. Savchenko","doi":"10.1109/SAMI50585.2021.9378626","DOIUrl":null,"url":null,"abstract":"This paper is focused on the finetuning of acoustic models for speaker adaptation goals on a given gender. We pretrained the Transformer baseline model on Librispeech-960 and conducted experiments with finetuning on the gender-specific test subsets. The obtained word error rate (WER) relatively to the baseline is up to 5% and 3% lower on male and female subsets, respectively, if the layers in the encoder and decoder are not frozen, and the tuning is started from the last checkpoints. Moreover, we adapted our base model on the complete L2 Arctic dataset of accented speech and finetuned it for particular speakers and male and female genders separately. The models trained on the gender subsets obtained 1–2% lower WER when compared to the model tuned on the whole L2 Arctic dataset. Finally, it was experimentally confirmed that the concatenation of the pretrained voice embeddings (x-vector) and embeddings from a conventional encoder cannot significantly improve the speech recognition accuracy.","PeriodicalId":402414,"journal":{"name":"2021 IEEE 19th World Symposium on Applied Machine Intelligence and Informatics (SAMI)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Gender domain adaptation for automatic speech recognition\",\"authors\":\"Artem Sokolov, A. Savchenko\",\"doi\":\"10.1109/SAMI50585.2021.9378626\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper is focused on the finetuning of acoustic models for speaker adaptation goals on a given gender. We pretrained the Transformer baseline model on Librispeech-960 and conducted experiments with finetuning on the gender-specific test subsets. The obtained word error rate (WER) relatively to the baseline is up to 5% and 3% lower on male and female subsets, respectively, if the layers in the encoder and decoder are not frozen, and the tuning is started from the last checkpoints. Moreover, we adapted our base model on the complete L2 Arctic dataset of accented speech and finetuned it for particular speakers and male and female genders separately. The models trained on the gender subsets obtained 1–2% lower WER when compared to the model tuned on the whole L2 Arctic dataset. Finally, it was experimentally confirmed that the concatenation of the pretrained voice embeddings (x-vector) and embeddings from a conventional encoder cannot significantly improve the speech recognition accuracy.\",\"PeriodicalId\":402414,\"journal\":{\"name\":\"2021 IEEE 19th World Symposium on Applied Machine Intelligence and Informatics (SAMI)\",\"volume\":\"9 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-01-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 IEEE 19th World Symposium on Applied Machine Intelligence and Informatics (SAMI)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SAMI50585.2021.9378626\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE 19th World Symposium on Applied Machine Intelligence and Informatics (SAMI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SAMI50585.2021.9378626","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

本文主要研究针对特定性别的说话人自适应目标的声学模型的微调。我们在librisspeech -960上对Transformer基线模型进行了预训练，并对特定性别的测试子集进行了微调实验。如果编码器和解码器中的层没有冻结，并且从最后一个检查点开始调优，则相对于基线获得的单词错误率(WER)在男性和女性子集上分别降低5%和3%。此外，我们在完整的L2北极重音语音数据集上调整了我们的基本模型，并分别针对特定的说话者和男性和女性性别进行了微调。与在整个L2北极数据集上调整的模型相比，在性别子集上训练的模型获得的WER降低了1-2%。最后，通过实验证明，将预训练好的语音嵌入(x向量)与传统编码器的嵌入进行拼接并不能显著提高语音识别的准确率。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Gender domain adaptation for automatic speech recognition

This paper is focused on the finetuning of acoustic models for speaker adaptation goals on a given gender. We pretrained the Transformer baseline model on Librispeech-960 and conducted experiments with finetuning on the gender-specific test subsets. The obtained word error rate (WER) relatively to the baseline is up to 5% and 3% lower on male and female subsets, respectively, if the layers in the encoder and decoder are not frozen, and the tuning is started from the last checkpoints. Moreover, we adapted our base model on the complete L2 Arctic dataset of accented speech and finetuned it for particular speakers and male and female genders separately. The models trained on the gender subsets obtained 1–2% lower WER when compared to the model tuned on the whole L2 Arctic dataset. Finally, it was experimentally confirmed that the concatenation of the pretrained voice embeddings (x-vector) and embeddings from a conventional encoder cannot significantly improve the speech recognition accuracy.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2021 IEEE 19th World Symposium on Applied Machine Intelligence and Informatics (SAMI)

自引率

0.00%

发文量