Hyperparameter experiments on end-to-end automatic speech recognition*

Hyungwon Yang, Hosung Nam
{"title":"端到端自动语音识别的超参数实验*","authors":"Hyungwon Yang, Hosung Nam","doi":"10.13064/KSSS.2021.13.1.045","DOIUrl":null,"url":null,"abstract":"End-to-end (E2E) automatic speech recognition (ASR) has achieved promising performance gains with the introduced self-attention network, Transformer. However, due to training time and the number of hyperparameters, finding the optimal hyperparameter set is computationally expensive. This paper investigates the impact of hyperparameters in the Transformer network to answer two questions: which hyperparameter plays a critical role in the task performance and training speed. The Transformer network for training has two encoder and decoder networks combined with Connectionist Temporal Classification (CTC). We have trained the model with Wall Street Journal (WSJ) SI-284 and tested on devl93 and eval92. Seventeen hyperparameters were selected from the ESPnet training configuration, and varying ranges of values were used for experiments. The result shows that “num blocks” and “linear units” hyperparameters in the encoder and decoder networks reduce Word Error Rate (WER) significantly. However, performance gain is more prominent when they are altered in the encoder network. Training duration also linearly increased as “num blocks” and “linear units” hyperparameters’ values grow. Based on the experimental results, we collected the optimal values from each hyperparameter and reduced the WER up to 2.9/1.9 from dev93 and eval93 respectively. and 2.6/2.5 respectively, but 3.4/3.5, and 0.8/0.6 in the decoder network. A “dropout rate” hyperparameter in the decoder network does not act like the one in the encoder network, but it reaches the lowest WER at the value 0.1 and maintains high WER at the other values. Meaningful result is not found in “attention heads” and “self attention dropout rate”.","PeriodicalId":255285,"journal":{"name":"Phonetics and Speech Sciences","volume":"9 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Hyperparameter experiments on end-to-end automatic speech\\n recognition*\",\"authors\":\"Hyungwon Yang, Hosung Nam\",\"doi\":\"10.13064/KSSS.2021.13.1.045\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"End-to-end (E2E) automatic speech recognition (ASR) has achieved promising performance gains with the introduced self-attention network, Transformer. However, due to training time and the number of hyperparameters, finding the optimal hyperparameter set is computationally expensive. This paper investigates the impact of hyperparameters in the Transformer network to answer two questions: which hyperparameter plays a critical role in the task performance and training speed. The Transformer network for training has two encoder and decoder networks combined with Connectionist Temporal Classification (CTC). We have trained the model with Wall Street Journal (WSJ) SI-284 and tested on devl93 and eval92. Seventeen hyperparameters were selected from the ESPnet training configuration, and varying ranges of values were used for experiments. The result shows that “num blocks” and “linear units” hyperparameters in the encoder and decoder networks reduce Word Error Rate (WER) significantly. However, performance gain is more prominent when they are altered in the encoder network. Training duration also linearly increased as “num blocks” and “linear units” hyperparameters’ values grow. 
Based on the experimental results, we collected the optimal values from each hyperparameter and reduced the WER up to 2.9/1.9 from dev93 and eval93 respectively. and 2.6/2.5 respectively, but 3.4/3.5, and 0.8/0.6 in the decoder network. A “dropout rate” hyperparameter in the decoder network does not act like the one in the encoder network, but it reaches the lowest WER at the value 0.1 and maintains high WER at the other values. Meaningful result is not found in “attention heads” and “self attention dropout rate”.\",\"PeriodicalId\":255285,\"journal\":{\"name\":\"Phonetics and Speech Sciences\",\"volume\":\"9 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-03-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Phonetics and Speech Sciences\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.13064/KSSS.2021.13.1.045\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Phonetics and Speech Sciences","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.13064/KSSS.2021.13.1.045","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

End-to-end (E2E) automatic speech recognition (ASR) has achieved promising performance gains with the introduction of the self-attention network, Transformer. However, because of the training time and the number of hyperparameters involved, finding the optimal hyperparameter set is computationally expensive. This paper investigates the impact of hyperparameters in the Transformer network to answer two questions: which hyperparameters play a critical role in task performance, and which in training speed. The Transformer network used for training consists of encoder and decoder networks combined with Connectionist Temporal Classification (CTC). We trained the model on Wall Street Journal (WSJ) SI-284 and tested it on dev93 and eval92. Seventeen hyperparameters were selected from the ESPnet training configuration, and varying ranges of values were used in the experiments. The results show that the “num blocks” and “linear units” hyperparameters in the encoder and decoder networks reduce the Word Error Rate (WER) significantly. However, the performance gain is more prominent when they are altered in the encoder network. Training duration also increased linearly as the values of the “num blocks” and “linear units” hyperparameters grew. Based on the experimental results, we collected the optimal value of each hyperparameter and reduced the WER by up to 2.9/1.9 on dev93 and eval92, respectively, and 2.6/2.5, respectively, but by 3.4/3.5 and 0.8/0.6 in the decoder network. The “dropout rate” hyperparameter in the decoder network does not behave like the one in the encoder network: it reaches the lowest WER at the value 0.1 and maintains a high WER at the other values. No meaningful result was found for “attention heads” or “self attention dropout rate”.
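
The hyperparameters named above correspond to fields of the ESPnet Transformer training configuration. The following is a minimal sketch of where they live, assuming ESPnet2-style key names (encoder_conf / decoder_conf); the values shown are illustrative defaults, not the optimal values reported in the paper, and the sweep helper is only a toy stand-in for the one-hyperparameter-at-a-time experiments described in the abstract.

import copy

# ESPnet2-style Transformer training configuration (illustrative values only;
# key names assumed from ESPnet2 conventions, not taken from the paper).
config = {
    "encoder_conf": {
        "num_blocks": 12,              # "num blocks": number of self-attention blocks
        "linear_units": 2048,          # "linear units": feed-forward layer width
        "attention_heads": 4,          # "attention heads"
        "dropout_rate": 0.1,           # "dropout rate"
        "attention_dropout_rate": 0.0,
    },
    "decoder_conf": {
        "num_blocks": 6,
        "linear_units": 2048,
        "attention_heads": 4,
        "dropout_rate": 0.1,
        "self_attention_dropout_rate": 0.0,   # "self attention dropout rate"
    },
    "model_conf": {
        "ctc_weight": 0.3,             # weight of the CTC loss in hybrid CTC/attention training
    },
}

def sweep(base, section, key, values):
    """Yield copies of the base config with one hyperparameter varied at a time."""
    for v in values:
        cfg = copy.deepcopy(base)
        cfg[section][key] = v
        yield cfg

# Example: vary the number of encoder blocks while holding everything else fixed.
for cfg in sweep(config, "encoder_conf", "num_blocks", [6, 12, 18]):
    print("encoder num_blocks =", cfg["encoder_conf"]["num_blocks"])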
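
WER, the metric reported above, is the word-level Levenshtein distance between the reference and hypothesis transcripts normalized by the reference length, WER = (S + D + I) / N, where S, D, and I are substitutions, deletions, and insertions. A self-contained sketch:

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via dynamic-programming edit distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One deleted word out of six reference words gives a WER of about 16.7%.
print(f"{100 * wer('the cat sat on the mat', 'the cat sat on mat'):.1f}%")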