{"title":"Improving Non-Autoregressive Translation Quality With Pretrained Language Model, Embedding Distillation and Upsampling Strategy for CTC","authors":"Shen-sian Syu;Juncheng Xie;Hung-yi Lee","doi":"10.1109/TASLP.2024.3451977","DOIUrl":null,"url":null,"abstract":"Non-autoregressive approaches, especially those that generate output in a one-pass forward manner, have shown great potential in improving the inference speed of translation models. However, these approaches often suffer from a significant drop in translation quality compared to autoregressive models (AT). To tackle this challenge, this paper introduces a series of innovative techniques to enhance the translation quality of non-autoregressive neural machine translation (NAT) models while still maintaining a substantial acceleration in inference speed. Specifically, we propose a method called CTCPMLM, which involves fine-tuning Pretrained Multilingual Language Models (PMLMs) with the Connectionist Temporal Classification (CTC) loss to effectively train NAT models. Additionally, we adopt the MASK insertion scheme instead of token duplication for up-sampling and present an embedding distillation method to further enhance the performance of NAT models. In our experiments, CTCPMLM surpasses the performance of the baseline autoregressive model (Transformer \n<italic>base</i>\n) on various datasets, including WMT'14 DE \n<inline-formula><tex-math>$\\leftrightarrow$</tex-math></inline-formula>\n EN, WMT'16 RO \n<inline-formula><tex-math>$\\leftrightarrow$</tex-math></inline-formula>\n EN, and IWSLT'14 DE \n<inline-formula><tex-math>$\\leftrightarrow$</tex-math></inline-formula>\n EN. Moreover, CTCPMLM represents the current state-of-the-art among NAT models. Notably, our model achieves superior results compared to the baseline autoregressive model on the IWSLT'14 En \n<inline-formula><tex-math>$\\leftrightarrow$</tex-math></inline-formula>\n De and WMT'16 En \n<inline-formula><tex-math>$\\leftrightarrow$</tex-math></inline-formula>\n Ro datasets, even without using distillation data during training. Particularly, on the IWSLT'14 DE \n<inline-formula><tex-math>$\\rightarrow$</tex-math></inline-formula>\n EN dataset, our model achieves an impressive BLEU score of 39.93, surpassing AT models and establishing a new state-of-the-art. Additionally, our model exhibits a remarkable speed improvement of 16.35 times compared to the autoregressive model.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4121-4133"},"PeriodicalIF":4.1000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10679261/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ACOUSTICS","Score":null,"Total":0}
Citations: 0
Abstract
Non-autoregressive approaches, especially those that generate output in a single forward pass, have shown great potential for improving the inference speed of translation models. However, these approaches often suffer a significant drop in translation quality compared to autoregressive (AT) models. To tackle this challenge, this paper introduces a series of techniques that enhance the translation quality of non-autoregressive neural machine translation (NAT) models while maintaining a substantial acceleration in inference speed. Specifically, we propose CTCPMLM, a method that fine-tunes Pretrained Multilingual Language Models (PMLMs) with the Connectionist Temporal Classification (CTC) loss to effectively train NAT models. Additionally, we adopt a MASK insertion scheme instead of token duplication for up-sampling and present an embedding distillation method to further improve NAT performance. In our experiments, CTCPMLM surpasses the baseline autoregressive model (Transformer base) on various datasets, including WMT'14 DE $\leftrightarrow$ EN, WMT'16 RO $\leftrightarrow$ EN, and IWSLT'14 DE $\leftrightarrow$ EN, and represents the current state of the art among NAT models. Notably, our model outperforms the baseline autoregressive model on the IWSLT'14 DE $\leftrightarrow$ EN and WMT'16 RO $\leftrightarrow$ EN datasets even without using distilled data during training. In particular, on the IWSLT'14 DE $\rightarrow$ EN dataset our model achieves a BLEU score of 39.93, surpassing AT models and establishing a new state of the art, while providing a 16.35$\times$ speedup over the autoregressive model.
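To make the training recipe concrete, the following is a minimal sketch of the idea the abstract describes: up-sample the source by inserting mask tokens (rather than duplicating tokens), encode the up-sampled sequence with a pretrained multilingual LM in a single forward pass, and align the per-position predictions to the target with the CTC loss. The choice of XLM-R as the PMLM, the `mask_insert_upsample` helper, the up-sampling ratio, and the reuse of the pad token as the CTC blank are illustrative assumptions, not the authors' exact implementation.

```python
# Hedged sketch of PMLM + CTC training for a one-pass NAT model (PyTorch / Hugging Face).
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")   # assumed PMLM
encoder = AutoModel.from_pretrained("xlm-roberta-base")
proj = torch.nn.Linear(encoder.config.hidden_size, len(tokenizer))  # per-position token logits


def mask_insert_upsample(src_ids: torch.Tensor, ratio: int = 2) -> torch.Tensor:
    """Up-sample by inserting <mask> tokens after each source token instead of
    duplicating tokens (assumed reading of the MASK insertion scheme)."""
    mask_id = tokenizer.mask_token_id
    pieces = []
    for tok in src_ids.tolist():
        pieces.append(tok)
        pieces.extend([mask_id] * (ratio - 1))
    return torch.tensor(pieces).unsqueeze(0)  # shape: (1, src_len * ratio)


def ctc_nat_loss(src_ids: torch.Tensor, tgt_ids: torch.Tensor) -> torch.Tensor:
    """Single forward pass over the up-sampled source, aligned to the target via CTC."""
    upsampled = mask_insert_upsample(src_ids)
    hidden = encoder(input_ids=upsampled).last_hidden_state             # (1, T, H)
    log_probs = F.log_softmax(proj(hidden), dim=-1).transpose(0, 1)     # (T, 1, V) for ctc_loss
    input_lengths = torch.tensor([log_probs.size(0)])
    target_lengths = torch.tensor([tgt_ids.size(0)])
    # Blank symbol: the pad token id is reused here for illustration; the paper may
    # instead reserve a dedicated blank token.
    return F.ctc_loss(log_probs, tgt_ids.unsqueeze(0), input_lengths, target_lengths,
                      blank=tokenizer.pad_token_id, zero_infinity=True)
```

The embedding distillation component described in the abstract is not shown; it would add a further term encouraging the NAT model's output embeddings to match those of a teacher model.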
Journal Introduction:
The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.