Improving Generalization of Pre-trained Language Models via Stochastic Weight Averaging

Proceedings of the Conference on Empirical Methods in Natural Language Processing. Pub Date: 2022-12-12. DOI: 10.48550/arXiv.2212.05956

Peng Lu, I. Kobyzev, Mehdi Rezagholizadeh, Ahmad Rashid, A. Ghodsi, P. Langlais

Volume 158 1, pp. 4948-4954

Citations: 5

Abstract

Knowledge Distillation (KD) is a commonly used technique for improving the generalization of compact Pre-trained Language Models (PLMs) on downstream tasks. However, such methods impose the additional burden of training a separate teacher model for every new dataset. Alternatively, one may directly improve the optimization procedure of the compact model so that it generalizes better. Recent work observes that the flatness of a local minimum correlates well with better generalization. In this work, we adapt Stochastic Weight Averaging (SWA), a method that encourages convergence to a flatter minimum, to fine-tuning PLMs. We conduct extensive experiments on various NLP tasks (text classification, question answering, and generation) and on different model architectures, and demonstrate that our adaptation improves generalization without extra computational cost. Moreover, we observe that this simple optimization technique is able to outperform state-of-the-art KD methods for compact models.
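The abstract does not spell out how the averaging is performed, so the following is only a minimal sketch of generic stochastic weight averaging, not the authors' adaptation for PLM fine-tuning. It assumes plain SGD on a toy noisy quadratic and keeps an equal-weight running mean of parameter snapshots taken late in training. The names `sgd_with_swa`, `running_average`, `noisy_quadratic_grad`, `swa_start`, and `swa_every` are hypothetical and chosen for illustration.

```python
import random


def running_average(avg, new, n_averaged):
    """Fold snapshot `new` into `avg`, the equal-weight mean of `n_averaged` earlier snapshots."""
    return [a + (w - a) / (n_averaged + 1) for a, w in zip(avg, new)]


def sgd_with_swa(grad_fn, w0, lr=0.1, steps=200, swa_start=100, swa_every=10, seed=0):
    """Run SGD and, from step `swa_start` on, average a weight snapshot every `swa_every` steps."""
    rng = random.Random(seed)
    w = list(w0)
    swa_w, n_averaged = None, 0
    for step in range(1, steps + 1):
        g = grad_fn(w, rng)
        w = [wi - lr * gi for wi, gi in zip(w, g)]  # ordinary SGD update
        if step >= swa_start and (step - swa_start) % swa_every == 0:
            if swa_w is None:
                # The first snapshot initializes the average.
                swa_w, n_averaged = list(w), 1
            else:
                swa_w = running_average(swa_w, w, n_averaged)
                n_averaged += 1
    return w, swa_w, n_averaged


def noisy_quadratic_grad(w, rng):
    """Gradient of 0.5 * ||w||^2 plus Gaussian noise (a toy stand-in for minibatch noise)."""
    return [wi + rng.gauss(0.0, 1.0) for wi in w]


final_w, swa_w, n_snapshots = sgd_with_swa(noisy_quadratic_grad, [5.0, -3.0])
print("last SGD iterate:", final_w)
print("averaged weights:", swa_w, "from", n_snapshots, "snapshots")
```

Because averaging only touches weights that the optimizer already produced, the sketch adds no extra gradient computations. This is in line with the "without extra computation cost" claim in the abstract, although the schedule and learning-rate choices used in the paper may differ from the ones assumed here.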


