AdaDS: Adaptive data selection for accelerating pre-trained language model knowledge distillation

Qinhong Zhou, Peng Li, Yang Liu, Yuyang Guan, Qizhou Xing, Ming Chen, Maosong Sun, Yang Liu
{"title":"AdaDS: Adaptive data selection for accelerating pre-trained language model knowledge distillation","authors":"Qinhong Zhou ,&nbsp;Peng Li ,&nbsp;Yang Liu ,&nbsp;Yuyang Guan ,&nbsp;Qizhou Xing ,&nbsp;Ming Chen ,&nbsp;Maosong Sun ,&nbsp;Yang Liu","doi":"10.1016/j.aiopen.2023.08.005","DOIUrl":null,"url":null,"abstract":"<div><p>Knowledge distillation (KD) is a widely used method for transferring knowledge from large teacher models to computationally efficient student models. Unfortunately, the computational cost of KD becomes unaffordable as pre-trained language models (PLMs) grow larger. Computing KD loss on only part of the training set is a promising way to accelerate KD. However, existing works heuristically leverage only one static data selection strategy during the KD process, demonstrating inconsistent improvements across different distillation scenarios. In this work, we conduct a thorough study on various typical data selection strategies for KD, and show that this problem is due to the fact that the best data selection strategy is specific to various factors, including task, selected data size, and training stage. To automatically adapt to these factors, we propose a framework named AdaDS to learn to choose the data selection strategy adaptively during the KD process. Experimental results show that our proposed method is effective for various tasks and selected data sizes under both fine-tuning and pre-training stages, achieving comparable performance to DistilBERT with only 10% amount of queries to the teacher model.</p></div>","PeriodicalId":100068,"journal":{"name":"AI Open","volume":"4 ","pages":"Pages 56-63"},"PeriodicalIF":0.0000,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"AI Open","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2666651023000074","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Knowledge distillation (KD) is a widely used method for transferring knowledge from large teacher models to computationally efficient student models. Unfortunately, the computational cost of KD becomes unaffordable as pre-trained language models (PLMs) grow larger. Computing the KD loss on only part of the training set is a promising way to accelerate KD. However, existing works heuristically apply a single static data selection strategy throughout the KD process and show inconsistent improvements across distillation scenarios. In this work, we conduct a thorough study of typical data selection strategies for KD and show that this inconsistency arises because the best strategy depends on several factors, including the task, the amount of data selected, and the training stage. To adapt to these factors automatically, we propose AdaDS, a framework that learns to choose the data selection strategy adaptively during the KD process. Experimental results show that the proposed method is effective across tasks and selected data sizes in both the fine-tuning and pre-training stages, achieving performance comparable to DistilBERT with only 10% of the queries to the teacher model.
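
The abstract does not detail how AdaDS chooses among candidate strategies. The sketch below illustrates one plausible reading, assuming an epsilon-greedy bandit over a few student-side selection strategies, with the KD loss computed only on the selected subset; all function names, strategies, and the reward signal here are illustrative assumptions, not the paper's actual algorithm.

```python
# Minimal sketch of KD with adaptive data selection, assuming an epsilon-greedy
# bandit over candidate strategies (NOT necessarily the paper's exact method).
import random
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=2.0):
    """Soft-label KD loss: temperature-scaled KL(teacher || student)."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

# Candidate selection strategies; each scores a batch using only the student's
# logits and returns the indices of the k examples to distill on.
def select_random(logits, k):
    return torch.randperm(logits.size(0))[:k]

def select_high_entropy(logits, k):
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)
    return entropy.topk(k).indices

def select_low_margin(logits, k):
    top2 = logits.topk(2, dim=-1).values
    margin = top2[:, 0] - top2[:, 1]
    return (-margin).topk(k).indices

STRATEGIES = [select_random, select_high_entropy, select_low_margin]

class EpsilonGreedyArmSelector:
    """Running-average bandit: each arm is one data selection strategy."""
    def __init__(self, n_arms, epsilon=0.1):
        self.values, self.counts, self.epsilon = [0.0] * n_arms, [0] * n_arms, epsilon

    def choose(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.values))
        return max(range(len(self.values)), key=lambda a: self.values[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

# One illustrative step with toy tensors standing in for real model outputs.
selector = EpsilonGreedyArmSelector(n_arms=len(STRATEGIES))
student_logits = torch.randn(32, 2, requires_grad=True)   # stand-in for student(batch)

arm = selector.choose()
idx = STRATEGIES[arm](student_logits.detach(), k=8)        # keep 8 of 32 examples
teacher_logits = torch.randn(len(idx), 2)                  # stand-in for teacher(batch[idx]);
                                                           # the teacher is queried only on the subset
loss = kd_loss(student_logits[idx], teacher_logits)
loss.backward()                                            # optimizer.step() in a real loop
selector.update(arm, reward=-loss.item())                  # lower KD loss -> higher reward
```

Scoring candidates with the student's own logits keeps the number of teacher queries equal to the size of the selected subset, which is one plausible way to realize the query budget mentioned in the abstract.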
