Ensemble Compressed Language Model Based on Knowledge Distillation and Multi-Task Learning

Kun Xiang, Akihiro Fujii
2022 7th International Conference on Business and Industrial Research (ICBIR)
Published: 2022-05-19
DOI: 10.1109/ICBIR54589.2022.9786508 (https://doi.org/10.1109/ICBIR54589.2022.9786508)

Abstract

Pre-trained language representation models such as BERT owe much of their success to being “overparameterized”, but this same property makes them slow to train, computationally expensive, and demanding of hardware. Among the many model compression and acceleration techniques, Knowledge Distillation (KD) has attracted extensive attention for compressing pre-trained language models. KD nevertheless faces two major challenges: (i) transferring more knowledge from the teacher model to the student model, so that acceleration does not sacrifice accuracy; (ii) the faster training of a lightweight model carries a risk of overfitting due to noise. To address these problems, we propose a novel model based on knowledge distillation, called Theseus-BERT Guided Distill CNN (TBG-disCNN). BERT-of-Theseus [1] is employed as the teacher model and a CNN as the student model. To address the inherent noise problem, we propose a coordinated CNN-BiLSTM as a parameter-sharing layer for Multi-Task Learning (MTL), in order to capture both regional and long-term dependency information. Our approach performs approximately as well as BERT-base and the teacher model, with $12 \times$ and $281 \times$ faster inference and $19.58 \times$ and $8.94 \times$ fewer parameters, respectively.
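
The abstract does not state the distillation objective used to train the student. Purely as an illustration of the general teacher-to-student transfer it describes, the sketch below uses the common soft-target formulation: a temperature-softened KL divergence between teacher and student distributions, mixed with cross-entropy on the true label. The function names and the `temperature` and `alpha` values are assumptions made for this example and are not taken from the paper.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Temperature-scaled log-softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_sum for z in scaled]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE(student, label).

    temperature and alpha are illustrative defaults, not values from the paper.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    # KL divergence between the softened teacher and student distributions
    kl = sum(math.exp(lt) * (lt - ls)
             for lt, ls in zip(log_p_teacher, log_p_student))
    # Standard cross-entropy of the student (temperature 1) on the hard label
    ce = -log_softmax(student_logits)[label]
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce
```

When the student reproduces the teacher's logits exactly, the KL term vanishes and only the weighted cross-entropy remains. The `temperature ** 2` factor is the usual rescaling that keeps the soft-target gradients comparable in magnitude across temperatures.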