Ensemble Compressed Language Model Based on Knowledge Distillation and Multi-Task Learning

2022 7th International Conference on Business and Industrial Research (ICBIR) Pub Date : 2022-05-19 DOI:10.1109/ICBIR54589.2022.9786508

Kun Xiang, Akihiro Fujii


Citations: 0

Abstract

The success of pre-trained language representation models such as BERT stems from their “overparameterized” nature, which also brings long training times, high computational complexity, and demanding hardware requirements. Among the many model compression and acceleration techniques, Knowledge Distillation (KD) has attracted extensive attention for compressing pre-trained language models. However, KD faces two major challenges: (i) transferring more knowledge from the teacher model to the student model, so that acceleration does not sacrifice accuracy; and (ii) the faster training of a lightweight model comes with a risk of overfitting caused by noise. To address these problems, we propose a novel model based on knowledge distillation, called Theseus-BERT Guided Distill CNN (TBG-disCNN). BERT-of-Theseus [1] is employed as the teacher model and a CNN as the student model. To address the inherent noise problem, we propose a coordinated CNN-BiLSTM as a parameter-sharing layer for Multi-Task Learning (MTL), in order to capture both regional and long-term dependency information. Our approach achieves performance approximately as good as that of BERT-base and the teacher model, with $12 \times$ and $281 \times$ inference speedup and $19.58 \times$ and $8.94 \times$ fewer parameters, respectively.
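The abstract does not state the distillation objective, so the following is only a minimal sketch of the widely used soft-target formulation of knowledge distillation, not the TBG-disCNN loss itself. The function names, the temperature, and the weighting factor `alpha` are illustrative assumptions. The student is trained to match the teacher's temperature-softened output distribution while still fitting the ground-truth label.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_norm for s in scaled]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Generic soft-target KD loss (an assumption, not the paper's exact loss).

    soft term: KL(teacher || student) on temperature-softened distributions,
               scaled by temperature**2 as is conventional.
    hard term: cross-entropy of the student against the true label.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    soft = sum(math.exp(lt) * (lt - ls)
               for lt, ls in zip(log_p_teacher, log_p_student))
    hard = -log_softmax(student_logits)[label]
    return alpha * temperature ** 2 * soft + (1.0 - alpha) * hard


# Hypothetical logits for a 3-class task.
teacher = [4.0, 1.0, -2.0]
close_student = [3.5, 1.2, -1.5]
far_student = [-1.0, 3.0, 0.5]

print(distillation_loss(close_student, teacher, label=0))
print(distillation_loss(far_student, teacher, label=0))
```

A student whose logits are close to the teacher's receives a smaller loss than one that disagrees with it, which is the signal that lets a compact student such as a CNN inherit behavior from a larger teacher.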


