{"title":"一种用于文本分类的自适应知识蒸馏算法","authors":"Zuqin Chen, Tingkai Hu, Chao Chen, Jike Ge, Chengzhi Wu, Wenjun Cheng","doi":"10.1109/ICESIT53460.2021.9696948","DOIUrl":null,"url":null,"abstract":"Using knowledge distillation to compress pre-trained models such as Bert has proven to be highly effective in text classification tasks. However, the overhead of tuning parameters manually still hinders their application in practice. To alleviate the cost of manual tuning of parameters in training tasks, inspired by the inverse decrease of the word frequency of TF-IDF, this paper proposes an adaptive knowledge distillation method (AKD). This core idea of the method is based on the Cosine similarity score which is calculated by the probabilistic outputs similarity measurement in two networks. The higher the score, the closer the student model's understanding of knowledge is to the teacher model, and the lower the degree of imitation of the teacher model. On the contrary, we need to increase the degree to which the student model imitates the teacher model. Interestingly, this method can improve distillation model quality. Experimental results show that the proposed method significantly improves the precision, recall and F1 value of text classification tasks. However, training speed of AKD is slightly slower than baseline models. This study provides new insights into knowledge distillation.","PeriodicalId":164745,"journal":{"name":"2021 IEEE International Conference on Emergency Science and Information Technology (ICESIT)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"An adaptive knowledge distillation algorithm for text classification\",\"authors\":\"Zuqin Chen, Tingkai Hu, Chao Chen, Jike Ge, Chengzhi Wu, Wenjun Cheng\",\"doi\":\"10.1109/ICESIT53460.2021.9696948\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Using knowledge distillation to compress pre-trained models such as Bert has proven to be highly effective in text classification tasks. However, the overhead of tuning parameters manually still hinders their application in practice. To alleviate the cost of manual tuning of parameters in training tasks, inspired by the inverse decrease of the word frequency of TF-IDF, this paper proposes an adaptive knowledge distillation method (AKD). This core idea of the method is based on the Cosine similarity score which is calculated by the probabilistic outputs similarity measurement in two networks. The higher the score, the closer the student model's understanding of knowledge is to the teacher model, and the lower the degree of imitation of the teacher model. On the contrary, we need to increase the degree to which the student model imitates the teacher model. Interestingly, this method can improve distillation model quality. Experimental results show that the proposed method significantly improves the precision, recall and F1 value of text classification tasks. However, training speed of AKD is slightly slower than baseline models. 
This study provides new insights into knowledge distillation.\",\"PeriodicalId\":164745,\"journal\":{\"name\":\"2021 IEEE International Conference on Emergency Science and Information Technology (ICESIT)\",\"volume\":\"24 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-11-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 IEEE International Conference on Emergency Science and Information Technology (ICESIT)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICESIT53460.2021.9696948\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE International Conference on Emergency Science and Information Technology (ICESIT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICESIT53460.2021.9696948","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
An adaptive knowledge distillation algorithm for text classification
Using knowledge distillation to compress pre-trained models such as BERT has proven highly effective in text classification tasks. However, the overhead of manually tuning parameters still hinders their application in practice. To reduce the cost of manual parameter tuning during training, and inspired by the inverse relationship between word frequency and weight in TF-IDF, this paper proposes an adaptive knowledge distillation method (AKD). The core idea of the method is a cosine similarity score computed by measuring the similarity between the probabilistic outputs of the two networks. The higher the score, the closer the student model's understanding of the knowledge is to the teacher model's, and the less the student needs to imitate the teacher; conversely, when the score is low, the degree to which the student model imitates the teacher model is increased. Interestingly, this method improves the quality of the distilled model. Experimental results show that the proposed method significantly improves precision, recall, and F1 score on text classification tasks, although AKD trains slightly more slowly than the baseline models. This study provides new insights into knowledge distillation.
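To make the core idea concrete, the sketch below shows one plausible way to weight the imitation term by the cosine similarity between the student's and teacher's softened output distributions, as the abstract describes. The function name, the exact weighting scheme (alpha = 1 - similarity), and the combination with a cross-entropy term are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def adaptive_distillation_loss(student_logits, teacher_logits, labels, temperature=2.0):
    """Hypothetical adaptive KD loss: the weight on the imitation (KL) term
    shrinks as the student's softened outputs become more similar, by cosine
    similarity, to the teacher's."""
    # Softened distributions for student (log-probs) and teacher (probs)
    student_log_soft = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_soft = F.softmax(teacher_logits / temperature, dim=-1)

    # Cosine similarity between the probability vectors, averaged over the batch
    sim = F.cosine_similarity(student_log_soft.exp(), teacher_soft, dim=-1).mean()

    # Higher similarity -> less imitation of the teacher; lower similarity -> more
    alpha = 1.0 - sim.detach()

    # Standard KD terms: KL divergence to the teacher plus cross-entropy on labels
    kd_loss = F.kl_div(student_log_soft, teacher_soft, reduction="batchmean") * temperature ** 2
    ce_loss = F.cross_entropy(student_logits, labels)

    return alpha * kd_loss + (1.0 - alpha) * ce_loss
```

Detaching the similarity score keeps the adaptive weight from being backpropagated, so it acts purely as a per-batch scaling factor; this is one design choice among several the paper's method could take.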