Student Can Also be a Good Teacher: Extracting Knowledge from Vision-and-Language Model for Cross-Modal Retrieval

Jun Rao, Tao Qian, Shuhan Qi, Yulin Wu, Qing Liao, Xuan Wang
DOI: 10.1145/3459637.3482194
Published in: Proceedings of the 30th ACM International Conference on Information & Knowledge Management
Publication date: 2021-10-26
Citations: 8

Abstract

Astounding results from transformer models with Vision-and-Language Pretraining (VLP) on joint vision-and-language downstream tasks have intrigued the multi-modal community. On the one hand, these models are usually so large that they are difficult to fine-tune and to serve in real-time online applications. On the other hand, naively compressing the original transformer blocks ignores the differences in information between modalities, which leads to a sharp decline in retrieval accuracy. In this work, we present a very light and effective compression method for cross-modal retrieval models. By adopting a novel random replacement strategy together with knowledge distillation, our module learns the teacher's knowledge with nearly half the number of parameters. Furthermore, our compression method achieves nearly 130x acceleration with acceptable accuracy. To counteract the sharp decline on retrieval tasks caused by compression, we introduce a co-attention interaction module that captures both modality-specific information and cross-modal interaction information. Experiments show that a multi-modal co-attention block is more suitable for cross-modal retrieval tasks than the original transformer encoder block.
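The random replacement strategy mentioned above can be illustrated with a small scheduler: during training, each teacher block is independently swapped for its smaller student block with some probability, which is annealed toward 1 so the student gradually takes over the whole forward path. This is a minimal hypothetical sketch of that idea (in the spirit of Theseus-style module replacement), not the authors' code; the names `replacement_schedule`, `assemble_forward_path`, and the linear annealing choice are assumptions.

```python
import random


def replacement_schedule(step, total_steps, p0=0.5):
    """Linearly anneal the replacement probability from p0 to 1.0,
    so student blocks gradually take over from teacher blocks."""
    return min(1.0, p0 + (1.0 - p0) * step / total_steps)


def assemble_forward_path(num_blocks, p, rng=None):
    """For each of num_blocks positions, pick the 'student' block with
    probability p, otherwise keep the frozen 'teacher' block."""
    rng = rng or random.Random(0)
    return ["student" if rng.random() < p else "teacher"
            for _ in range(num_blocks)]


# Halfway through training with p0=0.5, the replacement probability is 0.75,
# so roughly three quarters of the blocks on the path are student blocks.
p = replacement_schedule(step=500, total_steps=1000, p0=0.5)
path = assemble_forward_path(num_blocks=12, p=p)
```

At the end of training the probability reaches 1.0, every position uses the student block, and the teacher can be discarded; a distillation loss on the teacher's outputs would typically be applied alongside this schedule.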