{"title":"BinaryViT: Toward Efficient and Accurate Binary Vision Transformers","authors":"Junrui Xiao;Zhikai Li;Jianquan Li;Lianwei Yang;Qingyi Gu","doi":"10.1109/TCSVT.2024.3457610","DOIUrl":null,"url":null,"abstract":"Vision Transformers (ViTs) have emerged as the new fundamental architecture for most computer vision fields. However, the considerable memory and computation costs also hinder their application on resource-limited devices. Currently, binarization has demonstrated remarkable potential as a model compression technique in traditional Convolutional Neural Networks (CNNs), albeit with some accuracy loss. In this paper, we focus on binarization of ViTs, which is still under-studied and suffering a significant performance drop. We start with constructing a strong baseline of binary ViTs, integrating some of the best practices from binary CNNs, which forms the foundation of our exploration. Subsequently, we identify that the severe performance degradation of the baseline is mainly caused by the weight oscillation around the quantization boundary and the information distortion in the activation of ViTs. To address these challenges, we introduce BinaryViT, a precise full binarization framework tailored for Vision Transformers (ViTs), effectively pushing the binarization of ViTs to its limit. Specifically, we propose a novel gradient regularization scheme (GRS), which mitigates oscillations by fostering a smooth moving of latent weights to be away from the quantization boundary during the training process. Additionally, we have devised an Activation Shift Module (ASM) that dynamically adjusts the activation distribution prior to the sign function, thereby minimizing the information distortion stemming from the significant inter-channel variations. Extensive experiments on ImageNet dataset show that our BinaryViT consistently surpasses the strong baseline by 2.05% and improves the accuracy of fully binarized ViTs to a usable level. Furthermore, our method achieves impressive savings of <inline-formula> <tex-math>$16.2\\times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$17.7\\times $ </tex-math></inline-formula> in model size and OPs compared to the full-precision DeiT-S.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 1","pages":"195-206"},"PeriodicalIF":11.1000,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10671591/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0
Abstract
Vision Transformers (ViTs) have emerged as the new fundamental architecture for most computer vision fields. However, their considerable memory and computation costs hinder deployment on resource-limited devices. Binarization has demonstrated remarkable potential as a model compression technique for traditional Convolutional Neural Networks (CNNs), albeit with some accuracy loss. In this paper, we focus on the binarization of ViTs, which remains under-studied and suffers a significant performance drop. We start by constructing a strong baseline for binary ViTs that integrates some of the best practices from binary CNNs, forming the foundation of our exploration. We then identify that the severe performance degradation of this baseline is mainly caused by weight oscillation around the quantization boundary and information distortion in the activations of ViTs. To address these challenges, we introduce BinaryViT, a precise full-binarization framework tailored for ViTs that effectively pushes the binarization of ViTs to its limit. Specifically, we propose a novel gradient regularization scheme (GRS) that mitigates oscillation by encouraging latent weights to move smoothly away from the quantization boundary during training. Additionally, we devise an Activation Shift Module (ASM) that dynamically adjusts the activation distribution before the sign function, thereby minimizing the information distortion caused by significant inter-channel variation. Extensive experiments on the ImageNet dataset show that BinaryViT consistently surpasses the strong baseline by 2.05% and improves the accuracy of fully binarized ViTs to a usable level. Furthermore, our method achieves impressive savings of $16.2\times$ in model size and $17.7\times$ in OPs compared to the full-precision DeiT-S.
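To make the two mechanisms described above concrete, here is a minimal PyTorch sketch of how they might look. It assumes a standard straight-through estimator for the sign function; ActivationShift is a static, learnable-shift simplification of the paper's (dynamic) ASM, and boundary_penalty is a hypothetical regularizer in the spirit of GRS. All names and the penalty form are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn


class BinarySign(torch.autograd.Function):
    """Sign binarization with a straight-through estimator (STE),
    the standard practice in binary networks."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Pass gradients only where |x| <= 1 (common STE clipping).
        return grad_out * (x.abs() <= 1).to(grad_out.dtype)


class ActivationShift(nn.Module):
    """Learnable per-channel shift applied before sign(); a static
    simplification of the paper's dynamic Activation Shift Module."""

    def __init__(self, num_channels: int):
        super().__init__()
        self.shift = nn.Parameter(torch.zeros(num_channels))

    def forward(self, x):
        # x: (batch, tokens, channels). Re-centering each channel
        # reduces the information lost at sign() when channel
        # distributions differ strongly from one another.
        return BinarySign.apply(x - self.shift)


def boundary_penalty(latent_weights, eps: float = 0.1):
    """Hypothetical regularizer in the spirit of GRS: penalize latent
    weights sitting near the quantization boundary (zero), where small
    gradient updates flip the binarized value and cause oscillation."""
    return torch.exp(-latent_weights.abs() / eps).mean()

During training, such a penalty would be added to the task loss, e.g. loss = task_loss + lam * boundary_penalty(layer.weight), so that latent weights near zero are pushed outward and their binarized values flip less often between updates.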
Journal Introduction:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.