{"title":"BinaryViT: Toward Efficient and Accurate Binary Vision Transformers","authors":"Junrui Xiao;Zhikai Li;Jianquan Li;Lianwei Yang;Qingyi Gu","doi":"10.1109/TCSVT.2024.3457610","DOIUrl":null,"url":null,"abstract":"Vision Transformers (ViTs) have emerged as the new fundamental architecture for most computer vision fields. However, the considerable memory and computation costs also hinder their application on resource-limited devices. Currently, binarization has demonstrated remarkable potential as a model compression technique in traditional Convolutional Neural Networks (CNNs), albeit with some accuracy loss. In this paper, we focus on binarization of ViTs, which is still under-studied and suffering a significant performance drop. We start with constructing a strong baseline of binary ViTs, integrating some of the best practices from binary CNNs, which forms the foundation of our exploration. Subsequently, we identify that the severe performance degradation of the baseline is mainly caused by the weight oscillation around the quantization boundary and the information distortion in the activation of ViTs. To address these challenges, we introduce BinaryViT, a precise full binarization framework tailored for Vision Transformers (ViTs), effectively pushing the binarization of ViTs to its limit. Specifically, we propose a novel gradient regularization scheme (GRS), which mitigates oscillations by fostering a smooth moving of latent weights to be away from the quantization boundary during the training process. Additionally, we have devised an Activation Shift Module (ASM) that dynamically adjusts the activation distribution prior to the sign function, thereby minimizing the information distortion stemming from the significant inter-channel variations. Extensive experiments on ImageNet dataset show that our BinaryViT consistently surpasses the strong baseline by 2.05% and improves the accuracy of fully binarized ViTs to a usable level. Furthermore, our method achieves impressive savings of <inline-formula> <tex-math>$16.2\\times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$17.7\\times $ </tex-math></inline-formula> in model size and OPs compared to the full-precision DeiT-S.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 1","pages":"195-206"},"PeriodicalIF":11.1000,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10671591/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0
Abstract
Vision Transformers (ViTs) have emerged as the new fundamental architecture for most computer vision fields. However, their considerable memory and computation costs hinder deployment on resource-limited devices. Binarization has demonstrated remarkable potential as a model compression technique for traditional Convolutional Neural Networks (CNNs), albeit with some accuracy loss. In this paper, we focus on the binarization of ViTs, which remains under-studied and suffers a significant performance drop. We start by constructing a strong baseline for binary ViTs that integrates some of the best practices from binary CNNs, forming the foundation of our exploration. We then identify that the severe performance degradation of this baseline is mainly caused by weight oscillation around the quantization boundary and information distortion in the activations of ViTs. To address these challenges, we introduce BinaryViT, a precise full-binarization framework tailored for ViTs that effectively pushes the binarization of ViTs to its limit. Specifically, we propose a novel gradient regularization scheme (GRS) that mitigates oscillation by encouraging latent weights to move smoothly away from the quantization boundary during training. Additionally, we devise an Activation Shift Module (ASM) that dynamically adjusts the activation distribution before the sign function, thereby minimizing the information distortion caused by significant inter-channel variation. Extensive experiments on the ImageNet dataset show that BinaryViT consistently surpasses the strong baseline by 2.05% and improves the accuracy of fully binarized ViTs to a usable level. Furthermore, our method achieves impressive savings of $16.2\times$ in model size and $17.7\times$ in OPs compared to the full-precision DeiT-S.
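To make the two mechanisms described above concrete, here is a minimal PyTorch sketch of how they might look. It assumes a standard straight-through estimator for the sign function; ActivationShift is a static, learnable-shift simplification of the paper's (dynamic) ASM, and boundary_penalty is a hypothetical regularizer in the spirit of GRS. All names and the penalty form are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn


class BinarySign(torch.autograd.Function):
    """Sign binarization with a straight-through estimator (STE),
    the standard practice in binary networks."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Pass gradients only where |x| <= 1 (common STE clipping).
        return grad_out * (x.abs() <= 1).to(grad_out.dtype)


class ActivationShift(nn.Module):
    """Learnable per-channel shift applied before sign(); a static
    simplification of the paper's dynamic Activation Shift Module."""

    def __init__(self, num_channels: int):
        super().__init__()
        self.shift = nn.Parameter(torch.zeros(num_channels))

    def forward(self, x):
        # x: (batch, tokens, channels). Re-centering each channel
        # reduces the information lost at sign() when channel
        # distributions differ strongly from one another.
        return BinarySign.apply(x - self.shift)


def boundary_penalty(latent_weights, eps: float = 0.1):
    """Hypothetical regularizer in the spirit of GRS: penalize latent
    weights sitting near the quantization boundary (zero), where small
    gradient updates flip the binarized value and cause oscillation."""
    return torch.exp(-latent_weights.abs() / eps).mean()

During training, such a penalty would be added to the task loss, e.g. loss = task_loss + lam * boundary_penalty(layer.weight), so that latent weights near zero are pushed outward and their binarized values flip less often between updates.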
Journal Introduction:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.