{"title":"P$^2$-ViT:用于全量化视觉变换器的二重幂后训练量化和加速技术","authors":"Huihong Shi;Xin Cheng;Wendong Mao;Zhongfeng Wang","doi":"10.1109/TVLSI.2024.3422684","DOIUrl":null,"url":null,"abstract":"Vision transformers (ViTs) have excelled in computer vision (CV) tasks but are memory-consuming and computation-intensive, challenging their deployment on resource-constrained devices. To tackle this limitation, prior works have explored ViT-tailored quantization algorithms but retained floating-point scaling factors, which yield nonnegligible requantization overhead, limiting ViTs’ hardware efficiency and motivating more hardware-friendly solutions. To this end, we propose P2-ViT, the first power-of-two (PoT) posttraining quantization (PTQ) and acceleration framework to accelerate fully quantized ViTs. Specifically, as for quantization, we explore a dedicated quantization scheme to effectively quantize ViTs with PoT scaling factors, thus minimizing the requantization overhead. Furthermore, we propose coarse-to-fine automatic mixed-precision quantization to enable better accuracy-efficiency tradeoffs. In terms of hardware, we develop a dedicated chunk-based accelerator featuring multiple tailored subprocessors to individually handle ViTs’ different types of operations, alleviating reconfigurable overhead. In addition, we design a tailored row-stationary dataflow to seize the pipeline processing opportunity introduced by our PoT scaling factors, thereby enhancing throughput. Extensive experiments consistently validate P2-ViT’s effectiveness. Particularly, we offer comparable or even superior quantization performance with PoT scaling factors when compared with the counterpart with floating-point scaling factors. Besides, we achieve up to \n<inline-formula> <tex-math>$10.1\\times $ </tex-math></inline-formula>\n speedup and \n<inline-formula> <tex-math>$36.8\\times $ </tex-math></inline-formula>\n energy saving over GPU’s Turing Tensor Cores, and up to \n<inline-formula> <tex-math>$1.84\\times $ </tex-math></inline-formula>\n higher computation utilization efficiency against SOTA quantization-based ViT accelerators. Codes are available at \n<uri>https://github.com/shihuihong214/P2-ViT</uri>\n.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"32 9","pages":"1704-1717"},"PeriodicalIF":2.8000,"publicationDate":"2024-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"P2-ViT: Power-of-Two Post-Training Quantization and Acceleration for Fully Quantized Vision Transformer\",\"authors\":\"Huihong Shi;Xin Cheng;Wendong Mao;Zhongfeng Wang\",\"doi\":\"10.1109/TVLSI.2024.3422684\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Vision transformers (ViTs) have excelled in computer vision (CV) tasks but are memory-consuming and computation-intensive, challenging their deployment on resource-constrained devices. To tackle this limitation, prior works have explored ViT-tailored quantization algorithms but retained floating-point scaling factors, which yield nonnegligible requantization overhead, limiting ViTs’ hardware efficiency and motivating more hardware-friendly solutions. To this end, we propose P2-ViT, the first power-of-two (PoT) posttraining quantization (PTQ) and acceleration framework to accelerate fully quantized ViTs. Specifically, as for quantization, we explore a dedicated quantization scheme to effectively quantize ViTs with PoT scaling factors, thus minimizing the requantization overhead. Furthermore, we propose coarse-to-fine automatic mixed-precision quantization to enable better accuracy-efficiency tradeoffs. In terms of hardware, we develop a dedicated chunk-based accelerator featuring multiple tailored subprocessors to individually handle ViTs’ different types of operations, alleviating reconfigurable overhead. In addition, we design a tailored row-stationary dataflow to seize the pipeline processing opportunity introduced by our PoT scaling factors, thereby enhancing throughput. Extensive experiments consistently validate P2-ViT’s effectiveness. Particularly, we offer comparable or even superior quantization performance with PoT scaling factors when compared with the counterpart with floating-point scaling factors. Besides, we achieve up to \\n<inline-formula> <tex-math>$10.1\\\\times $ </tex-math></inline-formula>\\n speedup and \\n<inline-formula> <tex-math>$36.8\\\\times $ </tex-math></inline-formula>\\n energy saving over GPU’s Turing Tensor Cores, and up to \\n<inline-formula> <tex-math>$1.84\\\\times $ </tex-math></inline-formula>\\n higher computation utilization efficiency against SOTA quantization-based ViT accelerators. Codes are available at \\n<uri>https://github.com/shihuihong214/P2-ViT</uri>\\n.\",\"PeriodicalId\":13425,\"journal\":{\"name\":\"IEEE Transactions on Very Large Scale Integration (VLSI) Systems\",\"volume\":\"32 9\",\"pages\":\"1704-1717\"},\"PeriodicalIF\":2.8000,\"publicationDate\":\"2024-07-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Very Large Scale Integration (VLSI) Systems\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10596041/\",\"RegionNum\":2,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10596041/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
引用次数: 0
摘要
视觉变换器(ViT)在计算机视觉(CV)任务中表现出色,但其内存消耗大、计算密集,这对在资源受限的设备上部署视觉变换器提出了挑战。为解决这一限制,之前的工作探索了针对 ViT 的量化算法,但保留了浮点缩放因子,这产生了不可忽略的重新量化开销,限制了 ViT 的硬件效率,并激发了对硬件更友好的解决方案。为此,我们提出了 P2-ViT,这是首个二幂(PoT)训练后量化(PTQ)和加速框架,用于加速完全量化的 ViT。具体来说,在量化方面,我们探索了一种专用量化方案,以有效量化具有 PoT 缩放因子的 ViT,从而最大限度地减少重新量化开销。此外,我们还提出了从粗到细的自动混合精度量化方案,以实现更好的精度-效率权衡。在硬件方面,我们开发了一种基于分块的专用加速器,具有多个定制的子处理器,可单独处理 ViTs 不同类型的操作,从而减轻了可重新配置的开销。此外,我们还设计了量身定制的行静态数据流,以抓住 PoT 扩展因子带来的流水线处理机会,从而提高吞吐量。大量实验不断验证 P2-ViT 的有效性。特别是,与使用浮点缩放因子的对应方案相比,我们使用 PoT 缩放因子提供了相当甚至更优越的量化性能。此外,与GPU的图灵张量核相比,我们实现了高达10.1倍的提速和36.8倍的节能,与基于SOTA量化的ViT加速器相比,我们实现了高达1.84倍的计算利用效率。代码见 https://github.com/shihuihong214/P2-ViT。
P2-ViT: Power-of-Two Post-Training Quantization and Acceleration for Fully Quantized Vision Transformer
Vision transformers (ViTs) have excelled in computer vision (CV) tasks but are memory-consuming and computation-intensive, challenging their deployment on resource-constrained devices. To tackle this limitation, prior works have explored ViT-tailored quantization algorithms but retained floating-point scaling factors, which yield nonnegligible requantization overhead, limiting ViTs’ hardware efficiency and motivating more hardware-friendly solutions. To this end, we propose P2-ViT, the first power-of-two (PoT) posttraining quantization (PTQ) and acceleration framework to accelerate fully quantized ViTs. Specifically, as for quantization, we explore a dedicated quantization scheme to effectively quantize ViTs with PoT scaling factors, thus minimizing the requantization overhead. Furthermore, we propose coarse-to-fine automatic mixed-precision quantization to enable better accuracy-efficiency tradeoffs. In terms of hardware, we develop a dedicated chunk-based accelerator featuring multiple tailored subprocessors to individually handle ViTs’ different types of operations, alleviating reconfigurable overhead. In addition, we design a tailored row-stationary dataflow to seize the pipeline processing opportunity introduced by our PoT scaling factors, thereby enhancing throughput. Extensive experiments consistently validate P2-ViT’s effectiveness. Particularly, we offer comparable or even superior quantization performance with PoT scaling factors when compared with the counterpart with floating-point scaling factors. Besides, we achieve up to
$10.1\times $
speedup and
$36.8\times $
energy saving over GPU’s Turing Tensor Cores, and up to
$1.84\times $
higher computation utilization efficiency against SOTA quantization-based ViT accelerators. Codes are available at
https://github.com/shihuihong214/P2-ViT
.
期刊介绍:
The IEEE Transactions on VLSI Systems is published as a monthly journal under the co-sponsorship of the IEEE Circuits and Systems Society, the IEEE Computer Society, and the IEEE Solid-State Circuits Society.
Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chips and wafer fabrication, packaging, testing and systems applications. Generation of specifications, design and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor and process levels.
To address this critical area through a common forum, the IEEE Transactions on VLSI Systems have been founded. The editorial board, consisting of international experts, invites original papers which emphasize and merit the novel systems integration aspects of microelectronic systems including interactions among systems design and partitioning, logic and memory design, digital and analog circuit design, layout synthesis, CAD tools, chips and wafer fabrication, testing and packaging, and systems level qualification. Thus, the coverage of these Transactions will focus on VLSI/ULSI microelectronic systems integration.