An Investigation on Hardware-Aware Vision Transformer Scaling

ACM Transactions on Embedded Computing Systems; IF 2.8; Region 3, Computer Science; JCR Q2, Computer Science, Hardware & Architecture; Pub Date: 2023-08-21; DOI: 10.1145/3611387

Chaojian Li, Kyungmin Kim, Bichen Wu, Peizhao Zhang, Hang Zhang, Xiaoliang Dai, Péter Vajda, Y. Lin


Citations: 0

Abstract



Vision Transformer (ViT) has demonstrated promising performance in various computer vision tasks and has recently attracted considerable research attention. Many recent works have focused on proposing new architectures to improve ViT and on deploying it in real-world applications. However, little effort has been made to analyze and understand ViT’s architecture design space and its hardware-cost implications on different devices. In this work, by simply scaling ViT’s depth, width, input size, and other basic configurations, we show that a scaled vanilla ViT model without bells and whistles can achieve an accuracy-efficiency trade-off comparable or superior to that of most of the latest ViT variants. Specifically, compared to DeiT-Tiny, our scaled model achieves a \(\uparrow 1.9\%\) higher ImageNet top-1 accuracy under the same FLOPs and a \(\uparrow 3.7\%\) higher ImageNet top-1 accuracy under the same latency on an NVIDIA Edge GPU TX2. Motivated by this, we further investigate the extracted scaling strategies from the following two aspects: (1) “can these scaling strategies be transferred across different real hardware devices?”; and (2) “can these scaling strategies be transferred to different ViT variants and tasks?”. For (1), our exploration, based on various devices with different resource budgets, indicates that how well the strategies transfer depends on the underlying device together with its corresponding deployment tool; for (2), we validate that the scaling strategies obtained from a vanilla ViT model on an image classification task transfer effectively to the PiT model, a strong ViT variant targeting efficiency, as well as to object detection and video classification tasks. In particular, when transferred to PiT, our scaling strategies boost ImageNet top-1 accuracy from \(74.6\%\) to \(76.7\%\) (\(\uparrow 2.1\%\)) under the same 0.7G FLOPs; and when transferred to the COCO object detection task, the average precision is boosted by \(\uparrow 0.7\%\) under a similar throughput on a V100 GPU.
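To make the "same FLOPs" comparison above concrete, the following is a minimal sketch of a first-order cost model for a vanilla ViT as a function of depth, width, and input size. It is not the authors' method: the abstract does not give their scaling rules or cost model. The names `ViTConfig` and `vit_macs` are hypothetical, the count covers only matrix multiplications (layer norms, softmax, and biases are ignored), it follows the common convention of reporting multiply-accumulates as "FLOPs", and the candidate configurations are illustrative assumptions rather than models from the paper. The baseline uses the commonly cited DeiT-Tiny shape (depth 12, width 192, 224x224 input, 16x16 patches).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ViTConfig:
    """Hypothetical container for the scaling knobs named in the abstract."""
    depth: int            # number of transformer blocks
    width: int            # embedding dimension
    image_size: int       # square input resolution in pixels
    patch_size: int = 16
    mlp_ratio: int = 4    # MLP hidden size = mlp_ratio * width
    num_classes: int = 1000


def vit_macs(cfg: ViTConfig) -> int:
    """Approximate multiply-accumulate count for one forward pass.

    Only matrix multiplications are counted. The number of attention
    heads does not appear because it does not change this count.
    """
    patches = (cfg.image_size // cfg.patch_size) ** 2
    tokens = patches + 1  # patch tokens plus one class token
    d = cfg.width

    patch_embed = patches * (3 * cfg.patch_size ** 2) * d
    qkv_and_proj = 4 * tokens * d * d            # Q, K, V and output projections
    attention = 2 * tokens * tokens * d          # QK^T and attention-weighted V
    mlp = 2 * tokens * d * (cfg.mlp_ratio * d)   # two linear layers
    head = d * cfg.num_classes

    return patch_embed + cfg.depth * (qkv_and_proj + attention + mlp) + head


if __name__ == "__main__":
    baseline = ViTConfig(depth=12, width=192, image_size=224)
    budget = vit_macs(baseline)

    # Illustrative candidates only; these are assumptions, not the paper's models.
    candidates = [
        ViTConfig(depth=16, width=176, image_size=208),
        ViTConfig(depth=10, width=208, image_size=224),
        ViTConfig(depth=12, width=192, image_size=256),
    ]
    print(f"baseline: {budget / 1e9:.2f} GMACs")
    for cfg in candidates:
        macs = vit_macs(cfg)
        status = "within" if macs <= budget else "over"
        print(f"{cfg}: {macs / 1e9:.2f} GMACs ({status} budget)")
```

A count like this only fixes the compute budget. It says nothing about accuracy or about measured latency, which the abstract reports separately and finds to depend on the device and its deployment tool.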


Source journal

ACM Transactions on Embedded Computing Systems (Engineering & Technology - Computer Science: Software Engineering)

CiteScore: 3.70

Self-citation rate: 0.00%

Articles published: 138

Review time: 6 months

Journal description: The design of embedded computing systems, both the software and hardware, increasingly relies on sophisticated algorithms, analytical models, and methodologies. ACM Transactions on Embedded Computing Systems (TECS) aims to present the leading work relating to the analysis, design, behavior, and experience with embedded computing systems.