AutoGTCO:图形和张量协同优化的图像识别与GPU上的变压器

Yang Bai, Xufeng Yao, Qi Sun, Bei Yu
{"title":"AutoGTCO:图形和张量协同优化的图像识别与GPU上的变压器","authors":"Yang Bai, Xufeng Yao, Qi Sun, Bei Yu","doi":"10.1109/ICCAD51958.2021.9643487","DOIUrl":null,"url":null,"abstract":"Performance optimization is the art of continuously seeking an effective mapping between algorithm and hardware. Existing deep learning compilers or frameworks optimize the computation graph via adapting transformations manually designed by expert efforts. We argue that these methods ignore some possible graph-level optimizations, thus it is difficult to generalize to emerging deep learning models or new operators. In this work, we propose AutoGTCO, a tensor program generation system for vision tasks with the transformer architecture on GPU. Compared with existing fusion strategies, AutoGTCO explores the optimization of operator fusion in the transformer model through a novel dynamic programming algorithm. Specifically, to construct an effective search space of the sampled programs, new sketch generation rules and a search policy are proposed for the batch matrix multiplication and softmax operators in each subgraph, which are capable of fusing them into large computation units, it can then map and transform them into efficient CUDA kernels. Overall, our evaluation on three real-world transformer-based vision tasks shows that AutoGTCO improves the execution performance relative to deep learning engine TensorRT by up to 1.38 ×.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":"{\"title\":\"AutoGTCO: Graph and Tensor Co-Optimize for Image Recognition with Transformers on GPU\",\"authors\":\"Yang Bai, Xufeng Yao, Qi Sun, Bei Yu\",\"doi\":\"10.1109/ICCAD51958.2021.9643487\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Performance optimization is the art of continuously seeking an effective mapping between algorithm and hardware. Existing deep learning compilers or frameworks optimize the computation graph via adapting transformations manually designed by expert efforts. We argue that these methods ignore some possible graph-level optimizations, thus it is difficult to generalize to emerging deep learning models or new operators. In this work, we propose AutoGTCO, a tensor program generation system for vision tasks with the transformer architecture on GPU. Compared with existing fusion strategies, AutoGTCO explores the optimization of operator fusion in the transformer model through a novel dynamic programming algorithm. Specifically, to construct an effective search space of the sampled programs, new sketch generation rules and a search policy are proposed for the batch matrix multiplication and softmax operators in each subgraph, which are capable of fusing them into large computation units, it can then map and transform them into efficient CUDA kernels. Overall, our evaluation on three real-world transformer-based vision tasks shows that AutoGTCO improves the execution performance relative to deep learning engine TensorRT by up to 1.38 ×.\",\"PeriodicalId\":370791,\"journal\":{\"name\":\"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"7\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICCAD51958.2021.9643487\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCAD51958.2021.9643487","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 7

摘要

性能优化是不断寻求算法和硬件之间有效映射的艺术。现有的深度学习编译器或框架通过适应由专家手工设计的转换来优化计算图。我们认为这些方法忽略了一些可能的图级优化,因此很难推广到新兴的深度学习模型或新的操作符。在这项工作中,我们提出了AutoGTCO,一个基于GPU的变压器架构的用于视觉任务的张量程序生成系统。与现有的融合策略相比,AutoGTCO通过一种新的动态规划算法探索了变压器模型中算子融合的优化问题。具体而言,为了构建采样程序的有效搜索空间,针对每个子图中的批矩阵乘法和softmax算子,提出了新的草图生成规则和搜索策略,能够将它们融合成大型计算单元,然后将它们映射并转换为高效的CUDA核。总体而言,我们对三个现实世界中基于变压器的视觉任务的评估表明,相对于深度学习引擎TensorRT, AutoGTCO将执行性能提高了1.38倍。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
AutoGTCO: Graph and Tensor Co-Optimize for Image Recognition with Transformers on GPU
Performance optimization is the art of continuously seeking an effective mapping between algorithm and hardware. Existing deep learning compilers or frameworks optimize the computation graph via adapting transformations manually designed by expert efforts. We argue that these methods ignore some possible graph-level optimizations, thus it is difficult to generalize to emerging deep learning models or new operators. In this work, we propose AutoGTCO, a tensor program generation system for vision tasks with the transformer architecture on GPU. Compared with existing fusion strategies, AutoGTCO explores the optimization of operator fusion in the transformer model through a novel dynamic programming algorithm. Specifically, to construct an effective search space of the sampled programs, new sketch generation rules and a search policy are proposed for the batch matrix multiplication and softmax operators in each subgraph, which are capable of fusing them into large computation units, it can then map and transform them into efficient CUDA kernels. Overall, our evaluation on three real-world transformer-based vision tasks shows that AutoGTCO improves the execution performance relative to deep learning engine TensorRT by up to 1.38 ×.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Fast and Accurate PPA Modeling with Transfer Learning Mobileware: A High-Performance MobileNet Accelerator with Channel Stationary Dataflow A General Hardware and Software Co-Design Framework for Energy-Efficient Edge AI ToPro: A Topology Projector and Waveguide Router for Wavelength-Routed Optical Networks-on-Chip Early Validation of SoCs Security Architecture Against Timing Flows Using SystemC-based VPs
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1