AutoGTCO:图形和张量协同优化的图像识别与GPU上的变压器

2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD) Pub Date : 2021-11-01 DOI:10.1109/ICCAD51958.2021.9643487

Yang Bai, Xufeng Yao, Qi Sun, Bei Yu

{"title":"AutoGTCO:图形和张量协同优化的图像识别与GPU上的变压器","authors":"Yang Bai, Xufeng Yao, Qi Sun, Bei Yu","doi":"10.1109/ICCAD51958.2021.9643487","DOIUrl":null,"url":null,"abstract":"Performance optimization is the art of continuously seeking an effective mapping between algorithm and hardware. Existing deep learning compilers or frameworks optimize the computation graph via adapting transformations manually designed by expert efforts. We argue that these methods ignore some possible graph-level optimizations, thus it is difficult to generalize to emerging deep learning models or new operators. In this work, we propose AutoGTCO, a tensor program generation system for vision tasks with the transformer architecture on GPU. Compared with existing fusion strategies, AutoGTCO explores the optimization of operator fusion in the transformer model through a novel dynamic programming algorithm. Specifically, to construct an effective search space of the sampled programs, new sketch generation rules and a search policy are proposed for the batch matrix multiplication and softmax operators in each subgraph, which are capable of fusing them into large computation units, it can then map and transform them into efficient CUDA kernels. Overall, our evaluation on three real-world transformer-based vision tasks shows that AutoGTCO improves the execution performance relative to deep learning engine TensorRT by up to 1.38 ×.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":"{\"title\":\"AutoGTCO: Graph and Tensor Co-Optimize for Image Recognition with Transformers on GPU\",\"authors\":\"Yang Bai, Xufeng Yao, Qi Sun, Bei Yu\",\"doi\":\"10.1109/ICCAD51958.2021.9643487\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Performance optimization is the art of continuously seeking an effective mapping between algorithm and hardware. Existing deep learning compilers or frameworks optimize the computation graph via adapting transformations manually designed by expert efforts. We argue that these methods ignore some possible graph-level optimizations, thus it is difficult to generalize to emerging deep learning models or new operators. In this work, we propose AutoGTCO, a tensor program generation system for vision tasks with the transformer architecture on GPU. Compared with existing fusion strategies, AutoGTCO explores the optimization of operator fusion in the transformer model through a novel dynamic programming algorithm. Specifically, to construct an effective search space of the sampled programs, new sketch generation rules and a search policy are proposed for the batch matrix multiplication and softmax operators in each subgraph, which are capable of fusing them into large computation units, it can then map and transform them into efficient CUDA kernels. Overall, our evaluation on three real-world transformer-based vision tasks shows that AutoGTCO improves the execution performance relative to deep learning engine TensorRT by up to 1.38 ×.\",\"PeriodicalId\":370791,\"journal\":{\"name\":\"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"7\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICCAD51958.2021.9643487\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCAD51958.2021.9643487","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 7

摘要

性能优化是不断寻求算法和硬件之间有效映射的艺术。现有的深度学习编译器或框架通过适应由专家手工设计的转换来优化计算图。我们认为这些方法忽略了一些可能的图级优化，因此很难推广到新兴的深度学习模型或新的操作符。在这项工作中，我们提出了AutoGTCO，一个基于GPU的变压器架构的用于视觉任务的张量程序生成系统。与现有的融合策略相比，AutoGTCO通过一种新的动态规划算法探索了变压器模型中算子融合的优化问题。具体而言，为了构建采样程序的有效搜索空间，针对每个子图中的批矩阵乘法和softmax算子，提出了新的草图生成规则和搜索策略，能够将它们融合成大型计算单元，然后将它们映射并转换为高效的CUDA核。总体而言，我们对三个现实世界中基于变压器的视觉任务的评估表明，相对于深度学习引擎TensorRT, AutoGTCO将执行性能提高了1.38倍。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

AutoGTCO: Graph and Tensor Co-Optimize for Image Recognition with Transformers on GPU

Performance optimization is the art of continuously seeking an effective mapping between algorithm and hardware. Existing deep learning compilers or frameworks optimize the computation graph via adapting transformations manually designed by expert efforts. We argue that these methods ignore some possible graph-level optimizations, thus it is difficult to generalize to emerging deep learning models or new operators. In this work, we propose AutoGTCO, a tensor program generation system for vision tasks with the transformer architecture on GPU. Compared with existing fusion strategies, AutoGTCO explores the optimization of operator fusion in the transformer model through a novel dynamic programming algorithm. Specifically, to construct an effective search space of the sampled programs, new sketch generation rules and a search policy are proposed for the batch matrix multiplication and softmax operators in each subgraph, which are capable of fusing them into large computation units, it can then map and transform them into efficient CUDA kernels. Overall, our evaluation on three real-world transformer-based vision tasks shows that AutoGTCO improves the execution performance relative to deep learning engine TensorRT by up to 1.38 ×.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)

自引率

0.00%

发文量