{"title":"Vortex: Efficient Sample-Free Dynamic Tensor Program Optimization via Hardware-aware Strategy Space Hierarchization","authors":"Yangjie Zhou, Honglin Zhu, Qian Qiu, Weihao Cui, Zihan Liu, Cong Guo, Siyuan Feng, Jintao Meng, Haidong Lan, Jingwen Leng, Wenxi Zhu, Minwen Deng","doi":"arxiv-2409.01075","DOIUrl":null,"url":null,"abstract":"Dynamic-shape deep neural networks (DNNs) are rapidly evolving, attracting\nattention for their ability to handle variable input sizes in real-time\napplications. However, existing compilation optimization methods for such\nnetworks often rely heavily on predefined samples to guide the compilation\nprocess, which restricts their adaptability and efficiency. These sample-driven\nmethods struggle to efficiently manage the diverse and unpredictable shapes\nencountered in real-world scenarios, often resulting in suboptimal performance. To tackle these issues, we introduce Vortex, a hardware-driven and\nsample-free compiler tailored for dynamic-shape tensor programs. Vortex\ncapitalizes on detailed hardware information and hierarchizes the strategy\nspace to facilitate high-performance code generation without relying on runtime\nshape samples. It features a unique bidirectional compilation workflow,\ncombining top-down abstraction for aligning tensor program execution with\nhardware hierarchies and bottom-up kernel construction to narrow the search\nspace, enabling Vortex to achieve remarkable efficiency. Comprehensive\nevaluations confirm that Vortex reduces compilation time by $176\\times$\ncompared to the existing dynamic-shape compiler. Additionally, it substantially\noutperforms existing vendor-provided libraries and dynamic-shape compilers on\nboth CPU and GPU platforms, delivering speedups of $2.53\\times$ and\n$3.01\\times$, respectively.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"12 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Distributed, Parallel, and Cluster Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.01075","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Dynamic-shape deep neural networks (DNNs) are rapidly evolving, attracting attention for their ability to handle variable input sizes in real-time applications. However, existing compilation optimization methods for such networks often rely heavily on predefined samples to guide the compilation process, which restricts their adaptability and efficiency. These sample-driven methods struggle to efficiently manage the diverse and unpredictable shapes encountered in real-world scenarios, often resulting in suboptimal performance.

To tackle these issues, we introduce Vortex, a hardware-driven and sample-free compiler tailored for dynamic-shape tensor programs. Vortex capitalizes on detailed hardware information and hierarchizes the strategy space to facilitate high-performance code generation without relying on runtime shape samples. It features a unique bidirectional compilation workflow, combining top-down abstraction for aligning tensor program execution with hardware hierarchies and bottom-up kernel construction to narrow the search space, enabling Vortex to achieve remarkable efficiency.

Comprehensive evaluations confirm that Vortex reduces compilation time by $176\times$ compared to the existing dynamic-shape compiler. Additionally, it substantially outperforms existing vendor-provided libraries and dynamic-shape compilers on both CPU and GPU platforms, delivering speedups of $2.53\times$ and $3.01\times$, respectively.
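
To make the hierarchization idea concrete, the Python sketch below illustrates, under loose assumptions, how a tiling strategy space can be built from hardware limits alone and only queried once a concrete shape arrives at dispatch time. It is not the Vortex implementation or API: the HardwareLevel, HIERARCHY, bottom_up_microkernels, top_down_candidates, and toy_cost names are hypothetical, the hardware numbers are placeholders, and the cost model is a toy stand-in for a real hardware-aware analysis.

# Illustrative sketch only -- NOT the actual Vortex implementation or API.
# It assumes a hypothetical two-level hardware hierarchy ("block"/"thread"),
# placeholder hardware numbers, and a toy analytic cost model, purely to show
# how top-down hierarchy composition plus bottom-up, hardware-bounded kernel
# enumeration can bound a tiling search space without runtime shape samples.
from dataclasses import dataclass
from itertools import product


@dataclass
class HardwareLevel:
    name: str
    max_tile: int      # largest tile this level can hold (registers, shared memory, ...)
    parallelism: int   # number of parallel units available at this level


# Hypothetical GPU-like hierarchy; the numbers are placeholders, not real specs.
HIERARCHY = [
    HardwareLevel("block",  max_tile=128, parallelism=108),   # outer level
    HardwareLevel("thread", max_tile=8,   parallelism=256),   # inner level
]


def bottom_up_microkernels(level):
    """Enumerate per-level tile sizes directly from hardware limits
    (powers of two up to max_tile), independent of any tensor shape."""
    t = 1
    while t <= level.max_tile:
        yield t
        t *= 2


def top_down_candidates():
    """Compose per-level tiles into full strategies top-down, pruning combinations
    that violate hardware limits: the inner tile must fit in the outer one, and the
    outer/inner ratio must not exceed the inner level's parallelism."""
    per_level = [list(bottom_up_microkernels(lv)) for lv in HIERARCHY]
    for combo in product(*per_level):
        valid = all(
            combo[i] >= combo[i + 1]
            and combo[i] // combo[i + 1] <= HIERARCHY[i + 1].parallelism
            for i in range(len(combo) - 1)
        )
        if valid:
            yield {lv.name: t for lv, t in zip(HIERARCHY, combo)}


def toy_cost(strategy, dim):
    """Toy analytic cost: block waves times serial per-thread work, plus small
    per-block overhead and padding-waste penalties. A real hardware-aware model
    would account for bandwidth, occupancy, latency, and so on."""
    block, thread = strategy["block"], strategy["thread"]
    padded = -(-dim // block) * block                 # round dim up to a block multiple
    n_blocks = padded // block
    waves = -(-n_blocks // HIERARCHY[0].parallelism)  # rounds of blocks across the device
    return waves * thread + 0.05 * n_blocks + 0.01 * (padded - dim)


if __name__ == "__main__":
    # The candidate set above was built with no knowledge of the input shape;
    # the shape only appears here, at dispatch time -- the "sample-free" part.
    for dim in (37, 512, 4099):
        best = min(top_down_candidates(), key=lambda s: toy_cost(s, dim))
        print(f"dim={dim}: pick {best}")

Because every candidate is derived from hardware capacities rather than profiled shape samples, the enumeration and pruning happen once at compile time, and picking a strategy for a newly seen shape reduces to a cheap analytic lookup over a small, hardware-valid set.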