{"title":"涡旋:通过硬件感知的策略空间分层实现高效的无采样动态张量程序优化","authors":"Yangjie Zhou, Honglin Zhu, Qian Qiu, Weihao Cui, Zihan Liu, Cong Guo, Siyuan Feng, Jintao Meng, Haidong Lan, Jingwen Leng, Wenxi Zhu, Minwen Deng","doi":"arxiv-2409.01075","DOIUrl":null,"url":null,"abstract":"Dynamic-shape deep neural networks (DNNs) are rapidly evolving, attracting\nattention for their ability to handle variable input sizes in real-time\napplications. However, existing compilation optimization methods for such\nnetworks often rely heavily on predefined samples to guide the compilation\nprocess, which restricts their adaptability and efficiency. These sample-driven\nmethods struggle to efficiently manage the diverse and unpredictable shapes\nencountered in real-world scenarios, often resulting in suboptimal performance. To tackle these issues, we introduce Vortex, a hardware-driven and\nsample-free compiler tailored for dynamic-shape tensor programs. Vortex\ncapitalizes on detailed hardware information and hierarchizes the strategy\nspace to facilitate high-performance code generation without relying on runtime\nshape samples. It features a unique bidirectional compilation workflow,\ncombining top-down abstraction for aligning tensor program execution with\nhardware hierarchies and bottom-up kernel construction to narrow the search\nspace, enabling Vortex to achieve remarkable efficiency. Comprehensive\nevaluations confirm that Vortex reduces compilation time by $176\\times$\ncompared to the existing dynamic-shape compiler. Additionally, it substantially\noutperforms existing vendor-provided libraries and dynamic-shape compilers on\nboth CPU and GPU platforms, delivering speedups of $2.53\\times$ and\n$3.01\\times$, respectively.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"12 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Vortex: Efficient Sample-Free Dynamic Tensor Program Optimization via Hardware-aware Strategy Space Hierarchization\",\"authors\":\"Yangjie Zhou, Honglin Zhu, Qian Qiu, Weihao Cui, Zihan Liu, Cong Guo, Siyuan Feng, Jintao Meng, Haidong Lan, Jingwen Leng, Wenxi Zhu, Minwen Deng\",\"doi\":\"arxiv-2409.01075\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Dynamic-shape deep neural networks (DNNs) are rapidly evolving, attracting\\nattention for their ability to handle variable input sizes in real-time\\napplications. However, existing compilation optimization methods for such\\nnetworks often rely heavily on predefined samples to guide the compilation\\nprocess, which restricts their adaptability and efficiency. These sample-driven\\nmethods struggle to efficiently manage the diverse and unpredictable shapes\\nencountered in real-world scenarios, often resulting in suboptimal performance. To tackle these issues, we introduce Vortex, a hardware-driven and\\nsample-free compiler tailored for dynamic-shape tensor programs. Vortex\\ncapitalizes on detailed hardware information and hierarchizes the strategy\\nspace to facilitate high-performance code generation without relying on runtime\\nshape samples. It features a unique bidirectional compilation workflow,\\ncombining top-down abstraction for aligning tensor program execution with\\nhardware hierarchies and bottom-up kernel construction to narrow the search\\nspace, enabling Vortex to achieve remarkable efficiency. 
Comprehensive\\nevaluations confirm that Vortex reduces compilation time by $176\\\\times$\\ncompared to the existing dynamic-shape compiler. Additionally, it substantially\\noutperforms existing vendor-provided libraries and dynamic-shape compilers on\\nboth CPU and GPU platforms, delivering speedups of $2.53\\\\times$ and\\n$3.01\\\\times$, respectively.\",\"PeriodicalId\":501422,\"journal\":{\"name\":\"arXiv - CS - Distributed, Parallel, and Cluster Computing\",\"volume\":\"12 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Distributed, Parallel, and Cluster Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.01075\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Distributed, Parallel, and Cluster Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.01075","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Vortex: Efficient Sample-Free Dynamic Tensor Program Optimization via Hardware-aware Strategy Space Hierarchization
Dynamic-shape deep neural networks (DNNs) are rapidly evolving, attracting
attention for their ability to handle variable input sizes in real-time
applications. However, existing compilation optimization methods for such
networks often rely heavily on predefined samples to guide the compilation
process, which restricts their adaptability and efficiency. These sample-driven
methods struggle to efficiently manage the diverse and unpredictable shapes
encountered in real-world scenarios, often resulting in suboptimal performance. To tackle these issues, we introduce Vortex, a hardware-driven and
sample-free compiler tailored for dynamic-shape tensor programs. Vortex
capitalizes on detailed hardware information and hierarchizes the strategy
space to facilitate high-performance code generation without relying on runtime
shape samples. It features a bidirectional compilation workflow: top-down abstraction aligns tensor program execution with the hardware hierarchy, while bottom-up kernel construction narrows the search space, enabling Vortex to achieve high compilation efficiency. Comprehensive evaluations show that Vortex reduces compilation time by $176\times$ compared with an existing dynamic-shape compiler. Additionally, it substantially outperforms existing vendor-provided libraries and dynamic-shape compilers on both CPU and GPU platforms, delivering speedups of $2.53\times$ and $3.01\times$, respectively.
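
To make the abstract's two mechanisms concrete, the sketch below illustrates, in plain Python, one way a hardware-aware, sample-free workflow could be organized: the strategy space is hierarchized top-down by pruning tile candidates against fixed hardware limits, and pre-built fixed-shape micro-kernels are then selected bottom-up for the actual runtime shape via an analytic cost model. Everything here is an illustrative assumption: the names (`Hardware`, `MicroKernel`, `enumerate_micro_kernels`, `analytic_cost`, `dispatch`) and the hardware constants are hypothetical and are not Vortex's actual API or algorithm.

```python
# Illustrative sketch only (not the paper's code): a hardware-pruned tiling
# strategy space for a dynamic-shape matmul, plus shape-aware dispatch.
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Hardware:                       # assumed GPU-like resource limits
    smem_bytes: int = 48 * 1024       # shared memory per thread block
    regs_per_thread: int = 255        # register budget per thread
    threads_per_block: int = 256


@dataclass(frozen=True)
class MicroKernel:                    # a fixed-shape matmul tile (bm x bn x bk)
    bm: int
    bn: int
    bk: int


def enumerate_micro_kernels(hw: Hardware) -> list[MicroKernel]:
    """Top-down pruning: keep only tiles that fit the hardware hierarchy.

    The candidate space is narrowed using hardware limits alone (shared-memory
    footprint, register pressure), so no runtime shape samples are needed."""
    kernels = []
    for bm, bn, bk in product([16, 32, 64, 128, 256],
                              [16, 32, 64, 128, 256],
                              [8, 16, 32]):
        smem = 4 * bk * (bm + bn)                  # fp32 A-tile + B-tile staging
        regs = (bm * bn) // hw.threads_per_block   # accumulators per thread
        if smem <= hw.smem_bytes and regs <= hw.regs_per_thread:
            kernels.append(MicroKernel(bm, bn, bk))
    return kernels


def analytic_cost(k: MicroKernel, m: int, n: int, kk: int) -> float:
    """Bottom-up dispatch metric: score a micro-kernel on the *actual* shape.

    Cost = padded work / useful work (padding waste). A real compiler would
    also model occupancy and memory traffic analytically."""
    ceil = lambda a, b: -(-a // b)
    padded = (ceil(m, k.bm) * k.bm) * (ceil(n, k.bn) * k.bn) * (ceil(kk, k.bk) * k.bk)
    return padded / (m * n * kk)


def dispatch(kernels: list[MicroKernel], m: int, n: int, kk: int) -> MicroKernel:
    """Pick the cheapest pre-built micro-kernel for a runtime shape."""
    return min(kernels, key=lambda k: analytic_cost(k, m, n, kk))


if __name__ == "__main__":
    hw = Hardware()
    kernels = enumerate_micro_kernels(hw)          # built once per device
    for shape in [(1, 4096, 4096), (777, 512, 2048), (4096, 4096, 4096)]:
        best = dispatch(kernels, *shape)           # chosen per runtime shape
        print(shape, "->", best, f"waste={analytic_cost(best, *shape):.3f}x")
```

Because every pruning decision above depends only on hardware parameters, the candidate set can be built once per device before any input shape is seen; only the cheap analytic dispatch runs when a concrete shape arrives, which is the property that makes a sample-free workflow possible.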