SPLAT: A framework for optimised GPU code-generation for SParse reguLar ATtention

Ahan Gupta, Yueming Yuan, Devansh Jain, Yuhao Ge, David Aponte, Yanqi Zhou, Charith Mendis
{"title":"SPLAT: A framework for optimised GPU code-generation for SParse reguLar ATtention","authors":"Ahan Gupta, Yueming Yuan, Devansh Jain, Yuhao Ge, David Aponte, Yanqi Zhou, Charith Mendis","doi":"arxiv-2407.16847","DOIUrl":null,"url":null,"abstract":"Multi-head-self-attention (MHSA) mechanisms achieve state-of-the-art (SOTA)\nperformance across natural language processing and vision tasks. However, their\nquadratic dependence on sequence lengths has bottlenecked inference speeds. To\ncircumvent this bottleneck, researchers have proposed various sparse-MHSA\nmodels, where a subset of full attention is computed. Despite their promise,\ncurrent sparse libraries and compilers do not support high-performance\nimplementations for diverse sparse-MHSA patterns due to the underlying sparse\nformats they operate on. These formats, which are typically designed for\nhigh-performance & scientific computing applications, are either curated for\nextreme amounts of random sparsity (<1% non-zero values), or specific sparsity\npatterns. However, the sparsity patterns in sparse-MHSA are moderately sparse\n(10-50% non-zero values) and varied, resulting in existing sparse-formats\ntrading off generality for performance. We bridge this gap, achieving both generality and performance, by proposing a\nnovel sparse format: affine-compressed-sparse-row (ACSR) and supporting\ncode-generation scheme, SPLAT, that generates high-performance implementations\nfor diverse sparse-MHSA patterns on GPUs. Core to our proposed format and code\ngeneration algorithm is the observation that common sparse-MHSA patterns have\nuniquely regular geometric properties. These properties, which can be analyzed\njust-in-time, expose novel optimizations and tiling strategies that SPLAT\nexploits to generate high-performance implementations for diverse patterns. To\ndemonstrate SPLAT's efficacy, we use it to generate code for various\nsparse-MHSA models, achieving geomean speedups of 2.05x and 4.05x over\nhand-written kernels written in triton and TVM respectively on A100 GPUs.\nMoreover, its interfaces are intuitive and easy to use with existing\nimplementations of MHSA in JAX.","PeriodicalId":501197,"journal":{"name":"arXiv - CS - Programming Languages","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Programming Languages","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2407.16847","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Multi-head-self-attention (MHSA) mechanisms achieve state-of-the-art (SOTA) performance across natural language processing and vision tasks. However, their quadratic dependence on sequence lengths has bottlenecked inference speeds. To circumvent this bottleneck, researchers have proposed various sparse-MHSA models, where a subset of full attention is computed. Despite their promise, current sparse libraries and compilers do not support high-performance implementations for diverse sparse-MHSA patterns due to the underlying sparse formats they operate on. These formats, which are typically designed for high-performance & scientific computing applications, are either curated for extreme amounts of random sparsity (<1% non-zero values), or specific sparsity patterns. However, the sparsity patterns in sparse-MHSA are moderately sparse (10-50% non-zero values) and varied, resulting in existing sparse-formats trading off generality for performance. We bridge this gap, achieving both generality and performance, by proposing a novel sparse format: affine-compressed-sparse-row (ACSR) and supporting code-generation scheme, SPLAT, that generates high-performance implementations for diverse sparse-MHSA patterns on GPUs. Core to our proposed format and code generation algorithm is the observation that common sparse-MHSA patterns have uniquely regular geometric properties. These properties, which can be analyzed just-in-time, expose novel optimizations and tiling strategies that SPLAT exploits to generate high-performance implementations for diverse patterns. To demonstrate SPLAT's efficacy, we use it to generate code for various sparse-MHSA models, achieving geomean speedups of 2.05x and 4.05x over hand-written kernels written in triton and TVM respectively on A100 GPUs. Moreover, its interfaces are intuitive and easy to use with existing implementations of MHSA in JAX.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
SPLAT:优化 GPU 代码生成以实现稀疏重组的框架
多头自我注意(MHSA)机制在自然语言处理和视觉任务中实现了最先进的(SOTA)性能。然而,它们对序列长度的二次依赖性制约了推理速度。为了规避这一瓶颈,研究人员提出了各种稀疏-MHSA 模型,即计算全注意力的子集。尽管目前的稀疏库和编译器很有前途,但由于其运行的底层稀疏格式,它们并不支持各种稀疏-MHSA 模式的高性能实现。这些格式通常是为高性能计算和科学计算应用而设计的,要么是经过精心策划的大量随机稀疏性(<1% 非零值),要么是特定的稀疏性模式。然而,稀疏-MHSA 中的稀疏模式是中度稀疏(10%-50% 非零值)和多样的,导致现有的稀疏格式为了性能而牺牲了通用性。我们提出了一种新的稀疏格式:仿射压缩稀疏行(ACSR)和配套的代码生成方案 SPLAT,可以在 GPU 上生成各种稀疏-MHSA 模式的高性能实现,从而弥补了这一差距,实现了通用性和性能的双赢。我们提出的格式和代码生成算法的核心是观察到常见的稀疏-MHSA 图案具有独特的规则几何特性。这些特性可以及时分析,揭示出新颖的优化和平铺策略,SPLAT利用这些策略为各种图案生成高性能的实现。为了证明 SPLAT 的功效,我们用它来生成各种解析-MHSA 模型的代码,在 A100 GPU 上比用 triton 和 TVM 编写的内核分别提高了 2.05 倍和 4.05 倍。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Memory Consistency and Program Transformations No Saved Kaleidosope: an 100% Jitted Neural Network Coding Language with Pythonic Syntax Towards Quantum Multiparty Session Types The Incredible Shrinking Context... in a decompiler near you Scheme Pearl: Quantum Continuations
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1