Attention in SRAM on Tenstorrent Grayskull

Moritz Thüning
{"title":"Attention in SRAM on Tenstorrent Grayskull","authors":"Moritz Thüning","doi":"arxiv-2407.13885","DOIUrl":null,"url":null,"abstract":"When implementations of the Transformer's self-attention layer utilize SRAM\ninstead of DRAM, they can achieve significant speedups. The Tenstorrent\nGrayskull architecture provides a large SRAM, distributed across a grid of\ncores. This work presents a fused kernel for Grayskull, that exclusively\nutilizes its large SRAM by combining matrix multiplication, attention score\nscaling and Softmax operations. Additionally, a dedicated Softmax kernel\nutilizing the SRAM and a CPU implementation serving as a baseline are\npresented. The Softmax operation consumes most of the runtime in the\ncomputation of attention weights from queries and keys on Grayskull. The\nspeedup of the dedicated Softmax kernel compared to the CPU implementation is\nup to $10 \\times$, and the Softmax implementation inside the fused kernel is\napproximately $1.8 \\times$ faster than the dedicated Softmax kernel. The time\nand memory complexity of all implementations is quadratic in sequence length.\nCurrently, the Grayskull e150 is approximately $30 \\times$ cheaper for the\ngeneral public than an Nvidia H100 PCIe (a state-of-the-art GPU) and offers\napproximately $1.5 \\times$ more SRAM.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"2 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Performance","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2407.13885","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

When implementations of the Transformer's self-attention layer utilize SRAM instead of DRAM, they can achieve significant speedups. The Tenstorrent Grayskull architecture provides a large SRAM, distributed across a grid of cores. This work presents a fused kernel for Grayskull, that exclusively utilizes its large SRAM by combining matrix multiplication, attention score scaling and Softmax operations. Additionally, a dedicated Softmax kernel utilizing the SRAM and a CPU implementation serving as a baseline are presented. The Softmax operation consumes most of the runtime in the computation of attention weights from queries and keys on Grayskull. The speedup of the dedicated Softmax kernel compared to the CPU implementation is up to $10 \times$, and the Softmax implementation inside the fused kernel is approximately $1.8 \times$ faster than the dedicated Softmax kernel. The time and memory complexity of all implementations is quadratic in sequence length. Currently, the Grayskull e150 is approximately $30 \times$ cheaper for the general public than an Nvidia H100 PCIe (a state-of-the-art GPU) and offers approximately $1.5 \times$ more SRAM.
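For reference, the computation the kernels target (attention weights from queries and keys) is a matrix multiplication, a scaling of the scores by the inverse square root of the head dimension, and a row-wise Softmax. The following is a minimal NumPy sketch of that computation, comparable in role to the CPU baseline; it is not the Grayskull kernel itself, and the names, shapes, and sizes used are illustrative assumptions.

```python
# Minimal NumPy sketch of the fused computation: matrix multiplication,
# attention score scaling, and a row-wise Softmax. Illustrative CPU
# reference only, not the Grayskull implementation.
import numpy as np

def attention_weights(Q: np.ndarray, K: np.ndarray) -> np.ndarray:
    d = Q.shape[-1]
    # Attention scores form an (N x N) matrix, hence time and memory
    # quadratic in the sequence length N.
    scores = (Q @ K.T) / np.sqrt(d)
    # Numerically stable row-wise Softmax.
    scores -= scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

# Example usage with assumed sizes (sequence length N, head dimension d).
N, d = 1024, 64
rng = np.random.default_rng(0)
Q = rng.standard_normal((N, d)).astype(np.float32)
K = rng.standard_normal((N, d)).astype(np.float32)
A = attention_weights(Q, K)  # shape (N, N); each row sums to 1
assert np.allclose(A.sum(axis=-1), 1.0, atol=1e-5)
```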