Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor

Yiqi Liu, Yuqi Xue, Yu Cheng, Lingxiao Ma, Ziming Miao, Jilong Xue, Jian Huang
{"title":"Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor","authors":"Yiqi Liu, Yuqi Xue, Yu Cheng, Lingxiao Ma, Ziming Miao, Jilong Xue, Jian Huang","doi":"arxiv-2408.04808","DOIUrl":null,"url":null,"abstract":"As AI chips incorporate numerous parallelized cores to scale deep learning\n(DL) computing, inter-core communication is enabled recently by employing\nhigh-bandwidth and low-latency interconnect links on the chip (e.g., Graphcore\nIPU). It allows each core to directly access the fast scratchpad memory in\nother cores, which enables new parallel computing paradigms. However, without\nproper support for the scalable inter-core connections in current DL compilers,\nit is hard for developers to exploit the benefits of this new architecture. We present T10, the first DL compiler to exploit the inter-core communication\nbandwidth and distributed on-chip memory on AI chips. To formulate the\ncomputation and communication patterns of tensor operators in this new\narchitecture, T10 introduces a distributed tensor abstraction rTensor. T10 maps\na DNN model to execution plans with a generalized compute-shift pattern, by\npartitioning DNN computation into sub-operators and mapping them to cores, so\nthat the cores can exchange data following predictable patterns. T10 makes\nglobally optimized trade-offs between on-chip memory consumption and inter-core\ncommunication overhead, selects the best execution plan from a vast\noptimization space, and alleviates unnecessary inter-core communications. Our\nevaluation with a real inter-core connected AI chip, the Graphcore IPU, shows\nup to 3.3$\\times$ performance improvement, and scalability support for larger\nmodels, compared to state-of-the-art DL compilers and vendor libraries.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"13 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Distributed, Parallel, and Cluster Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.04808","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

As AI chips incorporate numerous parallelized cores to scale deep learning (DL) computing, inter-core communication has recently been enabled by high-bandwidth, low-latency interconnect links on the chip (e.g., the Graphcore IPU). This allows each core to directly access the fast scratchpad memory of other cores, enabling new parallel computing paradigms. However, without proper support for these scalable inter-core connections in current DL compilers, it is hard for developers to exploit the benefits of this new architecture. We present T10, the first DL compiler to exploit the inter-core communication bandwidth and distributed on-chip memory on AI chips. To formulate the computation and communication patterns of tensor operators in this new architecture, T10 introduces a distributed tensor abstraction, rTensor. T10 maps a DNN model to execution plans with a generalized compute-shift pattern, partitioning DNN computation into sub-operators and mapping them to cores so that the cores can exchange data following predictable patterns. T10 makes globally optimized trade-offs between on-chip memory consumption and inter-core communication overhead, selects the best execution plan from a vast optimization space, and alleviates unnecessary inter-core communications. Our evaluation on a real inter-core connected AI chip, the Graphcore IPU, shows up to 3.3× performance improvement and scalability to larger models, compared to state-of-the-art DL compilers and vendor libraries.
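To make the compute-shift idea concrete, the following is a minimal Python/NumPy sketch that simulates such a pattern for a tiled matrix multiplication on a ring of virtual cores: each core keeps a resident row block of one operand, holds only one column block of the other operand at a time, and the circulating blocks rotate to the neighboring core after each compute step. This is an illustrative assumption of how a compute-shift schedule can work, not T10's actual implementation or API; the names (compute_shift_matmul, num_cores) are hypothetical.

```python
import numpy as np

def compute_shift_matmul(A, B, num_cores):
    """Simulate C = A @ B on a ring of virtual cores.

    Each core keeps a fixed row block of A resident in its local scratchpad
    and holds one column block of B at a time; after every compute step the
    B blocks shift to the next core, so each core stores only 1/num_cores
    of B at any moment.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % num_cores == 0 and N % num_cores == 0

    A_blocks = np.split(A, num_cores, axis=0)   # resident row blocks, one per core
    B_blocks = np.split(B, num_cores, axis=1)   # circulating column blocks
    C = np.zeros((M, N))
    rows, cols = M // num_cores, N // num_cores

    for step in range(num_cores):
        for core in range(num_cores):
            # Which B block this core holds at this step (ring rotation is
            # modeled by the index shift rather than an explicit data copy).
            j = (core + step) % num_cores
            C[core*rows:(core+1)*rows, j*cols:(j+1)*cols] = A_blocks[core] @ B_blocks[j]
    return C

if __name__ == "__main__":
    A = np.random.rand(8, 6)
    B = np.random.rand(6, 4)
    assert np.allclose(compute_shift_matmul(A, B, 4), A @ B)
```

In this sketch each core only ever holds one circulating block, which illustrates the kind of on-chip memory versus inter-core communication trade-off the abstract refers to: holding more blocks per core would cut the number of shift steps at the cost of scratchpad capacity.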