{"title":"在内核间互联智能处理器上扩展深度学习计算","authors":"Yiqi Liu, Yuqi Xue, Yu Cheng, Lingxiao Ma, Ziming Miao, Jilong Xue, Jian Huang","doi":"arxiv-2408.04808","DOIUrl":null,"url":null,"abstract":"As AI chips incorporate numerous parallelized cores to scale deep learning\n(DL) computing, inter-core communication is enabled recently by employing\nhigh-bandwidth and low-latency interconnect links on the chip (e.g., Graphcore\nIPU). It allows each core to directly access the fast scratchpad memory in\nother cores, which enables new parallel computing paradigms. However, without\nproper support for the scalable inter-core connections in current DL compilers,\nit is hard for developers to exploit the benefits of this new architecture. We present T10, the first DL compiler to exploit the inter-core communication\nbandwidth and distributed on-chip memory on AI chips. To formulate the\ncomputation and communication patterns of tensor operators in this new\narchitecture, T10 introduces a distributed tensor abstraction rTensor. T10 maps\na DNN model to execution plans with a generalized compute-shift pattern, by\npartitioning DNN computation into sub-operators and mapping them to cores, so\nthat the cores can exchange data following predictable patterns. T10 makes\nglobally optimized trade-offs between on-chip memory consumption and inter-core\ncommunication overhead, selects the best execution plan from a vast\noptimization space, and alleviates unnecessary inter-core communications. Our\nevaluation with a real inter-core connected AI chip, the Graphcore IPU, shows\nup to 3.3$\\times$ performance improvement, and scalability support for larger\nmodels, compared to state-of-the-art DL compilers and vendor libraries.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"13 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor\",\"authors\":\"Yiqi Liu, Yuqi Xue, Yu Cheng, Lingxiao Ma, Ziming Miao, Jilong Xue, Jian Huang\",\"doi\":\"arxiv-2408.04808\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"As AI chips incorporate numerous parallelized cores to scale deep learning\\n(DL) computing, inter-core communication is enabled recently by employing\\nhigh-bandwidth and low-latency interconnect links on the chip (e.g., Graphcore\\nIPU). It allows each core to directly access the fast scratchpad memory in\\nother cores, which enables new parallel computing paradigms. However, without\\nproper support for the scalable inter-core connections in current DL compilers,\\nit is hard for developers to exploit the benefits of this new architecture. We present T10, the first DL compiler to exploit the inter-core communication\\nbandwidth and distributed on-chip memory on AI chips. To formulate the\\ncomputation and communication patterns of tensor operators in this new\\narchitecture, T10 introduces a distributed tensor abstraction rTensor. T10 maps\\na DNN model to execution plans with a generalized compute-shift pattern, by\\npartitioning DNN computation into sub-operators and mapping them to cores, so\\nthat the cores can exchange data following predictable patterns. 
T10 makes\\nglobally optimized trade-offs between on-chip memory consumption and inter-core\\ncommunication overhead, selects the best execution plan from a vast\\noptimization space, and alleviates unnecessary inter-core communications. Our\\nevaluation with a real inter-core connected AI chip, the Graphcore IPU, shows\\nup to 3.3$\\\\times$ performance improvement, and scalability support for larger\\nmodels, compared to state-of-the-art DL compilers and vendor libraries.\",\"PeriodicalId\":501422,\"journal\":{\"name\":\"arXiv - CS - Distributed, Parallel, and Cluster Computing\",\"volume\":\"13 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Distributed, Parallel, and Cluster Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2408.04808\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Distributed, Parallel, and Cluster Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.04808","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor
As AI chips incorporate numerous parallelized cores to scale deep learning (DL) computing, inter-core communication has recently been enabled by high-bandwidth, low-latency on-chip interconnect links (e.g., on the Graphcore IPU). These links allow each core to directly access the fast scratchpad memory of other cores, which enables new parallel computing paradigms. However, because current DL compilers lack proper support for these scalable inter-core connections, it is hard for developers to exploit the benefits of this new architecture.

We present T10, the first DL compiler to exploit the inter-core communication bandwidth and distributed on-chip memory of such AI chips. To formulate the computation and communication patterns of tensor operators on this architecture, T10 introduces a distributed tensor abstraction, rTensor. T10 maps a DNN model to execution plans with a generalized compute-shift pattern: it partitions the DNN computation into sub-operators and maps them to cores, so that the cores can exchange data following predictable patterns.
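To make the compute-shift idea concrete, below is a minimal Python/NumPy simulation of one instance of the pattern for a matrix multiplication: each simulated core keeps its row shard of A resident and a column shard of B rotates around a ring of cores, with one partial output block computed per step. The function name, the partitioning scheme, and the ring schedule are illustrative assumptions for this sketch; they are not T10's actual implementation or the IPU programming API.

```python
# Hypothetical simulation of a ring-based "compute-shift" matmul:
# each simulated core holds a row shard of A and (initially) one column
# shard of B, computes a partial block, then shifts its B shard onward.
# Illustrative sketch only, not T10's code or the IPU API.
import numpy as np

def compute_shift_matmul(A, B, num_cores):
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % num_cores == 0 and N % num_cores == 0

    a_shards = np.split(A, num_cores, axis=0)   # stays local to each core
    b_shards = np.split(B, num_cores, axis=1)   # circulates over the ring
    C = np.zeros((M, N))

    for step in range(num_cores):
        for core in range(num_cores):
            # Which column block does this core currently hold?
            blk = (core + step) % num_cores
            rows = slice(core * (M // num_cores), (core + 1) * (M // num_cores))
            cols = slice(blk * (N // num_cores), (blk + 1) * (N // num_cores))
            # Local compute on data already resident in this core's scratchpad.
            C[rows, cols] = a_shards[core] @ b_shards[blk]
        # "Shift": every core forwards its B shard to its ring neighbour
        # (modelled here by rotating the block index once per step).
    return C

A = np.random.rand(8, 4)
B = np.random.rand(4, 8)
assert np.allclose(compute_shift_matmul(A, B, 4), A @ B)
```

After num_cores steps, every core has seen every column shard exactly once, so the result matches a plain matmul while each step only moves one shard per core over the interconnect.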
T10 makes globally optimized trade-offs between on-chip memory consumption and inter-core communication overhead, selects the best execution plan from a vast optimization space, and alleviates unnecessary inter-core communications.
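The memory-versus-communication trade-off can be pictured with a toy plan selector: enumerate candidate partitionings, drop those whose per-core footprint exceeds the scratchpad capacity, and pick the cheapest plan under an estimated communication cost. The Plan fields, the capacity and bandwidth constants, and the cost formula below are hypothetical placeholders, not T10's actual search space or cost model.

```python
# Toy illustration of picking an execution plan under a per-core memory
# constraint while minimizing estimated inter-core communication time.
# All numbers and formulas are made up for illustration.
from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    mem_per_core_bytes: int       # resident + replicated tensor footprint
    comm_bytes_per_step: int      # data shifted between cores each step
    steps: int

SCRATCHPAD_BYTES = 624 * 1024     # e.g., one IPU tile's local SRAM (assumed)
LINK_BYTES_PER_US = 8_000         # assumed effective inter-core bandwidth

def estimated_cost_us(p: Plan) -> float:
    # More replication means less shifting, and vice versa; the best plan
    # balances the two under the scratchpad capacity constraint.
    return p.steps * (p.comm_bytes_per_step / LINK_BYTES_PER_US)

def pick_plan(candidates):
    feasible = [p for p in candidates if p.mem_per_core_bytes <= SCRATCHPAD_BYTES]
    return min(feasible, key=estimated_cost_us)

candidates = [
    Plan("replicate-weights", mem_per_core_bytes=900_000, comm_bytes_per_step=0, steps=1),
    Plan("shard+shift-4way",  mem_per_core_bytes=300_000, comm_bytes_per_step=64_000, steps=4),
    Plan("shard+shift-8way",  mem_per_core_bytes=180_000, comm_bytes_per_step=40_000, steps=8),
]
print(pick_plan(candidates).name)   # -> "shard+shift-4way" with these made-up numbers
```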
Our evaluation on a real inter-core connected AI chip, the Graphcore IPU, shows up to 3.3× performance improvement and scalability to larger models, compared to state-of-the-art DL compilers and vendor libraries.