{"title":"在内核间互联智能处理器上扩展深度学习计算","authors":"Yiqi Liu, Yuqi Xue, Yu Cheng, Lingxiao Ma, Ziming Miao, Jilong Xue, Jian Huang","doi":"arxiv-2408.04808","DOIUrl":null,"url":null,"abstract":"As AI chips incorporate numerous parallelized cores to scale deep learning\n(DL) computing, inter-core communication is enabled recently by employing\nhigh-bandwidth and low-latency interconnect links on the chip (e.g., Graphcore\nIPU). It allows each core to directly access the fast scratchpad memory in\nother cores, which enables new parallel computing paradigms. However, without\nproper support for the scalable inter-core connections in current DL compilers,\nit is hard for developers to exploit the benefits of this new architecture. We present T10, the first DL compiler to exploit the inter-core communication\nbandwidth and distributed on-chip memory on AI chips. To formulate the\ncomputation and communication patterns of tensor operators in this new\narchitecture, T10 introduces a distributed tensor abstraction rTensor. T10 maps\na DNN model to execution plans with a generalized compute-shift pattern, by\npartitioning DNN computation into sub-operators and mapping them to cores, so\nthat the cores can exchange data following predictable patterns. T10 makes\nglobally optimized trade-offs between on-chip memory consumption and inter-core\ncommunication overhead, selects the best execution plan from a vast\noptimization space, and alleviates unnecessary inter-core communications. Our\nevaluation with a real inter-core connected AI chip, the Graphcore IPU, shows\nup to 3.3$\\times$ performance improvement, and scalability support for larger\nmodels, compared to state-of-the-art DL compilers and vendor libraries.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"13 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor\",\"authors\":\"Yiqi Liu, Yuqi Xue, Yu Cheng, Lingxiao Ma, Ziming Miao, Jilong Xue, Jian Huang\",\"doi\":\"arxiv-2408.04808\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"As AI chips incorporate numerous parallelized cores to scale deep learning\\n(DL) computing, inter-core communication is enabled recently by employing\\nhigh-bandwidth and low-latency interconnect links on the chip (e.g., Graphcore\\nIPU). It allows each core to directly access the fast scratchpad memory in\\nother cores, which enables new parallel computing paradigms. However, without\\nproper support for the scalable inter-core connections in current DL compilers,\\nit is hard for developers to exploit the benefits of this new architecture. We present T10, the first DL compiler to exploit the inter-core communication\\nbandwidth and distributed on-chip memory on AI chips. To formulate the\\ncomputation and communication patterns of tensor operators in this new\\narchitecture, T10 introduces a distributed tensor abstraction rTensor. T10 maps\\na DNN model to execution plans with a generalized compute-shift pattern, by\\npartitioning DNN computation into sub-operators and mapping them to cores, so\\nthat the cores can exchange data following predictable patterns. 
T10 makes\\nglobally optimized trade-offs between on-chip memory consumption and inter-core\\ncommunication overhead, selects the best execution plan from a vast\\noptimization space, and alleviates unnecessary inter-core communications. Our\\nevaluation with a real inter-core connected AI chip, the Graphcore IPU, shows\\nup to 3.3$\\\\times$ performance improvement, and scalability support for larger\\nmodels, compared to state-of-the-art DL compilers and vendor libraries.\",\"PeriodicalId\":501422,\"journal\":{\"name\":\"arXiv - CS - Distributed, Parallel, and Cluster Computing\",\"volume\":\"13 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Distributed, Parallel, and Cluster Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2408.04808\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Distributed, Parallel, and Cluster Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.04808","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor
As AI chips incorporate numerous parallelized cores to scale deep learning (DL) computing, inter-core communication has recently been enabled by high-bandwidth, low-latency on-chip interconnect links (e.g., on the Graphcore IPU). These links allow each core to directly access the fast scratchpad memory of other cores, which enables new parallel computing paradigms. However, because current DL compilers lack proper support for these scalable inter-core connections, it is hard for developers to exploit the benefits of this new architecture.

We present T10, the first DL compiler to exploit the inter-core communication bandwidth and distributed on-chip memory of such AI chips. To formulate the computation and communication patterns of tensor operators on this architecture, T10 introduces a distributed tensor abstraction, rTensor. T10 maps a DNN model to execution plans with a generalized compute-shift pattern: it partitions the DNN computation into sub-operators and maps them to cores, so that the cores can exchange data following predictable patterns.
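To make the compute-shift idea concrete, below is a minimal Python/NumPy simulation of one instance of the pattern for a matrix multiplication: each simulated core keeps its row shard of A resident and a column shard of B rotates around a ring of cores, with one partial output block computed per step. The function name, the partitioning scheme, and the ring schedule are illustrative assumptions for this sketch; they are not T10's actual implementation or the IPU programming API.

```python
# Hypothetical simulation of a ring-based "compute-shift" matmul:
# each simulated core holds a row shard of A and (initially) one column
# shard of B, computes a partial block, then shifts its B shard onward.
# Illustrative sketch only, not T10's code or the IPU API.
import numpy as np

def compute_shift_matmul(A, B, num_cores):
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % num_cores == 0 and N % num_cores == 0

    a_shards = np.split(A, num_cores, axis=0)   # stays local to each core
    b_shards = np.split(B, num_cores, axis=1)   # circulates over the ring
    C = np.zeros((M, N))

    for step in range(num_cores):
        for core in range(num_cores):
            # Which column block does this core currently hold?
            blk = (core + step) % num_cores
            rows = slice(core * (M // num_cores), (core + 1) * (M // num_cores))
            cols = slice(blk * (N // num_cores), (blk + 1) * (N // num_cores))
            # Local compute on data already resident in this core's scratchpad.
            C[rows, cols] = a_shards[core] @ b_shards[blk]
        # "Shift": every core forwards its B shard to its ring neighbour
        # (modelled here by rotating the block index once per step).
    return C

A = np.random.rand(8, 4)
B = np.random.rand(4, 8)
assert np.allclose(compute_shift_matmul(A, B, 4), A @ B)
```

After num_cores steps, every core has seen every column shard exactly once, so the result matches a plain matmul while each step only moves one shard per core over the interconnect.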
T10 makes globally optimized trade-offs between on-chip memory consumption and inter-core communication overhead, selects the best execution plan from a vast optimization space, and alleviates unnecessary inter-core communications.
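The memory-versus-communication trade-off can be pictured with a toy plan selector: enumerate candidate partitionings, drop those whose per-core footprint exceeds the scratchpad capacity, and pick the cheapest plan under an estimated communication cost. The Plan fields, the capacity and bandwidth constants, and the cost formula below are hypothetical placeholders, not T10's actual search space or cost model.

```python
# Toy illustration of picking an execution plan under a per-core memory
# constraint while minimizing estimated inter-core communication time.
# All numbers and formulas are made up for illustration.
from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    mem_per_core_bytes: int       # resident + replicated tensor footprint
    comm_bytes_per_step: int      # data shifted between cores each step
    steps: int

SCRATCHPAD_BYTES = 624 * 1024     # e.g., one IPU tile's local SRAM (assumed)
LINK_BYTES_PER_US = 8_000         # assumed effective inter-core bandwidth

def estimated_cost_us(p: Plan) -> float:
    # More replication means less shifting, and vice versa; the best plan
    # balances the two under the scratchpad capacity constraint.
    return p.steps * (p.comm_bytes_per_step / LINK_BYTES_PER_US)

def pick_plan(candidates):
    feasible = [p for p in candidates if p.mem_per_core_bytes <= SCRATCHPAD_BYTES]
    return min(feasible, key=estimated_cost_us)

candidates = [
    Plan("replicate-weights", mem_per_core_bytes=900_000, comm_bytes_per_step=0, steps=1),
    Plan("shard+shift-4way",  mem_per_core_bytes=300_000, comm_bytes_per_step=64_000, steps=4),
    Plan("shard+shift-8way",  mem_per_core_bytes=180_000, comm_bytes_per_step=40_000, steps=8),
]
print(pick_plan(candidates).name)   # -> "shard+shift-4way" with these made-up numbers
```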
Our evaluation on a real inter-core connected AI chip, the Graphcore IPU, shows up to 3.3× performance improvement and scalability to larger models, compared to state-of-the-art DL compilers and vendor libraries.