Context Swap: Multi-PIM System Preventing Remote Memory Access for Large Embedding Model Acceleration

2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS). Pub Date: 2023-06-11. DOI: 10.1109/AICAS57966.2023.10168595

Hong Kal, Cheolhwan Kim, Minjae Kim, W. Ro


Citations: 0

Abstract

Processing-in-Memory (PIM) is an attractive way to accelerate memory-intensive neural network layers. It is especially efficient for layers that use embeddings, such as the embedding layer and the graph convolution layer, because these layers have large memory footprints and low arithmetic intensity. Their embedding tables are stored across multiple memory nodes and processed by local PIM modules with sparse access patterns. When a local PIM module must compute on data held in other memory nodes, a naive approach is to let it retrieve that data from the remote nodes. This can degrade performance significantly because remote accesses have long latency. To avoid remote access, a PIM system can adopt a framework based on the MapReduce programming model, in which PIMs compute only on local data and CPUs combine the PIMs' intermediate results. However, such a multi-PIM system still suffers from performance degradation, because the framework runs on the CPU and its delay is long compared with PIM kernel execution. We therefore propose a context swap technique that prevents remote data access without a high-latency framework. We observe that transferring a PIM context to a remote PIM node requires far less data traffic than accessing the remote data. In our system, PIM nodes swap their context data with each other once they have finished their own computation and have no more local data to process. Several context swaps occur before all PIMs have processed all local data. A context swap is performed by the memory controller when the PIMs are in the same CPU socket and by simple software when they are in different CPU sockets. As a result, the proposed multi-PIM system outperforms the baseline PIM system that transfers remote data and the PIM system with the kernel-managing framework by 4.1× and 3.3×, respectively.
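The traffic argument in the abstract can be illustrated with a small simulation. The following is only a sketch under stated assumptions, not the authors' implementation: the abstract does not specify the context format or the swap schedule, so the `Context` layout (a partial sum plus the pending row indices), the ring-shaped rotation of contexts, the sum-pooling kernel, and every name and constant below are hypothetical. The sketch counts words moved between nodes only; it does not model latency, so it is not expected to reproduce the reported speedups.

```python
from dataclasses import dataclass

DIM = 64  # embedding width in words (illustrative value, not from the paper)


@dataclass
class Context:
    """Hypothetical PIM context: a partial sum plus the lookups still pending."""
    acc: list      # running sum-pooled embedding (DIM words)
    pending: dict  # node id -> local row indices still to be accumulated

    def size_words(self):
        # Words moved when this context is swapped: accumulator + pending indices.
        return len(self.acc) + sum(len(rows) for rows in self.pending.values())


def make_shards(num_nodes, rows_per_node):
    # shards[n][r] is the embedding vector stored in local row r of node n.
    return [[[float(n * rows_per_node + r)] * DIM for r in range(rows_per_node)]
            for n in range(num_nodes)]


def new_contexts(queries):
    # One context per node; queries[c] maps node id -> rows wanted by context c.
    return [Context([0.0] * DIM, {n: list(rows) for n, rows in q.items()})
            for q in queries]


def accumulate(ctx, shard, rows):
    # Sum-pool the requested rows of one shard into the context's accumulator.
    for row in rows:
        ctx.acc = [a + v for a, v in zip(ctx.acc, shard[row])]


def run_remote_access(shards, queries):
    """Baseline: context c stays on node c and fetches every remote row."""
    contexts = new_contexts(queries)
    words = 0
    for home, ctx in enumerate(contexts):
        for node, rows in ctx.pending.items():
            accumulate(ctx, shards[node], rows)
            if node != home:
                words += len(rows) * (DIM + 1)  # one index out, one vector back
    return [ctx.acc for ctx in contexts], words


def run_context_swap(shards, queries):
    """Contexts rotate around a ring (an assumed schedule); data never moves."""
    num_nodes = len(shards)
    contexts = new_contexts(queries)  # assumes len(queries) == num_nodes
    location = list(range(num_nodes))  # location[c] = node hosting context c
    words = 0
    for _ in range(num_nodes):
        for c, ctx in enumerate(contexts):
            node = location[c]
            accumulate(ctx, shards[node], ctx.pending.pop(node, []))
        if not any(ctx.pending for ctx in contexts):
            break  # every context has consumed all of its lookups
        words += sum(ctx.size_words() for ctx in contexts)  # swap all contexts
        location = [(n + 1) % num_nodes for n in location]
    # Send each finished accumulator back to its home node if it ended elsewhere.
    words += sum(DIM for c, node in enumerate(location) if node != c)
    return [ctx.acc for ctx in contexts], words


shards = make_shards(num_nodes=4, rows_per_node=8)
queries = [{n: list(range(8)) for n in range(4)} for _ in range(4)]
remote_result, remote_words = run_remote_access(shards, queries)
swap_result, swap_words = run_context_swap(shards, queries)
print("remote access words:", remote_words, "context swap words:", swap_words)
```

Both functions produce the same pooled vectors. Under these assumptions the swap variant moves one accumulator and a shrinking index list per hop, whereas the baseline moves one full vector per remote lookup, so the gap widens as the number of lookups per node grows.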


