Context Swap: Multi-PIM System Preventing Remote Memory Access for Large Embedding Model Acceleration
Hong Kal, Cheolhwan Kim, Minjae Kim, W. Ro
2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS), 2023-06-11
DOI: 10.1109/AICAS57966.2023.10168595 (https://doi.org/10.1109/AICAS57966.2023.10168595)
Citation count: 0
Abstract
Processing-in-Memory (PIM) is an attractive way to accelerate memory-intensive neural network layers. It is especially efficient for layers that use embeddings, such as embedding layers and graph convolution layers, because these layers need large memory capacity and have low arithmetic intensity. The embedding tables of such layers are stored across multiple memory nodes and processed by the local PIM modules with sparse access patterns. When a local PIM module must compute on data held in other memory nodes, a naive approach is to let it retrieve that data from the remote nodes. This approach can incur significant performance degradation because of the long latency of remote accesses. To avoid remote accesses, a PIM system can adopt a framework based on the MapReduce programming model, in which PIMs compute only on local data and CPUs combine the PIMs' intermediate results. However, such a multi-PIM system still suffers performance degradation, because the framework runs on the CPU and has a long delay relative to PIM kernel execution. We therefore propose a context swap technique that prevents remote data accesses without requiring a high-latency framework. We observe that transferring a PIM context to a remote PIM node requires far less data traffic than accessing the remote data. In our system, PIM nodes swap their context data with each other once they complete their own computation and have no more local data to compute. Context swaps repeat until the PIMs have processed all of the local data. A context swap is performed by the memory controller between PIMs in the same CPU socket and by simple software between PIMs in different CPU sockets. As a result, the proposed multi-PIM system outperforms the baseline PIM system that transfers remote data and the PIM system with the kernel-managing framework by 4.1× and 3.3×, respectively.
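The traffic argument can be illustrated with a toy model. The sketch below is not the authors' implementation; it is a minimal simulation under several assumptions that the abstract does not state: each PIM node sum-pools the embedding rows named by its own index list, a context consists of that index list plus a partial-sum vector, contexts move around the nodes in a simple ring, and traffic is counted in vector elements rather than bytes. All function names (`make_shards`, `make_queries`, `remote_access`, `context_swap`) are hypothetical.

```python
import random


def make_shards(num_nodes, rows_per_node, dim, seed=0):
    """Build one embedding-table shard per memory node (small integer values)."""
    rng = random.Random(seed)
    return [[[rng.randint(-3, 3) for _ in range(dim)]
             for _ in range(rows_per_node)]
            for _ in range(num_nodes)]


def make_queries(num_nodes, rows_per_node, length):
    """Deterministic sparse index lists; global row g lives on node g // rows_per_node."""
    total = num_nodes * rows_per_node
    return [[(i * 7 + k * 5) % total for k in range(length)]
            for i in range(num_nodes)]


def add_vec(a, b):
    return [x + y for x, y in zip(a, b)]


def remote_access(shards, queries):
    """Baseline: every node fetches each remote row it needs (dim elements per row)."""
    rows_per_node, dim = len(shards[0]), len(shards[0][0])
    results, traffic = [], 0
    for node, query in enumerate(queries):
        acc = [0] * dim
        for g in query:
            home, local = divmod(g, rows_per_node)
            if home != node:
                traffic += dim  # one full embedding row crosses nodes
            acc = add_vec(acc, shards[home][local])
        results.append(acc)
    return results, traffic


def context_swap(shards, queries):
    """Assumed ring schedule: rows stay put, contexts (indices + partial sum) move."""
    n, rows_per_node, dim = len(shards), len(shards[0]), len(shards[0][0])
    contexts = [{"owner": i, "indices": list(q), "acc": [0] * dim}
                for i, q in enumerate(queries)]
    traffic = 0
    for step in range(n):
        # Each node accumulates only the rows it stores for the context it holds.
        for node, ctx in enumerate(contexts):
            for g in ctx["indices"]:
                home, local = divmod(g, rows_per_node)
                if home == node:
                    ctx["acc"] = add_vec(ctx["acc"], shards[node][local])
        if step < n - 1:
            # Node i hands its context to node i + 1 (wrapping around).
            contexts = [contexts[-1]] + contexts[:-1]
            traffic += sum(len(c["indices"]) + dim for c in contexts)
    results = [None] * n
    for ctx in contexts:
        results[ctx["owner"]] = ctx["acc"]
    return results, traffic


if __name__ == "__main__":
    shards = make_shards(num_nodes=4, rows_per_node=16, dim=64)
    queries = make_queries(num_nodes=4, rows_per_node=16, length=32)
    base, remote_traffic = remote_access(shards, queries)
    swapped, swap_traffic = context_swap(shards, queries)
    print("same results:", base == swapped)
    print("remote-access traffic (elements):", remote_traffic)
    print("context-swap traffic (elements):", swap_traffic)
```

In this toy setting both strategies produce identical pooled vectors, while the context-swap variant moves only index lists and partial sums instead of embedding rows. The traffic gap depends entirely on the chosen parameters (row width, indices per query, number of nodes) and says nothing about the latency effects behind the speedups reported in the abstract.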