{"title":"XLA-NDP:用于近数据处理存储器上的深度学习模型训练的高效调度和代码生成","authors":"Jueon Park;Hyojin Sung","doi":"10.1109/LCA.2023.3261136","DOIUrl":null,"url":null,"abstract":"Deep learning (DL) model training must address the memory bottleneck to continue scaling. Processing-in-memory approaches can be a viable solution as they move computations near or into the memory, reducing substantial data movement. However, to deploy applications on such hardware, end-to-end software support is crucial for efficient computation mapping and scheduling as well as extensible code generation, but no consideration has been made for DL training workloads. In this paper, we propose XLA-NDP, a compiler and runtime solution for NDPX, a near-data processing (NDP) architecture integrated with an existing DL training framework. XLA-NDP offloads NDPX kernels and schedules them to overlap with GPU kernels to maximize parallelism based on GPU and NDPX costs, while providing a template-based code generator with low-level optimizations. The experiments showed that XLA-NDP provides up to a 41% speedup (24% on average) over the GPU baseline for four DL model training.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 1","pages":"61-64"},"PeriodicalIF":1.4000,"publicationDate":"2023-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"XLA-NDP: Efficient Scheduling and Code Generation for Deep Learning Model Training on Near-Data Processing Memory\",\"authors\":\"Jueon Park;Hyojin Sung\",\"doi\":\"10.1109/LCA.2023.3261136\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Deep learning (DL) model training must address the memory bottleneck to continue scaling. Processing-in-memory approaches can be a viable solution as they move computations near or into the memory, reducing substantial data movement. However, to deploy applications on such hardware, end-to-end software support is crucial for efficient computation mapping and scheduling as well as extensible code generation, but no consideration has been made for DL training workloads. In this paper, we propose XLA-NDP, a compiler and runtime solution for NDPX, a near-data processing (NDP) architecture integrated with an existing DL training framework. XLA-NDP offloads NDPX kernels and schedules them to overlap with GPU kernels to maximize parallelism based on GPU and NDPX costs, while providing a template-based code generator with low-level optimizations. 
The experiments showed that XLA-NDP provides up to a 41% speedup (24% on average) over the GPU baseline for four DL model training.\",\"PeriodicalId\":51248,\"journal\":{\"name\":\"IEEE Computer Architecture Letters\",\"volume\":\"22 1\",\"pages\":\"61-64\"},\"PeriodicalIF\":1.4000,\"publicationDate\":\"2023-03-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Computer Architecture Letters\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10079098/\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Computer Architecture Letters","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10079098/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Abstract:
Deep learning (DL) model training must address the memory bottleneck to continue scaling. Processing-in-memory approaches can be a viable solution because they move computation near or into memory, substantially reducing data movement. However, deploying applications on such hardware requires end-to-end software support for efficient computation mapping and scheduling as well as extensible code generation, and existing support does not consider DL training workloads. In this paper, we propose XLA-NDP, a compiler and runtime solution for NDPX, a near-data processing (NDP) architecture integrated with an existing DL training framework. XLA-NDP offloads NDPX kernels and schedules them to overlap with GPU kernels, maximizing parallelism based on GPU and NDPX cost estimates, while providing a template-based code generator with low-level optimizations. The experiments showed that XLA-NDP provides up to a 41% speedup (24% on average) over the GPU baseline when training four DL models.
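The abstract describes scheduling offloaded NDPX kernels to overlap with GPU kernels based on estimated GPU and NDPX costs. The sketch below is only a rough illustration of that general idea: a greedy, cost-based placement of kernels onto whichever device keeps the critical path shorter. The Kernel fields, the schedule function, and the heuristic are hypothetical assumptions for illustration and are not taken from the paper.

```python
# Hypothetical sketch of cost-based kernel placement for GPU/NDP overlap.
# All names, fields, and the greedy heuristic are illustrative assumptions,
# not XLA-NDP's actual scheduling algorithm.
from dataclasses import dataclass
from typing import List

@dataclass
class Kernel:
    name: str
    gpu_cost: float    # estimated execution time on the GPU (ms)
    ndp_cost: float    # estimated execution time on the NDP units (ms)
    offloadable: bool  # whether an NDP code template exists for this kernel

def schedule(kernels: List[Kernel]) -> List[str]:
    """Greedily assign each offloadable kernel to the device whose queue
    would finish earlier, so NDP work overlaps with GPU work."""
    gpu_time, ndp_time = 0.0, 0.0
    placement = []
    for k in kernels:
        # Offload only if an NDP version exists and offloading does not
        # extend the schedule beyond running the kernel on the GPU.
        if k.offloadable and ndp_time + k.ndp_cost <= gpu_time + k.gpu_cost:
            ndp_time += k.ndp_cost
            placement.append(f"{k.name} -> NDP")
        else:
            gpu_time += k.gpu_cost
            placement.append(f"{k.name} -> GPU")
    return placement

if __name__ == "__main__":
    demo = [
        Kernel("embedding_grad", gpu_cost=3.0, ndp_cost=2.5, offloadable=True),
        Kernel("matmul_fwd", gpu_cost=5.0, ndp_cost=12.0, offloadable=False),
        Kernel("optimizer_step", gpu_cost=2.0, ndp_cost=1.5, offloadable=True),
    ]
    for line in schedule(demo):
        print(line)
```

In this toy example, memory-bound kernels with low NDP cost land on the NDP units while compute-bound kernels stay on the GPU, so the two devices run concurrently; a real scheduler would also account for data-transfer and synchronization overheads.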
Journal Introduction:
IEEE Computer Architecture Letters is a rigorously peer-reviewed forum for publishing early, high-impact results in the areas of uni- and multiprocessor computer systems, computer architecture, microarchitecture, workload characterization, performance evaluation and simulation techniques, and power-aware computing. Submissions are welcomed on any topic in computer architecture, especially but not limited to: microprocessor and multiprocessor systems, microarchitecture and ILP processors, workload characterization, performance evaluation and simulation techniques, compiler-hardware and operating system-hardware interactions, interconnect architectures, memory and cache systems, power and thermal issues at the architecture level, I/O architectures and techniques, independent validation of previously published results, analysis of unsuccessful techniques, domain-specific processor architectures (e.g., embedded, graphics, network, etc.), real-time and high-availability architectures, reconfigurable systems.