At the Locus of Performance: Quantifying the Effects of Copious 3D-Stacked Cache on HPC Workloads
Jens Domke, Emil Vatai, Balazs Gerofi, Yuetsu Kodama, Mohamed Wahib, Artur Podobas, Sparsh Mittal, Miquel Pericàs, Lingqi Zhang, Peng Chen, Aleksandr Drozd, Satoshi Matsuoka
Over the last three decades, innovations in the memory subsystem have primarily targeted overcoming the data movement bottleneck. In this paper, we focus on a specific market trend in memory technology: 3D-stacked memory and caches. We investigate the impact of extending the on-chip memory capabilities of future HPC-focused processors, particularly via 3D-stacked SRAM. First, we propose a method, oblivious to the memory subsystem, to gauge the upper bound on performance improvements when data movement costs are eliminated. Then, using the gem5 simulator, we model two variants of a hypothetical LARge Cache processor (LARC), fabricated in 1.5 nm and enriched with high-capacity 3D-stacked cache. Through a large volume of experiments involving a broad set of proxy applications and benchmarks, we aim to reveal how HPC CPU performance will evolve, and we conclude an average per-chip boost of 9.56x for cache-sensitive HPC applications. Additionally, we exhaustively document our methodological exploration to motivate HPC centers to drive their own technological agenda through enhanced co-design.
{"title":"At the Locus of Performance: Quantifying the Effects of Copious 3D-Stacked Cache on HPC Workloads","authors":"Jens Domke, Emil Vatai, Balazs Gerofi, Yuetsu Kodama, Mohamed Wahib, Artur Podobas, Sparsh Mittal, Miquel Pericàs, Lingqi Zhang, Peng Chen, Aleksandr Drozd, Satoshi Matsuoka","doi":"10.1145/3629520","DOIUrl":"https://doi.org/10.1145/3629520","url":null,"abstract":"Over the last three decades, innovations in the memory subsystem were primarily targeted at overcoming the data movement bottleneck. In this paper, we focus on a specific market trend in memory technology: 3D-stacked memory and caches. We investigate the impact of extending the on-chip memory capabilities in future HPC-focused processors, particularly by 3D-stacked SRAM. First, we propose a method oblivious to the memory subsystem to gauge the upper-bound in performance improvements when data movement costs are eliminated. Then, using the gem5 simulator, we model two variants of a hypothetical LARge Cache processor (LARC), fabricated in 1.5 nm and enriched with high-capacity 3D-stacked cache. With a volume of experiments involving a broad set of proxy-applications and benchmarks, we aim to reveal how HPC CPU performance will evolve, and conclude an average boost of 9.56x for cache-sensitive HPC applications, on a per-chip basis. Additionally, we exhaustively document our methodological exploration to motivate HPC centers to drive their own technological agenda through enhanced co-design.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"65 sp1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135219250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FlowPix: Accelerating Image Processing Pipelines on an FPGA Overlay using a Domain Specific Compiler
Ziaul Choudhury, Anish Gulati, Suresh Purini
The exponential performance growth promised by Moore's law has started to taper in recent years. At the same time, emerging applications like image processing demand heavy computational performance. These factors inevitably lead to the emergence of domain-specific accelerators (DSAs) to fill the performance void left by conventional architectures. FPGAs are rapidly evolving into an alternative to custom ASICs for designing DSAs because of their low power consumption and high degree of parallelism. DSA design on FPGAs requires careful calibration of the FPGA compute and memory resources to achieve optimal throughput. Hardware Description Languages (HDLs) like Verilog have traditionally been used to design FPGA hardware. HDLs are not geared towards any domain, and the user has to put in much effort to describe the hardware at the register-transfer level. Domain-Specific Languages (DSLs) and compilers have recently been used to weave together handwritten HDL templates targeting a specific domain. Recent efforts have designed DSAs with image-processing DSLs targeting FPGAs. Image computations in the DSL are lowered to pre-existing templates or lower-level languages like HLS-C. This approach requires expensive FPGA re-flashing for every new workload. In contrast to this fixed-function hardware approach, overlays are gaining traction. Overlays are DSAs resembling a processor: they are synthesized and flashed onto the FPGA once but are flexible enough to process a broad class of computations through soft reconfiguration. Less work has been reported in the context of image-processing overlays. Image-processing algorithms vary in size and shape, ranging from simple blurring operations to complex pyramid systems. The primary challenge in designing an image-processing overlay is maintaining flexibility in mapping different algorithms. This paper proposes a DSL-based overlay accelerator called FlowPix for image-processing applications. The DSL programs are expressed as pipelines, with each stage representing a computational step in the overall algorithm. We implement 15 image-processing benchmarks using FlowPix on a Virtex-7-690t FPGA. The benchmarks range from simple blur operations to complex pipelines like Lucas-Kanade optical flow. We compare FlowPix against existing DSL-to-FPGA frameworks like Hetero-Halide and the Vitis Vision library that generate fixed-function hardware. On most benchmarks, we see up to 25% degradation in latency with approximately a 1.7x to 2x increase in FPGA LUT consumption. Our ability to execute any benchmark without incurring the high costs of hardware synthesis, place-and-route, and FPGA re-flashing justifies the slight performance loss and increased resource consumption. FlowPix achieves an average frame rate of 170 FPS on HD frames of 1920x1080 pixels across the implemented benchmarks.
{"title":"FlowPix: Accelerating Image Processing Pipelines on an FPGA Overlay using a Domain Specific Compiler","authors":"Ziaul Choudhury, Anish Gulati, Suresh Purini","doi":"10.1145/3629523","DOIUrl":"https://doi.org/10.1145/3629523","url":null,"abstract":"The exponential performance growth guaranteed by Moore’s law has started to taper in recent years. At the same time, emerging applications like image processing demand heavy computational performance. These factors inevitably lead to the emergence of domain-specific accelerators (DSA) to fill the performance void left by conventional architectures. FPGAs are rapidly evolving towards becoming an alternative to custom ASICs for designing DSAs because of their low power consumption and a higher degree of parallelism. DSA design on FPGAs requires careful calibration of the FPGA compute and memory resources towards achieving optimal throughput. Hardware Descriptive Languages (HDL) like Verilog have been traditionally used to design FPGA hardware. HDLs are not geared towards any domain, and the user has to put in much effort to describe the hardware at the register transfer level. Domain Specific Languages (DSLs) and compilers have been recently used to weave together handwritten HDLs templates targeting a specific domain. Recent efforts have designed DSAs with image-processing DSLs targeting FPGAs. Image computations in the DSL are lowered to pre-existing templates or lower-level languages like HLS-C. This approach requires expensive FPGA re-flashing for every new workload. In contrast to this fixed-function hardware approach, overlays are gaining traction. Overlays are DSAs resembling a processor, which is synthesized and flashed on the FPGA once but is flexible enough to process a broad class of computations through soft reconfiguration. Less work has been reported in the context of image processing overlays. Image processing algorithms vary in size and shape, ranging from simple blurring operations to complex pyramid systems. The primary challenge in designing an image-processing overlay is maintaining flexibility in mapping different algorithms. This paper proposes a DSL-based overlay accelerator called FlowPix for image processing applications. The DSL programs are expressed as pipelines, with each stage representing a computational step in the overall algorithm. We implement 15 image-processing benchmarks using FlowPix on a Virtex-7-690t FPGA. The benchmarks range from simple blur operations to complex pipelines like Lucas-Kande optical flow. We compare FlowPix against existing DSL-to-FPGA frameworks like Hetero-Halide and Vitis Vision library that generate fixed-function hardware. On most benchmarks, we see up to 25% degradation in latency with approximately a 1.7x to 2x increase in the FPGA LUT consumption. Our ability to execute any benchmark without incurring the high costs of hardware synthesis, place-and-route, and FPGA re-flashing justifies the slight performance loss and increased resource consumption that we experience. 
FlowPix achieves an average frame rate of 170 FPS on HD frames of 1920x1080 pixels in the implemented benchmarks.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"42 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134973341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
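The abstract describes DSL programs expressed as pipelines of computational stages. The sketch below illustrates that programming model in plain Python with a hypothetical Pipeline class and a 3x3 box-blur stage; the class, method names, and stage functions are invented for illustration and are not FlowPix's API.

```python
# Hypothetical sketch of a stage-based image pipeline, loosely mirroring the
# "pipeline of computational stages" model described in the abstract. The class
# and method names are invented for illustration and are not FlowPix's API.
from typing import Callable, List
import numpy as np

class Pipeline:
    def __init__(self) -> None:
        self.stages: List[Callable[[np.ndarray], np.ndarray]] = []

    def stage(self, fn: Callable[[np.ndarray], np.ndarray]) -> "Pipeline":
        self.stages.append(fn)
        return self

    def run(self, frame: np.ndarray) -> np.ndarray:
        for fn in self.stages:
            frame = fn(frame)
        return frame

def blur3x3(img: np.ndarray) -> np.ndarray:
    # Simple 3x3 box blur expressed as shifted adds (the kind of stencil an
    # overlay would map onto its compute units).
    out = np.zeros_like(img, dtype=np.float32)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out += np.roll(np.roll(img, dy, axis=0), dx, axis=1)
    return out / 9.0

frame = np.random.rand(1080, 1920).astype(np.float32)
result = Pipeline().stage(blur3x3).stage(lambda f: np.clip(f, 0.2, 1.0)).run(frame)
print(result.shape)
```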
Fastensor: Optimise the Tensor I/O Path from SSD to GPU for Deep Learning Training
Jia Wei, Xingjun Zhang, Longxiang Wang, Zheng Wei
In recent years, benefiting from the increase in model size and complexity, deep learning has achieved tremendous success in computer vision (CV) and natural language processing (NLP). Training deep learning models with accelerators such as GPUs often requires large amounts of data to be transferred repeatedly from NVMe SSD to GPU memory. Much recent work has focused on data transfer during the pre-processing phase and has introduced techniques such as multiprocessing and GPU Direct Storage (GDS) to accelerate it. However, tensor data produced during training (such as checkpoints, logs, and intermediate feature maps), whose transfer is also time-consuming, is often moved using traditional serial, long-I/O-path methods. In this paper, building on GDS technology, we present Fastensor, an efficient tool for tensor data transfer between NVMe SSDs and GPUs. To achieve higher tensor I/O throughput, we optimized the traditional data I/O process. We also propose a data- and runtime-context-aware tensor I/O algorithm: Fastensor selects the most suitable data transfer tool for the current tensor from a candidate set of tools during model training. The optimal tool is derived from a dictionary generated by our adaptive exploration algorithm in the first few training iterations. We used Fastensor's unified interface to test the read/write bandwidth and energy consumption of different transfer tools for different tensor block sizes, and found that the execution efficiency of each tool depends on both the tensor block size and the runtime context. We then deployed Fastensor in the widely used PyTorch deep learning framework and showed that it performs well in typical scenarios of model parameter saving and intermediate feature map transfer on the same hardware configuration. Fastensor achieves a 5.37x read performance improvement compared to torch.save() when used for model parameter saving. When used for intermediate feature map transfer, Fastensor increases the supported training batch size by 20x, while the total read and write speed is increased by 2.96x compared to the torch I/O API.
{"title":"Fastensor: Optimise the Tensor I/O Path from SSD to GPU for Deep Learning Training","authors":"Jia Wei, Xingjun Zhang, Longxiang Wang, Zheng Wei","doi":"10.1145/3630108","DOIUrl":"https://doi.org/10.1145/3630108","url":null,"abstract":"In recent years, benefiting from the increase in model size and complexity, deep learning has achieved tremendous success in computer vision (CV) and natural language processing (NLP). Training deep learning models using accelerators such as GPUs often requires much iterative data to be transferred from NVMe SSD to GPU memory. Much recent work has focused on data transfer during the pre-processing phase and has introduced techniques such as multiprocessing and GPU Direct Storage (GDS) to accelerate it. However, tensor data during training (such as Checkpoints, logs, and intermediate feature maps) which is also time-consuming, is often transferred using traditional serial, long-I/O-path transfer methods. In this paper, based on GDS technology, we built Fastensor, an efficient tool for tensor data transfer between NVMe SSDs and GPUs. To achieve higher tensor data I/O throughput, we optimized the traditional data I/O process. We also proposed a data and runtime context-aware tensor I/O algorithm. Fastensor can select the most suitable data transfer tool for the current tensor from a candidate set of tools during model training. The optimal tool is derived from a dictionary generated by our adaptive exploration algorithm in the first few training iterations. We used Fastensor’s unified interface to test the read/write bandwidth and energy consumption of different transfer tools for different sizes of tensor blocks. We found that the execution efficiency of different tensor transfer tools is related to both the tensor block size and the runtime context. We then deployed Fastensor in the widely applicable Pytorch deep learning framework. We showed that Fastensor could perform superior in typical scenarios of model parameter saving and intermediate feature map transfer with the same hardware configuration. Fastensor achieves a 5.37x read performance improvement compared to torch . save () when used for model parameter saving. When used for intermediate feature map transfer, Fastensor can increase the supported training batch size by 20x, while the total read and write speed is increased by 2.96x compared to the torch I/O API.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134973614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ULEEN: A Novel Architecture for Ultra Low-Energy Edge Neural Networks
Zachary Susskind, Aman Arora, Igor D. S. Miranda, Alan T. L. Bacellar, Luis A. Q. Villon, Rafael F. Katopodis, Leandro S. de Araújo, Diego L. C. Dutra, Priscila M. V. Lima, Felipe M. G. França, Mauricio Breternitz Jr., Lizy K. John
"Extreme edge" devices such as smart sensors are a uniquely challenging environment for the deployment of machine learning. The tiny energy budgets of these devices lie beyond what is feasible for conventional deep neural networks, particularly in high-throughput scenarios, requiring us to rethink how we approach edge inference. In this work, we propose ULEEN, a model and FPGA-based accelerator architecture based on weightless neural networks (WNNs). WNNs eliminate energy-intensive arithmetic operations, instead using table lookups to perform computation, which makes them theoretically well-suited for edge inference. However, WNNs have historically suffered from poor accuracy and excessive memory usage. ULEEN incorporates algorithmic improvements and a novel training strategy inspired by binary neural networks (BNNs) to make significant strides in addressing these issues. We compare ULEEN against BNNs in software and hardware using the four MLPerf Tiny datasets and MNIST. Our FPGA implementations of ULEEN accomplish classification at 4.0-14.3 million inferences per second, improving area-normalized throughput by an average of 3.6x and steady-state energy efficiency by an average of 7.1x compared to the FPGA-based Xilinx FINN BNN inference platform. While ULEEN is not a universally applicable machine learning model, we demonstrate that it can be an excellent choice for certain applications in energy- and latency-critical edge environments.
{"title":"ULEEN: A Novel Architecture for Ultra Low-Energy Edge Neural Networks","authors":"Zachary Susskind, Aman Arora, Igor D. S. Miranda, Alan T. L. Bacellar, Luis A. Q. Villon, Rafael F. Katopodis, Leandro S. de Araújo, Diego L. C. Dutra, Priscila M. V. Lima, Felipe M. G. França, Mauricio Breternitz Jr., Lizy K. John","doi":"10.1145/3629522","DOIUrl":"https://doi.org/10.1145/3629522","url":null,"abstract":"”Extreme edge“ devices such as smart sensors are a uniquely challenging environment for the deployment of machine learning. The tiny energy budgets of these devices lie beyond what is feasible for conventional deep neural networks, particularly in high-throughput scenarios, requiring us to rethink how we approach edge inference. In this work, we propose ULEEN, a model and FPGA-based accelerator architecture based on weightless neural networks (WNNs). WNNs eliminate energy-intensive arithmetic operations, instead using table lookups to perform computation, which makes them theoretically well-suited for edge inference. However, WNNs have historically suffered from poor accuracy and excessive memory usage. ULEEN incorporates algorithmic improvements and a novel training strategy inspired by binary neural networks (BNNs) to make significant strides in addressing these issues. We compare ULEEN against BNNs in software and hardware using the four MLPerf Tiny datasets and MNIST. Our FPGA implementations of ULEEN accomplish classification at 4.0-14.3 million inferences per second, improving area-normalized throughput by an average of 3.6 × and steady-state energy efficiency by an average of 7.1 × compared to the FPGA-based Xilinx FINN BNN inference platform. While ULEEN is not a universally applicable machine learning model, we demonstrate that it can be an excellent choice for certain applications in energy- and latency-critical edge environments.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134973617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient Cross-platform Multiplexing of Hardware Performance Counters via Adaptive Grouping
Tong-yu Liu, Jianmei Guo, Bo Huang
Collecting sufficient microarchitecture performance data is essential for performance evaluation and workload characterization. A modern processor exposes many events to be monitored, while only a few hardware performance monitoring counters (PMCs) can be used at once, so multiplexing is commonly adopted. However, state-of-the-art profiling tools are often inefficient when grouping events for multiplexing PMCs, which risks inaccurate measurement and misleading analysis. Commercial tools can leverage PMCs, but they are closed-source and only support their specified platforms. To this end, we propose an approach for efficient cross-platform microarchitecture performance measurement via adaptive grouping, aiming to improve the metrics' sampling ratios. The approach generates event groups based on the number of available PMCs detected on arbitrary machines while avoiding the scheduling pitfall of the Linux perf_event subsystem. We evaluate our approach with SPEC CPU 2017 on four mainstream x86-64 and AArch64 processors and conduct comparative analyses of efficiency against two other state-of-the-art tools, LIKWID and the ARM Top-down Tool. The experimental results indicate that our approach gains around a 50% improvement in the average sampling ratio of metrics without compromising correctness or reliability.
{"title":"Efficient Cross-platform Multiplexing of Hardware Performance Counters via Adaptive Grouping","authors":"Tong-yu Liu, Jianmei Guo, Bo Huang","doi":"10.1145/3629525","DOIUrl":"https://doi.org/10.1145/3629525","url":null,"abstract":"Collecting sufficient microarchitecture performance data is essential for performance evaluation and workload characterization. There are many events to be monitored in a modern processor while only a few hardware performance monitoring counters (PMCs) can be used, so multiplexing is commonly adopted. However, inefficiency commonly exists in state-of-the-art profiling tools when grouping events for multiplexing PMCs. It has the risk of inaccurate measurement and misleading analysis. Commercial tools can leverage PMCs but they are closed-source and only support their specified platforms. To this end, we propose an approach for efficient cross-platform microarchitecture performance measurement via adaptive grouping, aiming to improve the metrics’ sampling ratios. The approach generates event groups based on the number of available PMCs detected on arbitrary machines while avoiding the scheduling pitfall of Linux perf_event subsystem. We evaluate our approach with SPEC CPU 2017 on four mainstream x86-64 and AArch64 processors and conduct comparative analyses of efficiency with two other state-of-the-art tools, LIKWID and ARM Top-down Tool. The experimental results indicate that our approach gains around 50% improvement in the average sampling ratio of metrics without compromising the correctness and reliability.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"75 6","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135511089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mapi-Pro: An Energy Efficient Memory Mapping Technique for Intermittent Computing
Satya Jaswanth Badri, Mukesh Saini, Neeraj Goel
Battery-less technology has evolved to replace battery usage in space, deep mines, and other environments, reducing cost and pollution. Non-volatile memory (NVM) based processors have been explored for saving the system state during a power failure. Such devices have a small SRAM and a large non-volatile memory. To make the system energy efficient, the SRAM must be used judiciously: portions of the application have to be selected and mapped to either SRAM or FRAM. This paper proposes an ILP-based memory mapping technique for intermittently powered IoT devices. Our technique yields an optimal mapping choice that reduces the system's Energy-Delay Product (EDP). We validated our system using TI-based MSP430FR6989 and MSP430F5529 development boards. Under stable power, our proposed memory configuration consumes 38.10% less EDP than the baseline configuration and 9.30% less EDP than the existing work. Under unstable power, it achieves 20.15% less EDP than the baseline configuration and 26.87% less EDP than the existing work. This work supports intermittent computing and operates efficiently during frequent power failures.
{"title":"Mapi-Pro: An Energy Efficient Memory Mapping Technique for Intermittent Computing","authors":"Satya Jaswanth Badri, Mukesh Saini, Neeraj Goel","doi":"10.1145/3629524","DOIUrl":"https://doi.org/10.1145/3629524","url":null,"abstract":"Battery-less technology evolved to replace battery usage in space, deep mines, and other environments to reduce cost and pollution. Non-volatile memory (NVM) based processors were explored for saving the system state during a power failure. Such devices have a small SRAM and large non-volatile memory. To make the system energy efficient, we need to use SRAM efficiently. So we must select some portions of the application and map them to either SRAM or FRAM. This paper proposes an ILP-based memory mapping technique for intermittently powered IoT devices. Our proposed technique gives an optimal mapping choice that reduces the system’s Energy-Delay Product (EDP). We validated our system using TI-based MSP430FR6989 and MSP430F5529 development boards. Our proposed memory configuration consumes 38.10% less EDP than the baseline configuration and 9.30% less EDP than the existing work under stable power. Our proposed configuration achieves 20.15% less EDP than the baseline configuration and 26.87% less EDP than the existing work under unstable power. This work supports intermittent computing and works efficiently during frequent power failures.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135618053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Characterizing Multi-Chip GPU Data Sharing
Shiqing Zhang, Mahmood Naderan-Tahan, Magnus Jahre, Lieven Eeckhout
Multi-chip GPU systems are critical to scale performance beyond a single GPU chip for a wide variety of important emerging applications. A key challenge for multi-chip GPUs, though, is how to overcome the bandwidth gap between inter-chip and intra-chip communication. Accesses to shared data, i.e., data accessed by multiple chips, pose a major performance challenge as they incur remote memory accesses, possibly congesting the inter-chip links and degrading overall system performance. This paper characterizes the shared data set in multi-chip GPUs in terms of (1) truly versus falsely shared data, (2) how the shared data set scales with input size, (3) along which dimensions the shared data set scales, and (4) how sensitive the shared data set is with respect to the input's characteristics, i.e., node degree and connectivity in graph workloads. We observe significant variety in scaling behavior across workloads: some workloads feature a shared data set that scales linearly with input size, while others feature sublinear scaling (following a √2 or ∛2 relationship). We further demonstrate how the shared data set affects the optimum last-level cache organization (memory-side versus SM-side) in multi-chip GPUs, as well as the optimum memory page allocation and thread scheduling policy. Sensitivity analyses demonstrate the insights across the broad design space.
{"title":"Characterizing Multi-Chip GPU Data Sharing","authors":"Shiqing Zhang, Mahmood Naderan-Tahan, Magnus Jahre, Lieven Eeckhout","doi":"10.1145/3629521","DOIUrl":"https://doi.org/10.1145/3629521","url":null,"abstract":"Multi-chip GPU systems are critical to scale performance beyond a single GPU chip for a wide variety of important emerging applications. A key challenge for multi-chip GPUs though is how to overcome the bandwidth gap between inter-chip and intra-chip communication. Accesses to shared data, i.e., data accessed by multiple chips, pose a major performance challenge as they incur remote memory accesses possibly congesting the inter-chip links and degrading overall system performance. This paper characterizes the shared data set in multi-chip GPUs in terms of (1) truly versus falsely shared data, (2) how the shared data set scales with input size, (3) along which dimensions the shared data set scales, and (4) how sensitive the shared data set is with respect to the input’s characteristics, i.e., node degree and connectivity in graph workloads. We observe significant variety in scaling behavior across workloads: some workloads feature a shared data set that scales linearly with input size, while others feature sublinear scaling (following a (sqrt {2} ) or (sqrt [3]{2} ) relationship). We further demonstrate how the shared data set affects the optimum last-level cache organization (memory-side versus SM-side) in multi-chip GPUs, as well as optimum memory page allocation and thread scheduling policy. Sensitivity analyses demonstrate the insights across the broad design space.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135618434","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DxPU: Large Scale Disaggregated GPU Pools in the Datacenter
Bowen He, Xiao Zheng, Yuan Chen, Weinan Li, Yajin Zhou, Xin Long, Pengcheng Zhang, Xiaowei Lu, Linquan Jiang, Qiang Liu, Dennis Cai, Xiantao Zhang
The rapid adoption of AI and the convenience offered by cloud services have resulted in growing demand for GPUs in the cloud. Generally, GPUs are physically attached to host servers as PCIe devices. However, this fixed combination of host servers and GPUs is extremely inefficient in resource utilization, upgrade, and maintenance. To address these issues, the GPU disaggregation technique has been proposed to decouple GPUs from host servers: GPUs are aggregated into a pool, and GPU node(s) are allocated according to user demands. However, existing GPU disaggregation systems have flaws in software-hardware compatibility, disaggregation scope, and capacity. In this paper, we present a new implementation of datacenter-scale GPU disaggregation, named DxPU. DxPU efficiently solves the above problems and can flexibly allocate as many GPU node(s) as users demand. To understand the performance overhead incurred by DxPU, we build a performance model for AI-specific workloads. Guided by the modeling results, we developed a prototype system, which has been deployed into the datacenter of a leading cloud provider for a test run. We also conduct detailed experiments to evaluate the performance overhead caused by our system. The results show that the overhead of DxPU is less than 10%, compared with native GPU servers, in most user scenarios.
{"title":"DxPU: Large Scale Disaggregated GPU Pools in the Datacenter","authors":"Bowen He, Xiao Zheng, Yuan Chen, Weinan Li, Yajin Zhou, Xin Long, Pengcheng Zhang, Xiaowei Lu, Linquan Jiang, Qiang Liu, Dennis Cai, Xiantao Zhang","doi":"10.1145/3617995","DOIUrl":"https://doi.org/10.1145/3617995","url":null,"abstract":"The rapid adoption of AI and convenience offered by cloud services have resulted in the growing demands for GPUs in the cloud. Generally, GPUs are physically attached to host servers as PCIe devices. However, the fixed assembly combination of host servers and GPUs is extremely inefficient in resource utilization, upgrade, and maintenance. Due to these issues, the GPU disaggregation technique has been proposed to decouple GPUs from host servers. It aggregates GPUs into a pool, and allocates GPU node(s) according to user demands. However, existing GPU disaggregation systems have flaws in software-hardware compatibility, disaggregation scope, and capacity. In this paper, we present a new implementation of datacenter-scale GPU disaggregation, named DxPU. DxPU efficiently solves the above problems and can flexibly allocate as many GPU node(s) as users demand. In order to understand the performance overhead incurred by DxPU, we build up a performance model for AI specific workloads. With the guidance of modeling results, we develop a prototype system, which has been deployed into the datacenter of a leading cloud provider for a test run. We also conduct detailed experiments to evaluate the performance overhead caused by our system. The results show that the overhead of DxPU is less than 10%, compared with native GPU servers, in most of user scenarios.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135481968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
gPPM: A Generalized Matrix Operation and Parallel Algorithm to Accelerate the Encoding/Decoding Process of Erasure Codes
Shiyi Li, Qiang Cao, Shenggang Wan, Wen Xia, Changsheng Xie
Erasure codes are widely deployed in modern storage systems, leading to frequent use of their encoding/decoding operations. The encoding/decoding process for erasure codes is generally carried out using the parity-check matrix approach. However, this approach is serial and computationally expensive, mainly because it deals with matrix operations, resulting in low encoding/decoding performance. These drawbacks are particularly evident for newer erasure codes, including SD and LRC codes. To address these limitations, this paper introduces the Partitioned and Parallel Matrix (PPM) algorithm. This algorithm partitions the parity-check matrix, parallelizes encoding/decoding operations, and optimizes the calculation sequence to facilitate fast encoding/decoding of these codes. Furthermore, we present a generalized PPM (gPPM) algorithm that surpasses PPM in performance by employing fine-grained dynamic selection of the matrix calculation sequence. Unlike PPM, gPPM is also applicable to erasure codes such as RS codes. Experimental results demonstrate that PPM improves the encoding/decoding speed of SD and LRC codes by up to 210.81%. Besides, gPPM achieves up to a 102.41% improvement over PPM and a 32.25% improvement over RS regarding encoding/decoding speed.
{"title":"gPPM: A Generalized Matrix Operation and Parallel Algorithm to Accelerate the Encoding/Decoding Process of Erasure Codes","authors":"Shiyi Li, Qiang Cao, Shenggang Wan, Wen Xia, Changsheng Xie","doi":"10.1145/3625005","DOIUrl":"https://doi.org/10.1145/3625005","url":null,"abstract":"Erasure codes are widely deployed in modern storage systems, leading to frequent usage of their encoding/decoding operations. The encoding/decoding process for erasure codes is generally carried out using the parity-check matrix approach. However, this approach is serial and computationally expensive, mainly due to dealing with matrix operations, which results in low encoding/decoding performance. These drawbacks are particularly evident for newer erasure codes, including SD and LRC codes. To address these limitations, this paper introduces the Partitioned and Parallel Matrix ( PPM ) algorithm. This algorithm partitions the parity-check matrix, parallelizes encoding/decoding operations, and optimizes calculation sequence to facilitate fast encoding/decoding of these codes. Furthermore, we present a generalized PPM ( gPPM ) algorithm that surpasses PPM in performance by employing fine-grained dynamic matrix calculation sequence selection. Unlike PPM, gPPM is also applicable to erasure codes such as RS code. Experimental results demonstrate that PPM improves the encoding/decoding speed of SD and LRC codes by up to (210.81% ) . Besides, gPPM achieves up to (102.41% ) improvement over PPM and (32.25% ) improvement over RS regarding encoding/decoding speed.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136235362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Advancing Direct Convolution using Convolution Slicing Optimization and ISA Extensions
Victor Ferrari, Rafael Sousa, Marcio Pereira, João P. L. de Carvalho, José Nelson Amaral, José Moreira, Guido Araujo
Convolution is one of the most computationally intensive operations that must be performed for machine-learning model inference. A traditional approach to computing convolutions is known as the Im2Col + BLAS method. This paper proposes SConv: a direct-convolution algorithm based on an MLIR/LLVM code-generation toolchain that can be integrated into machine-learning compilers. This algorithm introduces: (a) Convolution Slicing Analysis (CSA), a convolution-specific 3D cache-blocking analysis pass that focuses on tile reuse over the cache hierarchy; (b) Convolution Slicing Optimization (CSO), a code-generation pass that uses CSA to generate a tiled direct-convolution macro-kernel; and (c) Vector-Based Packing (VBP), an architecture-specific optimized input-tensor packing solution based on vector-register shift instructions for convolutions with unitary stride. Experiments conducted on 393 convolutions from full ONNX-MLIR machine-learning models indicate that the elimination of the Im2Col transformation and the use of fast packing routines result in a total packing time reduction, on full model inference, of 2.3x – 4.0x on Intel x86 and 3.3x – 5.9x on IBM POWER10. The speed-up over an Im2Col + BLAS method based on current BLAS implementations for end-to-end machine-learning model inference is in the range of 11% – 27% for Intel x86 and 11% – 34% for IBM POWER10 architectures. The total convolution speedup for model inference is 13% – 28% on Intel x86 and 23% – 39% on IBM POWER10. SConv also outperforms BLAS GEMM when computing pointwise convolutions in more than 82% of the 219 tested instances.
{"title":"Advancing Direct Convolution using Convolution Slicing Optimization and ISA Extensions","authors":"Victor Ferrari, Rafael Sousa, Marcio Pereira, João P. L. de Carvalho, José Nelson Amaral, José Moreira, Guido Araujo","doi":"10.1145/3625004","DOIUrl":"https://doi.org/10.1145/3625004","url":null,"abstract":"Convolution is one of the most computationally intensive operations that must be performed for machine-learning model inference. A traditional approach to computing convolutions is known as the Im2Col + BLAS method. This paper proposes SConv: a direct-convolution algorithm based on an MLIR/LLVM code-generation toolchain that can be integrated into machine-learning compilers. This algorithm introduces: (a) Convolution Slicing Analysis (CSA) — a convolution-specific 3D cache-blocking analysis pass that focuses on tile reuse over the cache hierarchy; (b) Convolution Slicing Optimization (CSO) — a code-generation pass that uses CSA to generate a tiled direct-convolution macro-kernel; and (c) Vector-Based Packing (VBP) — an architecture-specific optimized input-tensor packing solution based on vector-register shift instructions for convolutions with unitary stride. Experiments conducted on 393 convolutions from full ONNX-MLIR machine-learning models indicate that the elimination of the Im2Col transformation and the use of fast packing routines result in a total packing time reduction, on full model inference, of 2.3x – 4.0x on Intel x86 and 3.3x – 5.9x on IBM POWER10. The speed-up over an Im2Col + BLAS method based on current BLAS implementations for end-to-end machine-learning model inference is in the range of 11% – 27% for Intel x86 and 11% – 34% for IBM POWER10 architectures. The total convolution speedup for model inference is 13% – 28% on Intel x86 and 23% – 39% on IBM POWER10. SConv also outperforms BLAS GEMM, when computing pointwise convolutions in more than 82% of the 219 tested instances.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136313810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}