At the Locus of Performance: Quantifying the Effects of Copious 3D-Stacked Cache on HPC Workloads
Jens Domke, Emil Vatai, Balazs Gerofi, Yuetsu Kodama, Mohamed Wahib, Artur Podobas, Sparsh Mittal, Miquel Pericàs, Lingqi Zhang, Peng Chen, Aleksandr Drozd, Satoshi Matsuoka
Over the last three decades, innovations in the memory subsystem have primarily targeted overcoming the data movement bottleneck. In this paper, we focus on a specific market trend in memory technology: 3D-stacked memory and caches. We investigate the impact of extending the on-chip memory capabilities of future HPC-focused processors, particularly via 3D-stacked SRAM. First, we propose a method, oblivious to the memory subsystem, to gauge the upper bound on performance improvements when data movement costs are eliminated. Then, using the gem5 simulator, we model two variants of a hypothetical LARge Cache processor (LARC), fabricated in 1.5 nm and enriched with high-capacity 3D-stacked cache. Through a large volume of experiments involving a broad set of proxy applications and benchmarks, we aim to reveal how HPC CPU performance will evolve, and we conclude an average per-chip boost of 9.56x for cache-sensitive HPC applications. Additionally, we exhaustively document our methodological exploration to motivate HPC centers to drive their own technological agenda through enhanced co-design.
{"title":"At the Locus of Performance: Quantifying the Effects of Copious 3D-Stacked Cache on HPC Workloads","authors":"Jens Domke, Emil Vatai, Balazs Gerofi, Yuetsu Kodama, Mohamed Wahib, Artur Podobas, Sparsh Mittal, Miquel Pericàs, Lingqi Zhang, Peng Chen, Aleksandr Drozd, Satoshi Matsuoka","doi":"10.1145/3629520","DOIUrl":"https://doi.org/10.1145/3629520","url":null,"abstract":"Over the last three decades, innovations in the memory subsystem were primarily targeted at overcoming the data movement bottleneck. In this paper, we focus on a specific market trend in memory technology: 3D-stacked memory and caches. We investigate the impact of extending the on-chip memory capabilities in future HPC-focused processors, particularly by 3D-stacked SRAM. First, we propose a method oblivious to the memory subsystem to gauge the upper-bound in performance improvements when data movement costs are eliminated. Then, using the gem5 simulator, we model two variants of a hypothetical LARge Cache processor (LARC), fabricated in 1.5 nm and enriched with high-capacity 3D-stacked cache. With a volume of experiments involving a broad set of proxy-applications and benchmarks, we aim to reveal how HPC CPU performance will evolve, and conclude an average boost of 9.56x for cache-sensitive HPC applications, on a per-chip basis. Additionally, we exhaustively document our methodological exploration to motivate HPC centers to drive their own technological agenda through enhanced co-design.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"65 sp1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135219250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FlowPix: Accelerating Image Processing Pipelines on an FPGA Overlay using a Domain Specific Compiler
Ziaul Choudhury, Anish Gulati, Suresh Purini
The exponential performance growth promised by Moore's law has started to taper in recent years. At the same time, emerging applications like image processing demand heavy computational performance. These factors inevitably lead to the emergence of domain-specific accelerators (DSAs) to fill the performance void left by conventional architectures. FPGAs are rapidly evolving into an alternative to custom ASICs for designing DSAs because of their low power consumption and high degree of parallelism. DSA design on FPGAs requires careful calibration of the FPGA compute and memory resources to achieve optimal throughput. Hardware Description Languages (HDLs) like Verilog have traditionally been used to design FPGA hardware. HDLs are not geared towards any domain, and the user has to put in much effort to describe the hardware at the register-transfer level. Domain-Specific Languages (DSLs) and compilers have recently been used to weave together handwritten HDL templates targeting a specific domain. Recent efforts have designed DSAs with image-processing DSLs targeting FPGAs. Image computations in the DSL are lowered to pre-existing templates or lower-level languages like HLS-C. This approach requires expensive FPGA re-flashing for every new workload. In contrast to this fixed-function hardware approach, overlays are gaining traction. Overlays are DSAs resembling a processor: they are synthesized and flashed onto the FPGA once but are flexible enough to process a broad class of computations through soft reconfiguration. Less work has been reported in the context of image-processing overlays. Image-processing algorithms vary in size and shape, ranging from simple blurring operations to complex pyramid systems. The primary challenge in designing an image-processing overlay is maintaining flexibility in mapping different algorithms. This paper proposes a DSL-based overlay accelerator called FlowPix for image-processing applications. The DSL programs are expressed as pipelines, with each stage representing a computational step in the overall algorithm. We implement 15 image-processing benchmarks using FlowPix on a Virtex-7-690t FPGA. The benchmarks range from simple blur operations to complex pipelines like Lucas-Kanade optical flow. We compare FlowPix against existing DSL-to-FPGA frameworks like Hetero-Halide and the Vitis Vision library that generate fixed-function hardware. On most benchmarks, we see up to 25% degradation in latency with approximately a 1.7x to 2x increase in FPGA LUT consumption. Our ability to execute any benchmark without incurring the high costs of hardware synthesis, place-and-route, and FPGA re-flashing justifies the slight performance loss and increased resource consumption. FlowPix achieves an average frame rate of 170 FPS on HD frames of 1920x1080 pixels across the implemented benchmarks.
{"title":"FlowPix: Accelerating Image Processing Pipelines on an FPGA Overlay using a Domain Specific Compiler","authors":"Ziaul Choudhury, Anish Gulati, Suresh Purini","doi":"10.1145/3629523","DOIUrl":"https://doi.org/10.1145/3629523","url":null,"abstract":"The exponential performance growth guaranteed by Moore’s law has started to taper in recent years. At the same time, emerging applications like image processing demand heavy computational performance. These factors inevitably lead to the emergence of domain-specific accelerators (DSA) to fill the performance void left by conventional architectures. FPGAs are rapidly evolving towards becoming an alternative to custom ASICs for designing DSAs because of their low power consumption and a higher degree of parallelism. DSA design on FPGAs requires careful calibration of the FPGA compute and memory resources towards achieving optimal throughput. Hardware Descriptive Languages (HDL) like Verilog have been traditionally used to design FPGA hardware. HDLs are not geared towards any domain, and the user has to put in much effort to describe the hardware at the register transfer level. Domain Specific Languages (DSLs) and compilers have been recently used to weave together handwritten HDLs templates targeting a specific domain. Recent efforts have designed DSAs with image-processing DSLs targeting FPGAs. Image computations in the DSL are lowered to pre-existing templates or lower-level languages like HLS-C. This approach requires expensive FPGA re-flashing for every new workload. In contrast to this fixed-function hardware approach, overlays are gaining traction. Overlays are DSAs resembling a processor, which is synthesized and flashed on the FPGA once but is flexible enough to process a broad class of computations through soft reconfiguration. Less work has been reported in the context of image processing overlays. Image processing algorithms vary in size and shape, ranging from simple blurring operations to complex pyramid systems. The primary challenge in designing an image-processing overlay is maintaining flexibility in mapping different algorithms. This paper proposes a DSL-based overlay accelerator called FlowPix for image processing applications. The DSL programs are expressed as pipelines, with each stage representing a computational step in the overall algorithm. We implement 15 image-processing benchmarks using FlowPix on a Virtex-7-690t FPGA. The benchmarks range from simple blur operations to complex pipelines like Lucas-Kande optical flow. We compare FlowPix against existing DSL-to-FPGA frameworks like Hetero-Halide and Vitis Vision library that generate fixed-function hardware. On most benchmarks, we see up to 25% degradation in latency with approximately a 1.7x to 2x increase in the FPGA LUT consumption. Our ability to execute any benchmark without incurring the high costs of hardware synthesis, place-and-route, and FPGA re-flashing justifies the slight performance loss and increased resource consumption that we experience. 
FlowPix achieves an average frame rate of 170 FPS on HD frames of 1920x1080 pixels in the implemented benchmarks.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"42 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134973341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
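The abstract describes DSL programs expressed as pipelines of computational stages. The sketch below illustrates that programming model in plain Python with a hypothetical Pipeline class and a 3x3 box-blur stage; the class, method names, and stage functions are invented for illustration and are not FlowPix's API.

```python
# Hypothetical sketch of a stage-based image pipeline, loosely mirroring the
# "pipeline of computational stages" model described in the abstract. The class
# and method names are invented for illustration and are not FlowPix's API.
from typing import Callable, List
import numpy as np

class Pipeline:
    def __init__(self) -> None:
        self.stages: List[Callable[[np.ndarray], np.ndarray]] = []

    def stage(self, fn: Callable[[np.ndarray], np.ndarray]) -> "Pipeline":
        self.stages.append(fn)
        return self

    def run(self, frame: np.ndarray) -> np.ndarray:
        for fn in self.stages:
            frame = fn(frame)
        return frame

def blur3x3(img: np.ndarray) -> np.ndarray:
    # Simple 3x3 box blur expressed as shifted adds (the kind of stencil an
    # overlay would map onto its compute units).
    out = np.zeros_like(img, dtype=np.float32)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out += np.roll(np.roll(img, dy, axis=0), dx, axis=1)
    return out / 9.0

frame = np.random.rand(1080, 1920).astype(np.float32)
result = Pipeline().stage(blur3x3).stage(lambda f: np.clip(f, 0.2, 1.0)).run(frame)
print(result.shape)
```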
Fastensor: Optimise the Tensor I/O Path from SSD to GPU for Deep Learning Training
Jia Wei, Xingjun Zhang, Longxiang Wang, Zheng Wei
In recent years, benefiting from the increase in model size and complexity, deep learning has achieved tremendous success in computer vision (CV) and natural language processing (NLP). Training deep learning models with accelerators such as GPUs often requires large amounts of data to be transferred repeatedly from NVMe SSD to GPU memory. Much recent work has focused on data transfer during the pre-processing phase and has introduced techniques such as multiprocessing and GPU Direct Storage (GDS) to accelerate it. However, tensor data produced during training (such as checkpoints, logs, and intermediate feature maps), whose transfer is also time-consuming, is often moved using traditional serial, long-I/O-path methods. In this paper, building on GDS technology, we present Fastensor, an efficient tool for tensor data transfer between NVMe SSDs and GPUs. To achieve higher tensor I/O throughput, we optimized the traditional data I/O process. We also propose a data- and runtime-context-aware tensor I/O algorithm: Fastensor selects the most suitable data transfer tool for the current tensor from a candidate set of tools during model training. The optimal tool is derived from a dictionary generated by our adaptive exploration algorithm in the first few training iterations. We used Fastensor's unified interface to test the read/write bandwidth and energy consumption of different transfer tools for different tensor block sizes, and found that the execution efficiency of each tool depends on both the tensor block size and the runtime context. We then deployed Fastensor in the widely used PyTorch deep learning framework and showed that it performs well in typical scenarios of model parameter saving and intermediate feature map transfer on the same hardware configuration. Fastensor achieves a 5.37x read performance improvement compared to torch.save() when used for model parameter saving. When used for intermediate feature map transfer, Fastensor increases the supported training batch size by 20x, while the total read and write speed is increased by 2.96x compared to the torch I/O API.
{"title":"Fastensor: Optimise the Tensor I/O Path from SSD to GPU for Deep Learning Training","authors":"Jia Wei, Xingjun Zhang, Longxiang Wang, Zheng Wei","doi":"10.1145/3630108","DOIUrl":"https://doi.org/10.1145/3630108","url":null,"abstract":"In recent years, benefiting from the increase in model size and complexity, deep learning has achieved tremendous success in computer vision (CV) and natural language processing (NLP). Training deep learning models using accelerators such as GPUs often requires much iterative data to be transferred from NVMe SSD to GPU memory. Much recent work has focused on data transfer during the pre-processing phase and has introduced techniques such as multiprocessing and GPU Direct Storage (GDS) to accelerate it. However, tensor data during training (such as Checkpoints, logs, and intermediate feature maps) which is also time-consuming, is often transferred using traditional serial, long-I/O-path transfer methods. In this paper, based on GDS technology, we built Fastensor, an efficient tool for tensor data transfer between NVMe SSDs and GPUs. To achieve higher tensor data I/O throughput, we optimized the traditional data I/O process. We also proposed a data and runtime context-aware tensor I/O algorithm. Fastensor can select the most suitable data transfer tool for the current tensor from a candidate set of tools during model training. The optimal tool is derived from a dictionary generated by our adaptive exploration algorithm in the first few training iterations. We used Fastensor’s unified interface to test the read/write bandwidth and energy consumption of different transfer tools for different sizes of tensor blocks. We found that the execution efficiency of different tensor transfer tools is related to both the tensor block size and the runtime context. We then deployed Fastensor in the widely applicable Pytorch deep learning framework. We showed that Fastensor could perform superior in typical scenarios of model parameter saving and intermediate feature map transfer with the same hardware configuration. Fastensor achieves a 5.37x read performance improvement compared to torch . save () when used for model parameter saving. When used for intermediate feature map transfer, Fastensor can increase the supported training batch size by 20x, while the total read and write speed is increased by 2.96x compared to the torch I/O API.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134973614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ULEEN: A Novel Architecture for Ultra Low-Energy Edge Neural Networks
Zachary Susskind, Aman Arora, Igor D. S. Miranda, Alan T. L. Bacellar, Luis A. Q. Villon, Rafael F. Katopodis, Leandro S. de Araújo, Diego L. C. Dutra, Priscila M. V. Lima, Felipe M. G. França, Mauricio Breternitz Jr., Lizy K. John
"Extreme edge" devices such as smart sensors are a uniquely challenging environment for the deployment of machine learning. The tiny energy budgets of these devices lie beyond what is feasible for conventional deep neural networks, particularly in high-throughput scenarios, requiring us to rethink how we approach edge inference. In this work, we propose ULEEN, a model and FPGA-based accelerator architecture based on weightless neural networks (WNNs). WNNs eliminate energy-intensive arithmetic operations, instead using table lookups to perform computation, which makes them theoretically well-suited for edge inference. However, WNNs have historically suffered from poor accuracy and excessive memory usage. ULEEN incorporates algorithmic improvements and a novel training strategy inspired by binary neural networks (BNNs) to make significant strides in addressing these issues. We compare ULEEN against BNNs in software and hardware using the four MLPerf Tiny datasets and MNIST. Our FPGA implementations of ULEEN accomplish classification at 4.0-14.3 million inferences per second, improving area-normalized throughput by an average of 3.6x and steady-state energy efficiency by an average of 7.1x compared to the FPGA-based Xilinx FINN BNN inference platform. While ULEEN is not a universally applicable machine learning model, we demonstrate that it can be an excellent choice for certain applications in energy- and latency-critical edge environments.
{"title":"ULEEN: A Novel Architecture for Ultra Low-Energy Edge Neural Networks","authors":"Zachary Susskind, Aman Arora, Igor D. S. Miranda, Alan T. L. Bacellar, Luis A. Q. Villon, Rafael F. Katopodis, Leandro S. de Araújo, Diego L. C. Dutra, Priscila M. V. Lima, Felipe M. G. França, Mauricio Breternitz Jr., Lizy K. John","doi":"10.1145/3629522","DOIUrl":"https://doi.org/10.1145/3629522","url":null,"abstract":"”Extreme edge“ devices such as smart sensors are a uniquely challenging environment for the deployment of machine learning. The tiny energy budgets of these devices lie beyond what is feasible for conventional deep neural networks, particularly in high-throughput scenarios, requiring us to rethink how we approach edge inference. In this work, we propose ULEEN, a model and FPGA-based accelerator architecture based on weightless neural networks (WNNs). WNNs eliminate energy-intensive arithmetic operations, instead using table lookups to perform computation, which makes them theoretically well-suited for edge inference. However, WNNs have historically suffered from poor accuracy and excessive memory usage. ULEEN incorporates algorithmic improvements and a novel training strategy inspired by binary neural networks (BNNs) to make significant strides in addressing these issues. We compare ULEEN against BNNs in software and hardware using the four MLPerf Tiny datasets and MNIST. Our FPGA implementations of ULEEN accomplish classification at 4.0-14.3 million inferences per second, improving area-normalized throughput by an average of 3.6 × and steady-state energy efficiency by an average of 7.1 × compared to the FPGA-based Xilinx FINN BNN inference platform. While ULEEN is not a universally applicable machine learning model, we demonstrate that it can be an excellent choice for certain applications in energy- and latency-critical edge environments.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134973617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient Cross-platform Multiplexing of Hardware Performance Counters via Adaptive Grouping
Tong-yu Liu, Jianmei Guo, Bo Huang
Collecting sufficient microarchitecture performance data is essential for performance evaluation and workload characterization. A modern processor exposes many events to be monitored, while only a few hardware performance monitoring counters (PMCs) can be used at once, so multiplexing is commonly adopted. However, state-of-the-art profiling tools are often inefficient when grouping events for multiplexing PMCs, which risks inaccurate measurement and misleading analysis. Commercial tools can leverage PMCs, but they are closed-source and only support their specified platforms. To this end, we propose an approach for efficient cross-platform microarchitecture performance measurement via adaptive grouping, aiming to improve the metrics' sampling ratios. The approach generates event groups based on the number of available PMCs detected on arbitrary machines while avoiding the scheduling pitfall of the Linux perf_event subsystem. We evaluate our approach with SPEC CPU 2017 on four mainstream x86-64 and AArch64 processors and conduct comparative analyses of efficiency against two other state-of-the-art tools, LIKWID and the ARM Top-down Tool. The experimental results indicate that our approach gains around a 50% improvement in the average sampling ratio of metrics without compromising correctness or reliability.
{"title":"Efficient Cross-platform Multiplexing of Hardware Performance Counters via Adaptive Grouping","authors":"Tong-yu Liu, Jianmei Guo, Bo Huang","doi":"10.1145/3629525","DOIUrl":"https://doi.org/10.1145/3629525","url":null,"abstract":"Collecting sufficient microarchitecture performance data is essential for performance evaluation and workload characterization. There are many events to be monitored in a modern processor while only a few hardware performance monitoring counters (PMCs) can be used, so multiplexing is commonly adopted. However, inefficiency commonly exists in state-of-the-art profiling tools when grouping events for multiplexing PMCs. It has the risk of inaccurate measurement and misleading analysis. Commercial tools can leverage PMCs but they are closed-source and only support their specified platforms. To this end, we propose an approach for efficient cross-platform microarchitecture performance measurement via adaptive grouping, aiming to improve the metrics’ sampling ratios. The approach generates event groups based on the number of available PMCs detected on arbitrary machines while avoiding the scheduling pitfall of Linux perf_event subsystem. We evaluate our approach with SPEC CPU 2017 on four mainstream x86-64 and AArch64 processors and conduct comparative analyses of efficiency with two other state-of-the-art tools, LIKWID and ARM Top-down Tool. The experimental results indicate that our approach gains around 50% improvement in the average sampling ratio of metrics without compromising the correctness and reliability.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"75 6","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135511089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mapi-Pro: An Energy Efficient Memory Mapping Technique for Intermittent Computing
Satya Jaswanth Badri, Mukesh Saini, Neeraj Goel
Battery-less technology has evolved to replace battery usage in space, deep mines, and other environments, reducing cost and pollution. Non-volatile memory (NVM) based processors have been explored for saving the system state during a power failure. Such devices have a small SRAM and a large non-volatile memory. To make the system energy efficient, the SRAM must be used judiciously: portions of the application have to be selected and mapped to either SRAM or FRAM. This paper proposes an ILP-based memory mapping technique for intermittently powered IoT devices. Our technique yields an optimal mapping choice that reduces the system's Energy-Delay Product (EDP). We validated our system using TI-based MSP430FR6989 and MSP430F5529 development boards. Under stable power, our proposed memory configuration consumes 38.10% less EDP than the baseline configuration and 9.30% less EDP than the existing work. Under unstable power, it achieves 20.15% less EDP than the baseline configuration and 26.87% less EDP than the existing work. This work supports intermittent computing and operates efficiently during frequent power failures.
{"title":"Mapi-Pro: An Energy Efficient Memory Mapping Technique for Intermittent Computing","authors":"Satya Jaswanth Badri, Mukesh Saini, Neeraj Goel","doi":"10.1145/3629524","DOIUrl":"https://doi.org/10.1145/3629524","url":null,"abstract":"Battery-less technology evolved to replace battery usage in space, deep mines, and other environments to reduce cost and pollution. Non-volatile memory (NVM) based processors were explored for saving the system state during a power failure. Such devices have a small SRAM and large non-volatile memory. To make the system energy efficient, we need to use SRAM efficiently. So we must select some portions of the application and map them to either SRAM or FRAM. This paper proposes an ILP-based memory mapping technique for intermittently powered IoT devices. Our proposed technique gives an optimal mapping choice that reduces the system’s Energy-Delay Product (EDP). We validated our system using TI-based MSP430FR6989 and MSP430F5529 development boards. Our proposed memory configuration consumes 38.10% less EDP than the baseline configuration and 9.30% less EDP than the existing work under stable power. Our proposed configuration achieves 20.15% less EDP than the baseline configuration and 26.87% less EDP than the existing work under unstable power. This work supports intermittent computing and works efficiently during frequent power failures.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135618053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Characterizing Multi-Chip GPU Data Sharing
Shiqing Zhang, Mahmood Naderan-Tahan, Magnus Jahre, Lieven Eeckhout
Multi-chip GPU systems are critical to scale performance beyond a single GPU chip for a wide variety of important emerging applications. A key challenge for multi-chip GPUs, though, is how to overcome the bandwidth gap between inter-chip and intra-chip communication. Accesses to shared data, i.e., data accessed by multiple chips, pose a major performance challenge as they incur remote memory accesses, possibly congesting the inter-chip links and degrading overall system performance. This paper characterizes the shared data set in multi-chip GPUs in terms of (1) truly versus falsely shared data, (2) how the shared data set scales with input size, (3) along which dimensions the shared data set scales, and (4) how sensitive the shared data set is with respect to the input's characteristics, i.e., node degree and connectivity in graph workloads. We observe significant variety in scaling behavior across workloads: some workloads feature a shared data set that scales linearly with input size, while others feature sublinear scaling (following a √2 or ∛2 relationship). We further demonstrate how the shared data set affects the optimum last-level cache organization (memory-side versus SM-side) in multi-chip GPUs, as well as the optimum memory page allocation and thread scheduling policy. Sensitivity analyses demonstrate the insights across the broad design space.
{"title":"Characterizing Multi-Chip GPU Data Sharing","authors":"Shiqing Zhang, Mahmood Naderan-Tahan, Magnus Jahre, Lieven Eeckhout","doi":"10.1145/3629521","DOIUrl":"https://doi.org/10.1145/3629521","url":null,"abstract":"Multi-chip GPU systems are critical to scale performance beyond a single GPU chip for a wide variety of important emerging applications. A key challenge for multi-chip GPUs though is how to overcome the bandwidth gap between inter-chip and intra-chip communication. Accesses to shared data, i.e., data accessed by multiple chips, pose a major performance challenge as they incur remote memory accesses possibly congesting the inter-chip links and degrading overall system performance. This paper characterizes the shared data set in multi-chip GPUs in terms of (1) truly versus falsely shared data, (2) how the shared data set scales with input size, (3) along which dimensions the shared data set scales, and (4) how sensitive the shared data set is with respect to the input’s characteristics, i.e., node degree and connectivity in graph workloads. We observe significant variety in scaling behavior across workloads: some workloads feature a shared data set that scales linearly with input size, while others feature sublinear scaling (following a (sqrt {2} ) or (sqrt [3]{2} ) relationship). We further demonstrate how the shared data set affects the optimum last-level cache organization (memory-side versus SM-side) in multi-chip GPUs, as well as optimum memory page allocation and thread scheduling policy. Sensitivity analyses demonstrate the insights across the broad design space.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135618434","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DxPU: Large Scale Disaggregated GPU Pools in the Datacenter
Bowen He, Xiao Zheng, Yuan Chen, Weinan Li, Yajin Zhou, Xin Long, Pengcheng Zhang, Xiaowei Lu, Linquan Jiang, Qiang Liu, Dennis Cai, Xiantao Zhang
The rapid adoption of AI and the convenience offered by cloud services have resulted in growing demand for GPUs in the cloud. Generally, GPUs are physically attached to host servers as PCIe devices. However, this fixed combination of host servers and GPUs is extremely inefficient in resource utilization, upgrade, and maintenance. To address these issues, the GPU disaggregation technique has been proposed to decouple GPUs from host servers: GPUs are aggregated into a pool, and GPU node(s) are allocated according to user demands. However, existing GPU disaggregation systems have flaws in software-hardware compatibility, disaggregation scope, and capacity. In this paper, we present a new implementation of datacenter-scale GPU disaggregation, named DxPU. DxPU efficiently solves the above problems and can flexibly allocate as many GPU node(s) as users demand. To understand the performance overhead incurred by DxPU, we build a performance model for AI-specific workloads. Guided by the modeling results, we developed a prototype system, which has been deployed into the datacenter of a leading cloud provider for a test run. We also conduct detailed experiments to evaluate the performance overhead caused by our system. The results show that the overhead of DxPU is less than 10%, compared with native GPU servers, in most user scenarios.
{"title":"DxPU: Large Scale Disaggregated GPU Pools in the Datacenter","authors":"Bowen He, Xiao Zheng, Yuan Chen, Weinan Li, Yajin Zhou, Xin Long, Pengcheng Zhang, Xiaowei Lu, Linquan Jiang, Qiang Liu, Dennis Cai, Xiantao Zhang","doi":"10.1145/3617995","DOIUrl":"https://doi.org/10.1145/3617995","url":null,"abstract":"The rapid adoption of AI and convenience offered by cloud services have resulted in the growing demands for GPUs in the cloud. Generally, GPUs are physically attached to host servers as PCIe devices. However, the fixed assembly combination of host servers and GPUs is extremely inefficient in resource utilization, upgrade, and maintenance. Due to these issues, the GPU disaggregation technique has been proposed to decouple GPUs from host servers. It aggregates GPUs into a pool, and allocates GPU node(s) according to user demands. However, existing GPU disaggregation systems have flaws in software-hardware compatibility, disaggregation scope, and capacity. In this paper, we present a new implementation of datacenter-scale GPU disaggregation, named DxPU. DxPU efficiently solves the above problems and can flexibly allocate as many GPU node(s) as users demand. In order to understand the performance overhead incurred by DxPU, we build up a performance model for AI specific workloads. With the guidance of modeling results, we develop a prototype system, which has been deployed into the datacenter of a leading cloud provider for a test run. We also conduct detailed experiments to evaluate the performance overhead caused by our system. The results show that the overhead of DxPU is less than 10%, compared with native GPU servers, in most of user scenarios.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135481968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
gPPM: A Generalized Matrix Operation and Parallel Algorithm to Accelerate the Encoding/Decoding Process of Erasure Codes
Shiyi Li, Qiang Cao, Shenggang Wan, Wen Xia, Changsheng Xie
Erasure codes are widely deployed in modern storage systems, leading to frequent use of their encoding/decoding operations. The encoding/decoding process for erasure codes is generally carried out using the parity-check matrix approach. However, this approach is serial and computationally expensive, mainly because it deals with matrix operations, resulting in low encoding/decoding performance. These drawbacks are particularly evident for newer erasure codes, including SD and LRC codes. To address these limitations, this paper introduces the Partitioned and Parallel Matrix (PPM) algorithm. This algorithm partitions the parity-check matrix, parallelizes encoding/decoding operations, and optimizes the calculation sequence to facilitate fast encoding/decoding of these codes. Furthermore, we present a generalized PPM (gPPM) algorithm that surpasses PPM in performance by employing fine-grained dynamic selection of the matrix calculation sequence. Unlike PPM, gPPM is also applicable to erasure codes such as RS codes. Experimental results demonstrate that PPM improves the encoding/decoding speed of SD and LRC codes by up to 210.81%. Besides, gPPM achieves up to a 102.41% improvement over PPM and a 32.25% improvement over RS regarding encoding/decoding speed.
{"title":"gPPM: A Generalized Matrix Operation and Parallel Algorithm to Accelerate the Encoding/Decoding Process of Erasure Codes","authors":"Shiyi Li, Qiang Cao, Shenggang Wan, Wen Xia, Changsheng Xie","doi":"10.1145/3625005","DOIUrl":"https://doi.org/10.1145/3625005","url":null,"abstract":"Erasure codes are widely deployed in modern storage systems, leading to frequent usage of their encoding/decoding operations. The encoding/decoding process for erasure codes is generally carried out using the parity-check matrix approach. However, this approach is serial and computationally expensive, mainly due to dealing with matrix operations, which results in low encoding/decoding performance. These drawbacks are particularly evident for newer erasure codes, including SD and LRC codes. To address these limitations, this paper introduces the Partitioned and Parallel Matrix ( PPM ) algorithm. This algorithm partitions the parity-check matrix, parallelizes encoding/decoding operations, and optimizes calculation sequence to facilitate fast encoding/decoding of these codes. Furthermore, we present a generalized PPM ( gPPM ) algorithm that surpasses PPM in performance by employing fine-grained dynamic matrix calculation sequence selection. Unlike PPM, gPPM is also applicable to erasure codes such as RS code. Experimental results demonstrate that PPM improves the encoding/decoding speed of SD and LRC codes by up to (210.81% ) . Besides, gPPM achieves up to (102.41% ) improvement over PPM and (32.25% ) improvement over RS regarding encoding/decoding speed.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136235362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Advancing Direct Convolution using Convolution Slicing Optimization and ISA Extensions
Victor Ferrari, Rafael Sousa, Marcio Pereira, João P. L. de Carvalho, José Nelson Amaral, José Moreira, Guido Araujo
Convolution is one of the most computationally intensive operations that must be performed for machine-learning model inference. A traditional approach to computing convolutions is known as the Im2Col + BLAS method. This paper proposes SConv: a direct-convolution algorithm based on an MLIR/LLVM code-generation toolchain that can be integrated into machine-learning compilers. This algorithm introduces: (a) Convolution Slicing Analysis (CSA), a convolution-specific 3D cache-blocking analysis pass that focuses on tile reuse over the cache hierarchy; (b) Convolution Slicing Optimization (CSO), a code-generation pass that uses CSA to generate a tiled direct-convolution macro-kernel; and (c) Vector-Based Packing (VBP), an architecture-specific optimized input-tensor packing solution based on vector-register shift instructions for convolutions with unitary stride. Experiments conducted on 393 convolutions from full ONNX-MLIR machine-learning models indicate that the elimination of the Im2Col transformation and the use of fast packing routines result in a total packing time reduction, on full model inference, of 2.3x – 4.0x on Intel x86 and 3.3x – 5.9x on IBM POWER10. The speed-up over an Im2Col + BLAS method based on current BLAS implementations for end-to-end machine-learning model inference is in the range of 11% – 27% for Intel x86 and 11% – 34% for IBM POWER10 architectures. The total convolution speedup for model inference is 13% – 28% on Intel x86 and 23% – 39% on IBM POWER10. SConv also outperforms BLAS GEMM when computing pointwise convolutions in more than 82% of the 219 tested instances.
{"title":"Advancing Direct Convolution using Convolution Slicing Optimization and ISA Extensions","authors":"Victor Ferrari, Rafael Sousa, Marcio Pereira, João P. L. de Carvalho, José Nelson Amaral, José Moreira, Guido Araujo","doi":"10.1145/3625004","DOIUrl":"https://doi.org/10.1145/3625004","url":null,"abstract":"Convolution is one of the most computationally intensive operations that must be performed for machine-learning model inference. A traditional approach to computing convolutions is known as the Im2Col + BLAS method. This paper proposes SConv: a direct-convolution algorithm based on an MLIR/LLVM code-generation toolchain that can be integrated into machine-learning compilers. This algorithm introduces: (a) Convolution Slicing Analysis (CSA) — a convolution-specific 3D cache-blocking analysis pass that focuses on tile reuse over the cache hierarchy; (b) Convolution Slicing Optimization (CSO) — a code-generation pass that uses CSA to generate a tiled direct-convolution macro-kernel; and (c) Vector-Based Packing (VBP) — an architecture-specific optimized input-tensor packing solution based on vector-register shift instructions for convolutions with unitary stride. Experiments conducted on 393 convolutions from full ONNX-MLIR machine-learning models indicate that the elimination of the Im2Col transformation and the use of fast packing routines result in a total packing time reduction, on full model inference, of 2.3x – 4.0x on Intel x86 and 3.3x – 5.9x on IBM POWER10. The speed-up over an Im2Col + BLAS method based on current BLAS implementations for end-to-end machine-learning model inference is in the range of 11% – 27% for Intel x86 and 11% – 34% for IBM POWER10 architectures. The total convolution speedup for model inference is 13% – 28% on Intel x86 and 23% – 39% on IBM POWER10. SConv also outperforms BLAS GEMM, when computing pointwise convolutions in more than 82% of the 219 tested instances.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136313810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}