ACM Transactions on Architecture and Code Optimization: Latest Articles (Page 10)

SplitZNS: Towards an Efficient LSM-tree on Zoned Namespace SSDs

IF 1.6 · Zone 3, Computer Science · Q4, Computer Science, Hardware & Architecture

ACM Transactions on Architecture and Code Optimization

Pub Date : 2023-07-10 DOI: https://dl.acm.org/doi/10.1145/3608476

Dong Huang, Dan Feng, Qiankun Liu, Bo Ding, Wei Zhao, Xueliang Wei, Wei Tong

The Zoned Namespace (ZNS) Solid State Drive (SSD) is a nascent form of storage device that offers novel prospects for the Log Structured Merge Tree (LSM-tree). ZNS exposes erase blocks in SSD as append-only zones, enabling the LSM-tree to gain awareness of the physical layout of data. Nevertheless, LSM-tree on ZNS SSDs necessitates Garbage Collection (GC) owing to the mismatch between the gigantic zones and relatively small Sorted String Tables (SSTables). Through extensive experiments, we observe that a smaller zone size can reduce data migration in GC at the cost of a significant performance decline owing to inadequate parallelism exploitation. In this paper, we present SplitZNS, which introduces small zones by tweaking the zone-to-chip mapping to maximize GC efficiency for LSM-tree on ZNS SSDs. Following the multi-level peculiarity of LSM-tree and the inherent parallel architecture of ZNS SSDs, we propose a number of techniques to leverage and accelerate small zones to alleviate the performance impact due to underutilized parallelism. (1) First, we use small zones selectively to prevent exacerbating write slowdowns and stalls due to their suboptimal performance. Second, to enhance parallelism utilization, we propose (2) SubZone Ring, which employs a per-chip FIFO buffer to imitate a large zone writing style; (3) Read Prefetcher, which prefetches data concurrently through multiple chips during compactions; and (4) Read Scheduler, which assigns query requests the highest priority. We build a prototype integrated with SplitZNS to validate its efficiency and efficacy. Experimental results demonstrate that SplitZNS achieves up to 2.77× performance and reduces data migration considerably compared to the lifetime-based data placement.
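The observation that smaller zones reduce GC data migration can be illustrated with a toy model. The sketch below is not the SplitZNS algorithm; it assumes SSTables are packed into zones in write order, never span a zone boundary, and that reclaiming a zone with any dead SSTable requires copying out its live ones. The function name `gc_migration` and the sizes are hypothetical.

```python
def gc_migration(sstables, zone_capacity):
    """Return the live bytes that must be copied to reclaim every zone
    that contains at least one dead SSTable.

    sstables: list of (size, is_live) tuples in write order.
    """
    # Pack SSTables sequentially into zones (assumption: no spanning).
    zones, current, used = [], [], 0
    for size, live in sstables:
        if current and used + size > zone_capacity:
            zones.append(current)
            current, used = [], 0
        current.append((size, live))
        used += size
    if current:
        zones.append(current)

    # A zone holding dead data can only be erased after its live data moves.
    migrated = 0
    for zone in zones:
        if any(not live for _, live in zone):
            migrated += sum(size for size, live in zone if live)
    return migrated


# Eight 64 MiB SSTables, two of them already invalidated by compaction.
tables = [(64, True), (64, False), (64, True), (64, True),
          (64, True), (64, True), (64, False), (64, True)]
for capacity in (256, 128, 64):
    print(capacity, gc_migration(tables, capacity))
```

In this toy trace the migrated volume shrinks from 384 MiB to 128 MiB to 0 as the zone shrinks toward the SSTable size. The model says nothing about the parallelism loss that the abstract identifies as the cost of small zones.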


Approx-RM: Reducing Energy on Heterogeneous Multicore Processors under Accuracy and Timing Constraints

IF 1.6 · Zone 3, Computer Science · Q4, Computer Science, Hardware & Architecture

ACM Transactions on Architecture and Code Optimization

Pub Date : 2023-06-22 DOI: 10.1145/3605214

M. Azhar, M. Manivannan, P. Stenström

Reducing energy consumption while providing performance and quality guarantees is crucial for computing systems ranging from battery-powered embedded systems to data centers. This article considers approximate iterative applications executing on heterogeneous multi-core platforms under user-specified performance and quality targets. We note that allowing a slight yet bounded relaxation in solution quality can considerably reduce the required iteration count and thereby save significant amounts of energy. To this end, this article proposes Approx-RM, a resource management scheme that reduces energy expenditure while guaranteeing specified performance and accuracy targets. Approx-RM predicts the number of iterations required to meet the relaxed accuracy target at runtime. The time saved generates execution-time slack, which allows Approx-RM to allocate fewer resources on a heterogeneous multi-core platform in terms of DVFS, core type, and core count to save energy while meeting the performance target. Approx-RM contributes lightweight methods for predicting the iteration count needed to meet the accuracy target and the resources needed to meet the performance target. Approx-RM uses these predictions to allocate just enough resources to comply with quality-of-service constraints and thereby save energy. Our evaluation shows energy savings of 31.6%, on average, compared to Race-to-idle when the accuracy is relaxed by only 1%. Approx-RM incurs timing and energy overheads of less than 0.1%.
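The idea of turning a lower predicted iteration count into energy savings can be sketched as a small selection problem. This is only an illustration, not Approx-RM's predictor or allocator: the configuration names, per-iteration times, and power figures are invented, and the iteration count is assumed to be given.

```python
def pick_config(configs, predicted_iters, deadline_s):
    """Choose the lowest-energy configuration that still meets the deadline.

    configs: list of (name, seconds_per_iteration, watts) tuples.
    Returns the chosen name, or None if no configuration is fast enough.
    """
    feasible = []
    for name, sec_per_iter, watts in configs:
        runtime = predicted_iters * sec_per_iter
        if runtime <= deadline_s:
            # Energy in joules = power * time.
            feasible.append((runtime * watts, name))
    if not feasible:
        return None
    return min(feasible)[1]


# Hypothetical core-type / frequency settings.
configs = [
    ("big@2.0GHz", 0.010, 4.0),
    ("big@1.0GHz", 0.020, 1.5),
    ("little@1.0GHz", 0.040, 0.5),
]
print(pick_config(configs, 100, 2.5))  # exact accuracy target
print(pick_config(configs, 50, 2.5))   # relaxed target, fewer iterations
```

With the full 100 iterations only the two big-core settings meet the 2.5 s deadline. Halving the iteration count creates enough slack for the little core, which has the lowest energy in this made-up table.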


MFFT: A GPU Accelerated Highly Efficient Mixed-Precision Large-Scale FFT Framework

IF 1.6 · Zone 3, Computer Science · Q4, Computer Science, Hardware & Architecture

ACM Transactions on Architecture and Code Optimization

Pub Date : 2023-06-19 DOI: 10.1145/3605148

Yuwen Zhao, Fangfang Liu, Wenjing Ma, Huiyuan Li, Yuan-Xi Peng, Cui Wang

Fast Fourier transform (FFT) is widely used in computing applications in large-scale parallel programs. Data communication is the main performance bottleneck of FFT and seriously affects its parallel efficiency. To tackle this problem, we propose a new large-scale FFT framework, MFFT, which optimizes parallel FFT with a new mixed-precision optimization technique, adopting the “high precision computation, low precision communication” strategy. To enable “low precision communication”, we propose a shared-exponent floating-point number compression technique, which reduces the volume of data communication while maintaining higher accuracy. In addition, we apply a two-phase normalization technique to further reduce the round-off error. Based on the mixed-precision MFFT framework, we apply several optimization techniques to improve the performance, such as streaming of GPU kernels, MPI message combination, kernel optimization, and memory optimization. We evaluate MFFT on a system with 4,096 GPUs. The results show that shared-exponent MFFT is 1.23× faster than double-precision MFFT on average, and that double-precision MFFT achieves, on average, 3.53× and 9.48× higher performance than the open-source libraries 2Decomp&FFT (CPU-based version) and heFFTe (AMD GPU-based version), respectively. The parallel efficiency of double-precision MFFT increases from 53.2% to 78.1% compared with 2Decomp&FFT, and shared-exponent MFFT further increases the parallel efficiency to 83.8%.
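A shared-exponent encoding can be sketched generically as block floating point: one exponent per block plus a short integer mantissa per value. The code below assumes that generic scheme and a 16-bit mantissa; it is not MFFT's actual wire format, and the function names are hypothetical.

```python
import math


def compress_block(values, mantissa_bits=16):
    """Encode floats as (shared exponent, list of signed integer mantissas)."""
    peak = max((abs(v) for v in values), default=0.0)
    if peak == 0.0:
        return 0, [0] * len(values)
    # frexp gives peak = m * 2**e with 0.5 <= m < 1, so |v| < 2**e for all v.
    _, shared_exp = math.frexp(peak)
    # One bit is reserved for the sign.
    shift = mantissa_bits - 1 - shared_exp
    limit = 2 ** (mantissa_bits - 1) - 1
    mantissas = [
        max(-limit, min(limit, round(math.ldexp(v, shift)))) for v in values
    ]
    return shared_exp, mantissas


def decompress_block(shared_exp, mantissas, mantissa_bits=16):
    """Invert compress_block up to quantization error."""
    shift = mantissa_bits - 1 - shared_exp
    return [math.ldexp(m, -shift) for m in mantissas]


block = [1.0, -0.5, 0.25, 3.75, 1e-3]
exp, ints = compress_block(block)
print(exp, ints)
print(decompress_block(exp, ints))
```

Each value then travels as 16 bits instead of 64, at the cost of an absolute error of at most one quantization step, here 2**(exp - 15). Values much smaller than the block's peak lose the most relative precision, which is why a real scheme would need care in how blocks are formed and normalized.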


Hierarchical Model Parallelism for Optimizing Inference on Many-core Processor via Decoupled 3D-CNN Structure

IF 1.6 · Zone 3, Computer Science · Q4, Computer Science, Hardware & Architecture

ACM Transactions on Architecture and Code Optimization

Pub Date : 2023-06-18 DOI: 10.1145/3605149

Jiazhi Jiang, Zijiang Huang, Dan-E Huang, Jiangsu Du, Lin Chen, Ziguan Chen, Yutong Lu

The tremendous success of convolutional neural networks (CNNs) has made them ubiquitous in many fields of human endeavor. Many applications such as biomedical analysis and scientific data analysis involve analyzing volumetric data. This spawns huge demand for 3D-CNN. Although accelerators such as GPUs may provide higher throughput on deep learning applications, they may not be available in all scenarios. The CPU, especially a many-core CPU with non-uniform memory access (NUMA) architecture, remains an attractive choice for deep learning inference in many scenarios. In this article, we propose a distributed inference solution for 3D-CNN that targets the emerging ARM many-core CPU platform. A hierarchical partition approach is proposed to accelerate 3D-CNN inference by exploiting characteristics of memory and cache on ARM many-core CPUs. Based on the hierarchical model partition approach, other optimization techniques such as NUMA-aware thread scheduling and optimization of 3D-img2row convolution are designed to exploit the potential of ARM many-core CPUs for 3D-CNN. We evaluate our proposed inference solution with several classic 3D-CNNs: C3D, 3D-resnet34, 3D-resnet50, 3D-vgg11, and P3D. Our experimental results show that our solution can boost the performance of 3D-CNN inference and achieve much better scalability, with a negligible fluctuation in accuracy. When employing our 3D-CNN inference solution on ACL libraries, it can outperform naive ACL implementations by 11× to 50× on an ARM many-core processor. When employing our 3D-CNN inference solution on NCNN libraries, it can outperform the naive NCNN implementations by 5.2× to 14.2× on an ARM many-core processor.


rNdN: Fast Query Compilation for NVIDIA GPUs

IF 1.6 · Zone 3, Computer Science · Q4, Computer Science, Hardware & Architecture

ACM Transactions on Architecture and Code Optimization

Pub Date : 2023-06-09 DOI: 10.1145/3603503

Alexander Krolik, Clark Verbrugge, L. Hendren

GPU database systems are an effective solution to query optimization, particularly with compilation and data caching. They fall short, however, in end-to-end workloads, as existing compiler toolchains are too expensive for use with short-running queries. In this work, we define and evaluate a runtime-suitable query compilation pipeline for NVIDIA GPUs that extracts high performance with only minimal optimization. In particular, our balanced approach successfully trades minor slowdowns in execution for major speedups in compilation, even as data sizes increase. We demonstrate performance benefits compared to both CPU and GPU database systems using interpreters and compilers, extending query compilation for GPUs beyond cached use cases.


Cache Programming for Scientific Loops Using Leases

IF 1.6 · Zone 3, Computer Science · Q4, Computer Science, Hardware & Architecture

ACM Transactions on Architecture and Code Optimization

Pub Date : 2023-05-31 DOI: 10.1145/3600090

Ben Reber, Matthew Gould, Alexander H. Kneipp, Fangzhou Liu, Ian Prechtl, C. Ding, Linlin Chen, D. Patru

Cache management is important in exploiting locality and reducing data movement. This article studies a new type of programmable cache called the lease cache. By assigning leases, software exerts the primary control on when and how long data stays in the cache. Previous work has shown an optimal solution for an ideal lease cache. This article develops and evaluates a set of practical solutions for a physical lease cache emulated in FPGA with the full suite of PolyBench benchmarks. Compared to automatic caching, lease programming can further reduce data movement by 10% to over 60% when the data size is 16 times to 3,000 times the cache size, and the techniques in this article realize over 80% of this potential. Moreover, lease programming can reduce data movement by another 0.8% to 20% after polyhedral locality optimization.
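How a lease controls residency can be shown with a small trace simulation. The sketch assumes an idealized lease cache with unbounded capacity, where each access renews the block's lease and a block is evicted only when its lease runs out. The function `simulate_leases` and the per-block lease policy are hypothetical and do not reproduce the article's lease-assignment algorithms or its FPGA design.

```python
def simulate_leases(trace, lease_of):
    """Count misses for an access trace under per-block leases.

    trace:    sequence of block identifiers, one access per time step.
    lease_of: function mapping a block to its lease length in time steps.
    A block accessed at time t with lease L stays cached through time t + L.
    """
    expiry = {}  # block -> last time step at which it is still cached
    misses = 0
    for now, block in enumerate(trace):
        if expiry.get(block, -1) < now:
            misses += 1  # never cached, or the lease has expired
        expiry[block] = now + lease_of(block)  # every access renews the lease
    return misses


trace = ["a", "b", "a", "c", "b", "a"]
for lease in (0, 2, 3):
    print(lease, simulate_leases(trace, lambda block: lease))
```

On this trace a uniform lease of 0 misses on all six accesses, a lease of 2 leaves five misses, and a lease of 3 leaves only the three cold misses. Longer leases also keep more blocks resident at once, which is the capacity pressure a physical lease cache has to manage.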


MPU: Memory-centric SIMT Processor via In-DRAM Near-bank Computing

IF 1.6 · Zone 3, Computer Science · Q4, Computer Science, Hardware & Architecture

ACM Transactions on Architecture and Code Optimization

Pub Date : 2023-05-29 DOI: 10.1145/3603113

Xinfeng Xie, P. Gu, Yufei Ding, Dimin Niu, Hongzhong Zheng, Yuan Xie

With the growing number of data-intensive workloads, GPU, which is the state-of-the-art single-instruction-multiple-thread (SIMT) processor, is hindered by the memory bandwidth wall. To alleviate this bottleneck, previously proposed 3D-stacking near-bank computing accelerators benefit from abundant bank-internal bandwidth by bringing computations closer to the DRAM banks. However, these accelerators are specialized for certain application domains with simple architecture data paths and customized software mapping schemes. For general-purpose scenarios, lightweight hardware designs for diverse data paths, architectural supports for the SIMT programming model, and end-to-end software optimizations remain challenging. To address these issues, we propose Memory-centric Processing Unit (MPU), the first SIMT processor based on 3D-stacking near-bank computing architecture. First, to realize diverse data paths with small overheads, MPU adopts a hybrid pipeline with the capability of offloading instructions to near-bank compute-logic. Second, we explore two architectural supports for the SIMT programming model, including a near-bank shared memory design and a multiple activated row-buffers enhancement. Third, we present an end-to-end compilation flow for MPU to support CUDA programs. To fully utilize MPU’s hybrid pipeline, we develop a backend optimization for the instruction offloading decision. The evaluation results of MPU demonstrate 3.46× speedup and 2.57× energy reduction compared with an NVIDIA Tesla V100 GPU on a set of representative data-intensive workloads.


The Impact of Page Size and Microarchitecture on Instruction Address Translation Overhead

IF 1.6 · Zone 3, Computer Science · Q4, Computer Science, Hardware & Architecture

ACM Transactions on Architecture and Code Optimization

Pub Date : 2023-05-27 DOI: 10.1145/3600089

Yufeng Zhou, A. Cox, S. Dwarkadas, Xiaowan Dong

As the volume of data processed by applications has increased, considerable attention has been paid to data address translation overheads, leading to the widespread use of larger page sizes (“superpages”) and multi-level translation lookaside buffers (TLBs). However, far less attention has been paid to instruction address translation and its relation to TLB and pipeline structure. In prior work, we quantified the impact of using code superpages on a variety of widely used applications, ranging from compilers to web user-interface frameworks, and the impact of sharing page table pages for executables and shared libraries. Within this article, we augment those results by first uncovering the effects that microarchitectural differences between Intel Skylake and AMD Zen+, particularly their different TLB organizations, have on instruction address translation overhead. This analysis provides some key insights into the microarchitectural design decisions that impact the cost of instruction address translation. First, a lower-level (level 2) TLB that has both instruction and data mappings competing for space within the same structure allows better overall performance and utilization when using code superpages. Code superpages not only reduce instruction address translation overhead but also indirectly reduce data address translation overhead. In fact, for a few applications, the use of just a few code superpages has a larger impact on overall performance than the use of a much larger number of data superpages. Second, a level 1 (L1) TLB with separate structures for different page sizes may require careful tuning of the superpage promotion policy for code, and a correspondingly suboptimal utilization of the level 2 TLB. In particular, increasing the number of superpages when the size of the L1 superpage structure is small may result in more L1 TLB misses for some applications. 
Moreover, on some microarchitectures, the cost of these misses can be highly variable, because replacement is delayed until all of the in-flight instructions mapped by the victim entry are retired. Hence, more superpage promotions can result in a performance regression. Finally, our findings also make a case for first-class OS support for superpages on ordinary files containing executables and shared libraries, as well as a more aggressive superpage policy for code.
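The benefit of superpages is often summarized as TLB reach: the amount of address space a TLB can map without missing. The arithmetic below is a minimal sketch; the entry count of 64 is a hypothetical value chosen for illustration and is not a Skylake or Zen+ parameter.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach(entries, page_size):
    """Bytes of address space covered when every TLB entry maps one page."""
    return entries * page_size


ENTRIES = 64  # hypothetical TLB size, for illustration only
base = tlb_reach(ENTRIES, 4 * KIB)       # 4 KiB base pages
superpage = tlb_reach(ENTRIES, 2 * MIB)  # 2 MiB superpages
print(base // KIB, "KiB with base pages")      # 256 KiB
print(superpage // MIB, "MiB with superpages")  # 128 MiB
```

The same number of entries covers 512 times more code with 2 MiB pages. As the abstract notes, this only helps if the TLB structure that holds superpage entries is large enough, since some designs keep separate, smaller structures per page size.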

随着应用程序处理的数据量的增加，数据地址转换开销受到了相当大的关注，导致广泛使用更大的页面大小(“超级页面”)和多级翻译备用缓冲区(tlb)。然而，指令地址的翻译及其与TLB和管道结构的关系却很少受到关注。在之前的工作中，我们量化了使用代码超页对各种广泛使用的应用程序(从编译器到web用户界面框架)的影响，以及为可执行文件和共享库共享页表页面的影响。在本文中，我们通过首先揭示英特尔Skylake和AMD Zen+之间的微架构差异(特别是它们不同的TLB组织)对指令地址转换开销的影响来增强这些结果。这一分析为影响指令地址转换成本的微架构设计决策提供了一些关键见解。首先，在使用代码超页时，具有指令和数据映射竞争同一结构中的空间的较低级(2级)TLB允许更好的总体性能和利用率。代码超页不仅减少了指令地址转换开销，而且间接地减少了数据地址转换开销。事实上，对于一些应用程序，使用少量代码超页比使用大量数据超页对整体性能的影响更大。其次，具有针对不同页面大小的单独结构的1级(L1) TLB可能需要仔细调整代码的超页提升策略，并相应地对2级TLB进行次优利用。特别是，当L1超页结构的大小较小时，增加超页的数量可能会导致某些应用程序丢失更多的L1 TLB。此外，在一些微体系结构上，这些失误的代价可能是高度可变的，因为替换会延迟到受害者条目所映射的所有运行中的指令都退役为止。因此，更多的超级页面促销可能会导致性能下降。最后，我们的研究结果还为包含可执行文件和共享库的普通文件上的超级页提供了一流的操作系统支持，并为代码提供了更积极的超级页策略。

Citations: 0

GraphTune: An Efficient Dependency-Aware Substrate to Alleviate Irregularity in Concurrent Graph Processing

IF 1.6; Zone 3, Computer Science; Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Architecture and Code Optimization

Pub Date : 2023-05-26 DOI: 10.1145/3600091

Jin Zhao, Yu Zhang, Ligang He, Qikun Li, Xiang-dong Zhang, Xinyu Jiang, Hui Yu, Xiaofei Liao, Hai Jin, Lin Gu, Haikun Liu, Bin He, Ji Zhang, Xianzheng Song, Lin Wang, Jun Zhou

With the increasing need for graph analysis, massive Concurrent iterative Graph Processing (CGP) jobs are usually performed on a common large-scale real-world graph. Although several solutions have been proposed, these CGP jobs are not coordinated with consideration of the inherent dependencies in graph data driven by graph topology. As a result, they suffer from redundant and fragmented accesses to the same underlying graph dispersed over a distributed platform, because the same graph is typically traversed irregularly by these jobs along different paths at the same time. In this work, we develop GraphTune, which can be integrated into existing distributed graph processing systems, such as D-Galois, Gemini, PowerGraph, and Chaos, to efficiently perform CGP jobs and enhance system throughput. The key component of GraphTune is a dependency-aware synchronous execution engine in conjunction with several optimization strategies based on the constructed cross-iteration dependency graph of chunks. Specifically, GraphTune transparently regularizes the processing behavior of the CGP jobs in a novel synchronous way and assigns the chunks of graph data to be handled by them based on the topological order of the dependency graph so as to maximize performance. In this way, it can transform the irregular accesses to the chunks into more regular ones so that as many CGP jobs as possible can fully share the data accesses to the common graph. Meanwhile, it also efficiently synchronizes the communications launched by different CGP jobs based on the dependency graph to minimize the communication cost. We integrate it into four cutting-edge distributed graph processing systems and a popular out-of-core graph processing system to demonstrate the efficiency of GraphTune. Experimental results show that GraphTune improves the throughput of CGP jobs by 3.1∼6.2, 3.8∼8.5, 3.5∼10.8, 4.3∼12.4, and 3.8∼6.9 times over D-Galois, Gemini, PowerGraph, Chaos, and GraphChi, respectively.
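The idea of assigning chunks in the topological order of a dependency graph can be sketched as follows. This is an illustrative toy, not GraphTune's implementation: the chunk names, the `deps` mapping, and the load-counting helper are all hypothetical, and the dependency graph is assumed to be acyclic with every chunk listed as a key. It uses Kahn's algorithm to derive one order that every job follows, so that a chunk loaded once can be shared by all jobs.

```python
from collections import deque


def topological_order(deps):
    """Return chunks in dependency order using Kahn's algorithm.

    deps maps each chunk to the set of chunks it depends on.
    Every chunk that appears in a dependency set must also be a key.
    """
    indegree = {chunk: len(preds) for chunk, preds in deps.items()}
    dependents = {chunk: [] for chunk in deps}
    for chunk, preds in deps.items():
        for pred in preds:
            dependents[pred].append(chunk)

    # Sorting keeps the result deterministic when several chunks are ready.
    ready = deque(sorted(c for c, d in indegree.items() if d == 0))
    order = []
    while ready:
        chunk = ready.popleft()
        order.append(chunk)
        for nxt in sorted(dependents[chunk]):
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(deps):
        raise ValueError("dependency graph has a cycle")
    return order


def chunk_loads(order, num_jobs: int, shared: bool) -> int:
    """Chunk loads needed if jobs share one schedule or each load on their own."""
    return len(order) if shared else len(order) * num_jobs


# Hypothetical dependency graph over four chunks.
deps = {"c0": set(), "c1": {"c0"}, "c2": {"c0"}, "c3": {"c1", "c2"}}
order = topological_order(deps)                          # ['c0', 'c1', 'c2', 'c3']
independent = chunk_loads(order, num_jobs=3, shared=False)  # 12 loads
coordinated = chunk_loads(order, num_jobs=3, shared=True)   # 4 loads
```

With three jobs following the same order, each chunk is fetched once instead of three times, which is the sharing effect the abstract describes in much simplified form.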

Citations: 0

TNT: A Modular Approach to Traversing Physically Heterogeneous NOCs at Bare-wire Latency

IF 1.6; Zone 3, Computer Science; Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Architecture and Code Optimization

Pub Date : 2023-05-22 DOI: 10.1145/3597611

Gokul Subramanian Ravi, T. Krishna, Mikko H. Lipasti

The ideal latency for on-chip network traversal would be the delay incurred from wire traversal alone. Unfortunately, in a realistic modular network, the latency for a packet to traverse the network is significantly higher than this wire delay. The main limiter to achieving lower latency is the modular quantization of network traversal into hops. Beyond this, the physical heterogeneity in real-world systems further complicates the ability to reach ideal wire-only delay. In this work, we propose TNT, or Transparent Network Traversal. TNT targets ideal network latency by attempting source-to-destination network traversal as a single multi-cycle ‘long-hop’, bypassing the quantization effects of intermediate routers via transparent data/information flow. TNT is built in a modular, tile-scalable manner via a novel control path that performs neighbor-to-neighbor interactions but enables end-to-end transparent flit traversal. Further, TNT’s fine-grained on-the-fly delay tracking allows it to cope with physical NOC heterogeneity across the chip. Analysis on Ligra graph workloads shows that TNT can reduce NOC latency by as much as 43% compared to the state of the art and allows efficiency gains of up to 38%. Further, it can achieve more than 3x the benefits of the best/closest alternative research proposal, SMART [43].
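The gap between hop-quantized latency and wire-only latency can be made concrete with a small calculation. The model below is a simplified sketch, not TNT's mechanism: it assumes each hop pays a fixed router cost plus its link delay rounded up to whole cycles, while the idealized long-hop rounds the summed wire delay up only once. The link delays, clock period, and router cost are hypothetical, and control-path setup and contention are ignored.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def hop_by_hop_cycles(link_delays_ps, period_ps: int, router_cycles: int) -> int:
    """Latency when every hop is quantized to whole cycles and pays a router cost."""
    return sum(router_cycles + ceil_div(d, period_ps) for d in link_delays_ps)


def long_hop_cycles(link_delays_ps, period_ps: int) -> int:
    """Latency when the whole path is traversed as one multi-cycle hop."""
    return ceil_div(sum(link_delays_ps), period_ps)


# Hypothetical heterogeneous path: one link is twice as long as the others.
link_delays_ps = [300, 300, 600, 300]   # wire delay per link, in picoseconds
period_ps = 1000                        # 1 GHz clock

quantized = hop_by_hop_cycles(link_delays_ps, period_ps, router_cycles=1)  # 8
wire_only = long_hop_cycles(link_delays_ps, period_ps)                     # 2
```

Under these assumptions the four-hop path costs 8 cycles hop by hop but only 2 cycles as a single long-hop, because 1500 ps of wire delay no longer pays per-hop rounding and router overhead.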

Citations: 0