Pub Date : 2023-07-10DOI: https://dl.acm.org/doi/10.1145/3608476
Dong Huang, Dan Feng, Qiankun Liu, Bo Ding, Wei Zhao, Xueliang Wei, Wei Tong
The Zoned Namespace (ZNS) Solid State Drive (SSD) is a nascent form of storage device that offers novel prospects for the Log Structured Merge Tree (LSM-tree). ZNS exposes erase blocks in SSD as append-only zones, enabling the LSM-tree to gain awareness of the physical layout of data. Nevertheless, LSM-tree on ZNS SSDs necessitates Garbage Collection (GC) owing to the mismatch between the gigantic zones and relatively small Sorted String Tables (SSTables). Through extensive experiments, we observe that a smaller zone size can reduce data migration in GC at the cost of a significant performance decline owing to inadequate parallelism exploitation. In this paper, we present SplitZNS, which introduces small zones by tweaking the zone-to-chip mapping to maximize GC efficiency for LSM-tree on ZNS SSDs. Following the multi-level peculiarity of LSM-tree and the inherent parallel architecture of ZNS SSDs, we propose a number of techniques to leverage and accelerate small zones to alleviate the performance impact due to underutilized parallelism. (1) First, we use small zones selectively to prevent exacerbating write slowdowns and stalls due to their suboptimal performance. (2) Second, to enhance parallelism utilization, we propose SubZone Ring, which employs a per-chip FIFO buffer to imitate a large zone writing style; (3) Read Prefetcher, which prefetches data concurrently through multiple chips during compactions; (4) and Read Scheduler, which assigns query requests the highest priority. We build a prototype integrated with SplitZNS to validate its efficiency and efficacy. Experimental results demonstrate that SplitZNS achieves up to 2.77x performance and reduces data migration considerably compared to the lifetime-based data placement.1
{"title":"SplitZNS: Towards an Efficient LSM-tree on Zoned Namespace SSDs","authors":"Dong Huang, Dan Feng, Qiankun Liu, Bo Ding, Wei Zhao, Xueliang Wei, Wei Tong","doi":"https://dl.acm.org/doi/10.1145/3608476","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3608476","url":null,"abstract":"<p>The Zoned Namespace (ZNS) Solid State Drive (SSD) is a nascent form of storage device that offers novel prospects for the Log Structured Merge Tree (LSM-tree). ZNS exposes erase blocks in SSD as append-only zones, enabling the LSM-tree to gain awareness of the physical layout of data. Nevertheless, LSM-tree on ZNS SSDs necessitates Garbage Collection (GC) owing to the mismatch between the gigantic zones and relatively small Sorted String Tables (SSTables). Through extensive experiments, we observe that a smaller zone size can reduce data migration in GC at the cost of a significant performance decline owing to inadequate parallelism exploitation. In this paper, we present SplitZNS, which introduces small zones by tweaking the zone-to-chip mapping to maximize GC efficiency for LSM-tree on ZNS SSDs. Following the multi-level peculiarity of LSM-tree and the inherent parallel architecture of ZNS SSDs, we propose a number of techniques to leverage and accelerate small zones to alleviate the performance impact due to underutilized parallelism. (1) First, we use small zones selectively to prevent exacerbating write slowdowns and stalls due to their suboptimal performance. (2) Second, to enhance parallelism utilization, we propose SubZone Ring, which employs a per-chip FIFO buffer to imitate a large zone writing style; (3) Read Prefetcher, which prefetches data concurrently through multiple chips during compactions; (4) and Read Scheduler, which assigns query requests the highest priority. We build a prototype integrated with SplitZNS to validate its efficiency and efficacy. Experimental results demonstrate that SplitZNS achieves up to 2.77x performance and reduces data migration considerably compared to the lifetime-based data placement.<sup>1</sup></p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"13 4","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138509526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reducing energy consumption while providing performance and quality guarantees is crucial for computing systems ranging from battery-powered embedded systems to data centers. This article considers approximate iterative applications executing on heterogeneous multi-core platforms under user-specified performance and quality targets. We note that allowing a slight yet bounded relaxation in solution quality can considerably reduce the required iteration count and thereby can save significant amounts of energy. To this end, this article proposes Approx-RM, a resource management scheme that reduces energy expenditure while guaranteeing a specified performance as well as accuracy target. Approx-RM predicts the number of iterations required to meet the relaxed accuracy target at runtime. The time saved generates execution-time slack, which allows Approx-RM to allocate fewer resources on a heterogeneous multi-core platform in terms of DVFS, core type, and core count to save energy while meeting the performance target. Approx-RM contributes with lightweight methods for predicting the iteration count needed to meet the accuracy target and the resources needed to meet the performance target. Approx-RM uses the aforementioned predictions to allocate just enough resources to comply with quality of service constraints to save energy. Our evaluation shows energy savings of 31.6%, on average, compared to Race-to-idle when the accuracy is only relaxed by 1%. Approx-RM incurs timing and energy overheads of less than 0.1%.
{"title":"Approx-RM: Reducing Energy on Heterogeneous Multicore Processors under Accuracy and Timing Constraints","authors":"M. Azhar, M. Manivannan, P. Stenström","doi":"10.1145/3605214","DOIUrl":"https://doi.org/10.1145/3605214","url":null,"abstract":"Reducing energy consumption while providing performance and quality guarantees is crucial for computing systems ranging from battery-powered embedded systems to data centers. This article considers approximate iterative applications executing on heterogeneous multi-core platforms under user-specified performance and quality targets. We note that allowing a slight yet bounded relaxation in solution quality can considerably reduce the required iteration count and thereby can save significant amounts of energy. To this end, this article proposes Approx-RM, a resource management scheme that reduces energy expenditure while guaranteeing a specified performance as well as accuracy target. Approx-RM predicts the number of iterations required to meet the relaxed accuracy target at runtime. The time saved generates execution-time slack, which allows Approx-RM to allocate fewer resources on a heterogeneous multi-core platform in terms of DVFS, core type, and core count to save energy while meeting the performance target. Approx-RM contributes with lightweight methods for predicting the iteration count needed to meet the accuracy target and the resources needed to meet the performance target. Approx-RM uses the aforementioned predictions to allocate just enough resources to comply with quality of service constraints to save energy. Our evaluation shows energy savings of 31.6%, on average, compared to Race-to-idle when the accuracy is only relaxed by 1%. Approx-RM incurs timing and energy overheads of less than 0.1%.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"32 1","pages":"1 - 25"},"PeriodicalIF":1.6,"publicationDate":"2023-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76940819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yuwen Zhao, Fangfang Liu, Wenjing Ma, Huiyuan Li, Yuan-Xi Peng, Cui Wang
Fast Fourier transform (FFT) is widely used in computing applications in large-scale parallel programs, and data communication is the main performance bottleneck of FFT and seriously affects its parallel efficiency. To tackle this problem, we propose a new large-scale FFT framework, MFFT, which optimizes parallel FFT with a new mixed-precision optimization technique, adopting the “high precision computation, low precision communication” strategy. To enable “low precision communication”, we propose a shared-exponent floating-point number compression technique, which reduces the volume of data communication, while maintaining higher accuracy. In addition, we apply a two-phase normalization technique to further reduce the round-off error. Based on the mixed-precision MFFT framework, we apply several optimization techniques to improve the performance, such as streaming of GPU kernels, MPI message combination, kernel optimization, and memory optimization. We evaluate MFFT on a system with 4,096 GPUs. The results show that shared-exponent MFFT is 1.23 × faster than that of double-precision MFFT on average, and double-precision MFFT achieves performance 3.53× and 9.48× on average higher than open source library 2Decomp&FFT (CPU-based version) and heFFTe (AMD GPU-based version), respectively. The parallel efficiency of double-precision MFFT increased from 53.2% to 78.1% compared with 2Decomp&FFT, and shared-exponent MFFT further increases the parallel efficiency to 83.8%.
{"title":"MFFT: A GPU Accelerated Highly Efficient Mixed-Precision Large-Scale FFT Framework","authors":"Yuwen Zhao, Fangfang Liu, Wenjing Ma, Huiyuan Li, Yuan-Xi Peng, Cui Wang","doi":"10.1145/3605148","DOIUrl":"https://doi.org/10.1145/3605148","url":null,"abstract":"Fast Fourier transform (FFT) is widely used in computing applications in large-scale parallel programs, and data communication is the main performance bottleneck of FFT and seriously affects its parallel efficiency. To tackle this problem, we propose a new large-scale FFT framework, MFFT, which optimizes parallel FFT with a new mixed-precision optimization technique, adopting the “high precision computation, low precision communication” strategy. To enable “low precision communication”, we propose a shared-exponent floating-point number compression technique, which reduces the volume of data communication, while maintaining higher accuracy. In addition, we apply a two-phase normalization technique to further reduce the round-off error. Based on the mixed-precision MFFT framework, we apply several optimization techniques to improve the performance, such as streaming of GPU kernels, MPI message combination, kernel optimization, and memory optimization. We evaluate MFFT on a system with 4,096 GPUs. The results show that shared-exponent MFFT is 1.23 × faster than that of double-precision MFFT on average, and double-precision MFFT achieves performance 3.53× and 9.48× on average higher than open source library 2Decomp&FFT (CPU-based version) and heFFTe (AMD GPU-based version), respectively. The parallel efficiency of double-precision MFFT increased from 53.2% to 78.1% compared with 2Decomp&FFT, and shared-exponent MFFT further increases the parallel efficiency to 83.8%.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"88 1","pages":"1 - 23"},"PeriodicalIF":1.6,"publicationDate":"2023-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77707368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jiazhi Jiang, Zijiang Huang, Dan-E Huang, Jiangsu Du, Lin Chen, Ziguan Chen, Yutong Lu
The tremendous success of convolutional neural network (CNN) has made it ubiquitous in many fields of human endeavor. Many applications such as biomedical analysis and scientific data analysis involve analyzing volumetric data. This spawns huge demand for 3D-CNN. Although accelerators such as GPU may provide higher throughput on deep learning applications, they may not be available in all scenarios. CPU, especially many-core CPU with non-uniform memory access (NUMA) architecture, remains an attractive choice for deep learning inference in many scenarios. In this article, we propose a distributed inference solution for 3D-CNN that targets on the emerging ARM many-core CPU platform. A hierarchical partition approach is claimed to accelerate 3D-CNN inference by exploiting characteristics of memory and cache on ARM many-core CPU. Based on the hierarchical model partition approach, other optimization techniques such as NUMA-aware thread scheduling and optimization of 3D-img2row convolution are designed to exploit the potential of ARM many-core CPU for 3D-CNN. We evaluate our proposed inference solution with several classic 3D-CNNs: C3D, 3D-resnet34, 3D-resnet50, 3D-vgg11, and P3D. Our experimental results show that our solution can boost the performance of the 3D-CNN inference, and achieve much better scalability, with a negligible fluctuation in accuracy. When employing our 3D-CNN inference solution on ACL libraries, it can outperform naive ACL implementations by 11× to 50× on ARM many-core processor. When employing our 3D-CNN inference solution on NCNN libraries, it can outperform the naive NCNN implementations by 5.2× to 14.2× on ARM many-core processor.
{"title":"Hierarchical Model Parallelism for Optimizing Inference on Many-core Processor via Decoupled 3D-CNN Structure","authors":"Jiazhi Jiang, Zijiang Huang, Dan-E Huang, Jiangsu Du, Lin Chen, Ziguan Chen, Yutong Lu","doi":"10.1145/3605149","DOIUrl":"https://doi.org/10.1145/3605149","url":null,"abstract":"The tremendous success of convolutional neural network (CNN) has made it ubiquitous in many fields of human endeavor. Many applications such as biomedical analysis and scientific data analysis involve analyzing volumetric data. This spawns huge demand for 3D-CNN. Although accelerators such as GPU may provide higher throughput on deep learning applications, they may not be available in all scenarios. CPU, especially many-core CPU with non-uniform memory access (NUMA) architecture, remains an attractive choice for deep learning inference in many scenarios. In this article, we propose a distributed inference solution for 3D-CNN that targets on the emerging ARM many-core CPU platform. A hierarchical partition approach is claimed to accelerate 3D-CNN inference by exploiting characteristics of memory and cache on ARM many-core CPU. Based on the hierarchical model partition approach, other optimization techniques such as NUMA-aware thread scheduling and optimization of 3D-img2row convolution are designed to exploit the potential of ARM many-core CPU for 3D-CNN. We evaluate our proposed inference solution with several classic 3D-CNNs: C3D, 3D-resnet34, 3D-resnet50, 3D-vgg11, and P3D. Our experimental results show that our solution can boost the performance of the 3D-CNN inference, and achieve much better scalability, with a negligible fluctuation in accuracy. When employing our 3D-CNN inference solution on ACL libraries, it can outperform naive ACL implementations by 11× to 50× on ARM many-core processor. When employing our 3D-CNN inference solution on NCNN libraries, it can outperform the naive NCNN implementations by 5.2× to 14.2× on ARM many-core processor.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"28 6 1","pages":"1 - 21"},"PeriodicalIF":1.6,"publicationDate":"2023-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87658453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GPU database systems are an effective solution to query optimization, particularly with compilation and data caching. They fall short, however, in end-to-end workloads, as existing compiler toolchains are too expensive for use with short-running queries. In this work, we define and evaluate a runtime-suitable query compilation pipeline for NVIDIA GPUs that extracts high performance with only minimal optimization. In particular, our balanced approach successfully trades minor slowdowns in execution for major speedups in compilation, even as data sizes increase. We demonstrate performance benefits compared to both CPU and GPU database systems using interpreters and compilers, extending query compilation for GPUs beyond cached use cases.
{"title":"rNdN: Fast Query Compilation for NVIDIA GPUs","authors":"Alexander Krolik, Clark Verbrugge, L. Hendren","doi":"10.1145/3603503","DOIUrl":"https://doi.org/10.1145/3603503","url":null,"abstract":"GPU database systems are an effective solution to query optimization, particularly with compilation and data caching. They fall short, however, in end-to-end workloads, as existing compiler toolchains are too expensive for use with short-running queries. In this work, we define and evaluate a runtime-suitable query compilation pipeline for NVIDIA GPUs that extracts high performance with only minimal optimization. In particular, our balanced approach successfully trades minor slowdowns in execution for major speedups in compilation, even as data sizes increase. We demonstrate performance benefits compared to both CPU and GPU database systems using interpreters and compilers, extending query compilation for GPUs beyond cached use cases.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"301 1","pages":"1 - 25"},"PeriodicalIF":1.6,"publicationDate":"2023-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79720139","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ben Reber, Matthew Gould, Alexander H. Kneipp, Fangzhou Liu, Ian Prechtl, C. Ding, Linlin Chen, D. Patru
Cache management is important in exploiting locality and reducing data movement. This article studies a new type of programmable cache called the lease cache. By assigning leases, software exerts the primary control on when and how long data stays in the cache. Previous work has shown an optimal solution for an ideal lease cache. This article develops and evaluates a set of practical solutions for a physical lease cache emulated in FPGA with the full suite of PolyBench benchmarks. Compared to automatic caching, lease programming can further reduce data movement by 10% to over 60% when the data size is 16 times to 3,000 times the cache size, and the techniques in this article realize over 80% of this potential. Moreover, lease programming can reduce data movement by another 0.8% to 20% after polyhedral locality optimization.
{"title":"Cache Programming for Scientific Loops Using Leases","authors":"Ben Reber, Matthew Gould, Alexander H. Kneipp, Fangzhou Liu, Ian Prechtl, C. Ding, Linlin Chen, D. Patru","doi":"10.1145/3600090","DOIUrl":"https://doi.org/10.1145/3600090","url":null,"abstract":"Cache management is important in exploiting locality and reducing data movement. This article studies a new type of programmable cache called the lease cache. By assigning leases, software exerts the primary control on when and how long data stays in the cache. Previous work has shown an optimal solution for an ideal lease cache. This article develops and evaluates a set of practical solutions for a physical lease cache emulated in FPGA with the full suite of PolyBench benchmarks. Compared to automatic caching, lease programming can further reduce data movement by 10% to over 60% when the data size is 16 times to 3,000 times the cache size, and the techniques in this article realize over 80% of this potential. Moreover, lease programming can reduce data movement by another 0.8% to 20% after polyhedral locality optimization.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"32 1","pages":"1 - 25"},"PeriodicalIF":1.6,"publicationDate":"2023-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81103003","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the growing number of data-intensive workloads, GPU, which is the state-of-the-art single-instruction-multiple-thread (SIMT) processor, is hindered by the memory bandwidth wall. To alleviate this bottleneck, previously proposed 3D-stacking near-bank computing accelerators benefit from abundant bank-internal bandwidth by bringing computations closer to the DRAM banks. However, these accelerators are specialized for certain application domains with simple architecture data paths and customized software mapping schemes. For general-purpose scenarios, lightweight hardware designs for diverse data paths, architectural supports for the SIMT programming model, and end-to-end software optimizations remain challenging. To address these issues, we propose Memory-centric Processing Unit (MPU), the first SIMT processor based on 3D-stacking near-bank computing architecture. First, to realize diverse data paths with small overheads, MPU adopts a hybrid pipeline with the capability of offloading instructions to near-bank compute-logic. Second, we explore two architectural supports for the SIMT programming model, including a near-bank shared memory design and a multiple activated row-buffers enhancement. Third, we present an end-to-end compilation flow for MPU to support CUDA programs. To fully utilize MPU’s hybrid pipeline, we develop a backend optimization for the instruction offloading decision. The evaluation results of MPU demonstrate 3.46× speedup and 2.57× energy reduction compared with an NVIDIA Tesla V100 GPU on a set of representative data-intensive workloads.
随着数据密集型工作负载的不断增加,GPU作为最先进的单指令多线程(SIMT)处理器,受到内存带宽墙的阻碍。为了缓解这一瓶颈,之前提出的3d堆叠近库计算加速器通过使计算更接近DRAM库,从丰富的库内部带宽中获益。但是,这些加速器专门用于具有简单架构数据路径和定制软件映射方案的特定应用程序域。对于通用场景,用于各种数据路径的轻量级硬件设计、SIMT编程模型的体系结构支持以及端到端软件优化仍然具有挑战性。为了解决这些问题,我们提出了内存中心处理单元(MPU),这是第一个基于3d堆叠近岸计算架构的SIMT处理器。首先,为了以较小的开销实现多种数据路径,MPU采用混合管道,具有将指令卸载到近岸计算逻辑的能力。其次,我们探讨了SIMT编程模型的两种体系结构支持,包括近银行共享内存设计和多激活行缓冲区增强。第三,我们提出了支持CUDA程序的MPU端到端编译流程。为了充分利用MPU的混合管道,我们开发了指令卸载决策的后端优化。在一组具有代表性的数据密集型工作负载上,MPU的评估结果显示,与NVIDIA Tesla V100 GPU相比,MPU的加速提升了3.46倍,能耗降低了2.57倍。
{"title":"MPU: Memory-centric SIMT Processor via In-DRAM Near-bank Computing","authors":"Xinfeng Xie, P. Gu, Yufei Ding, Dimin Niu, Hongzhong Zheng, Yuan Xie","doi":"10.1145/3603113","DOIUrl":"https://doi.org/10.1145/3603113","url":null,"abstract":"With the growing number of data-intensive workloads, GPU, which is the state-of-the-art single-instruction-multiple-thread (SIMT) processor, is hindered by the memory bandwidth wall. To alleviate this bottleneck, previously proposed 3D-stacking near-bank computing accelerators benefit from abundant bank-internal bandwidth by bringing computations closer to the DRAM banks. However, these accelerators are specialized for certain application domains with simple architecture data paths and customized software mapping schemes. For general-purpose scenarios, lightweight hardware designs for diverse data paths, architectural supports for the SIMT programming model, and end-to-end software optimizations remain challenging. To address these issues, we propose Memory-centric Processing Unit (MPU), the first SIMT processor based on 3D-stacking near-bank computing architecture. First, to realize diverse data paths with small overheads, MPU adopts a hybrid pipeline with the capability of offloading instructions to near-bank compute-logic. Second, we explore two architectural supports for the SIMT programming model, including a near-bank shared memory design and a multiple activated row-buffers enhancement. Third, we present an end-to-end compilation flow for MPU to support CUDA programs. To fully utilize MPU’s hybrid pipeline, we develop a backend optimization for the instruction offloading decision. The evaluation results of MPU demonstrate 3.46× speedup and 2.57× energy reduction compared with an NVIDIA Tesla V100 GPU on a set of representative data-intensive workloads.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"23 1","pages":"1 - 26"},"PeriodicalIF":1.6,"publicationDate":"2023-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78796740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As the volume of data processed by applications has increased, considerable attention has been paid to data address translation overheads, leading to the widespread use of larger page sizes (“superpages”) and multi-level translation lookaside buffers (TLBs). However, far less attention has been paid to instruction address translation and its relation to TLB and pipeline structure. In prior work, we quantified the impact of using code superpages on a variety of widely used applications, ranging from compilers to web user-interface frameworks, and the impact of sharing page table pages for executables and shared libraries. Within this article, we augment those results by first uncovering the effects that microarchitectural differences between Intel Skylake and AMD Zen+, particularly their different TLB organizations, have on instruction address translation overhead. This analysis provides some key insights into the microarchitectural design decisions that impact the cost of instruction address translation. First, a lower-level (level 2) TLB that has both instruction and data mappings competing for space within the same structure allows better overall performance and utilization when using code superpages. Code superpages not only reduce instruction address translation overhead but also indirectly reduce data address translation overhead. In fact, for a few applications, the use of just a few code superpages has a larger impact on overall performance than the use of a much larger number of data superpages. Second, a level 1 (L1) TLB with separate structures for different page sizes may require careful tuning of the superpage promotion policy for code, and a correspondingly suboptimal utilization of the level 2 TLB. In particular, increasing the number of superpages when the size of the L1 superpage structure is small may result in more L1 TLB misses for some applications. Moreover, on some microarchitectures, the cost of these misses can be highly variable, because replacement is delayed until all of the in-flight instructions mapped by the victim entry are retired. Hence, more superpage promotions can result in a performance regression. Finally, our findings also make a case for first-class OS support for superpages on ordinary files containing executables and shared libraries, as well as a more aggressive superpage policy for code.
{"title":"The Impact of Page Size and Microarchitecture on Instruction Address Translation Overhead","authors":"Yufeng Zhou, A. Cox, S. Dwarkadas, Xiaowan Dong","doi":"10.1145/3600089","DOIUrl":"https://doi.org/10.1145/3600089","url":null,"abstract":"As the volume of data processed by applications has increased, considerable attention has been paid to data address translation overheads, leading to the widespread use of larger page sizes (“superpages”) and multi-level translation lookaside buffers (TLBs). However, far less attention has been paid to instruction address translation and its relation to TLB and pipeline structure. In prior work, we quantified the impact of using code superpages on a variety of widely used applications, ranging from compilers to web user-interface frameworks, and the impact of sharing page table pages for executables and shared libraries. Within this article, we augment those results by first uncovering the effects that microarchitectural differences between Intel Skylake and AMD Zen+, particularly their different TLB organizations, have on instruction address translation overhead. This analysis provides some key insights into the microarchitectural design decisions that impact the cost of instruction address translation. First, a lower-level (level 2) TLB that has both instruction and data mappings competing for space within the same structure allows better overall performance and utilization when using code superpages. Code superpages not only reduce instruction address translation overhead but also indirectly reduce data address translation overhead. In fact, for a few applications, the use of just a few code superpages has a larger impact on overall performance than the use of a much larger number of data superpages. Second, a level 1 (L1) TLB with separate structures for different page sizes may require careful tuning of the superpage promotion policy for code, and a correspondingly suboptimal utilization of the level 2 TLB. In particular, increasing the number of superpages when the size of the L1 superpage structure is small may result in more L1 TLB misses for some applications. Moreover, on some microarchitectures, the cost of these misses can be highly variable, because replacement is delayed until all of the in-flight instructions mapped by the victim entry are retired. Hence, more superpage promotions can result in a performance regression. Finally, our findings also make a case for first-class OS support for superpages on ordinary files containing executables and shared libraries, as well as a more aggressive superpage policy for code.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"74 1","pages":"1 - 25"},"PeriodicalIF":1.6,"publicationDate":"2023-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73840544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jin Zhao, Yu Zhang, Ligang He, Qikun Li, Xiang-dong Zhang, Xinyu Jiang, Hui Yu, Xiaofei Liao, Hai Jin, Lin Gu, Haikun Liu, Bin He, Ji Zhang, Xianzheng Song, Lin Wang, Jun Zhou
With the increasing need for graph analysis, massive Concurrent iterative Graph Processing (CGP) jobs are usually performed on the common large-scale real-world graph. Although several solutions have been proposed, these CGP jobs are not coordinated with the consideration of the inherent dependencies in graph data driven by graph topology. As a result, they suffer from redundant and fragmented accesses of the same underlying graph dispersed over distributed platform, because the same graph is typically irregularly traversed by these jobs along different paths at the same time. In this work, we develop GraphTune, which can be integrated into existing distributed graph processing systems, such as D-Galois, Gemini, PowerGraph, and Chaos, to efficiently perform CGP jobs and enhance system throughput. The key component of GraphTune is a dependency-aware synchronous execution engine in conjunction with several optimization strategies based on the constructed cross-iteration dependency graph of chunks. Specifically, GraphTune transparently regularizes the processing behavior of the CGP jobs in a novel synchronous way and assigns the chunks of graph data to be handled by them based on the topological order of the dependency graph so as to maximize the performance. In this way, it can transform the irregular accesses of the chunks into more regular ones so that as many CGP jobs as possible can fully share the data accesses to the common graph. Meanwhile, it also efficiently synchronizes the communications launched by different CGP jobs based on the dependency graph to minimize the communication cost. We integrate it into four cutting-edge distributed graph processing systems and a popular out-of-core graph processing system to demonstrate the efficiency of GraphTune. Experimental results show that GraphTune improves the throughput of CGP jobs by 3.1∼6.2, 3.8∼8.5, 3.5∼10.8, 4.3∼12.4, and 3.8∼6.9 times over D-Galois, Gemini, PowerGraph, Chaos, and GraphChi, respectively.
{"title":"GraphTune: An Efficient Dependency-Aware Substrate to Alleviate Irregularity in Concurrent Graph Processing","authors":"Jin Zhao, Yu Zhang, Ligang He, Qikun Li, Xiang-dong Zhang, Xinyu Jiang, Hui Yu, Xiaofei Liao, Hai Jin, Lin Gu, Haikun Liu, Bin He, Ji Zhang, Xianzheng Song, Lin Wang, Jun Zhou","doi":"10.1145/3600091","DOIUrl":"https://doi.org/10.1145/3600091","url":null,"abstract":"With the increasing need for graph analysis, massive Concurrent iterative Graph Processing (CGP) jobs are usually performed on the common large-scale real-world graph. Although several solutions have been proposed, these CGP jobs are not coordinated with the consideration of the inherent dependencies in graph data driven by graph topology. As a result, they suffer from redundant and fragmented accesses of the same underlying graph dispersed over distributed platform, because the same graph is typically irregularly traversed by these jobs along different paths at the same time. In this work, we develop GraphTune, which can be integrated into existing distributed graph processing systems, such as D-Galois, Gemini, PowerGraph, and Chaos, to efficiently perform CGP jobs and enhance system throughput. The key component of GraphTune is a dependency-aware synchronous execution engine in conjunction with several optimization strategies based on the constructed cross-iteration dependency graph of chunks. Specifically, GraphTune transparently regularizes the processing behavior of the CGP jobs in a novel synchronous way and assigns the chunks of graph data to be handled by them based on the topological order of the dependency graph so as to maximize the performance. In this way, it can transform the irregular accesses of the chunks into more regular ones so that as many CGP jobs as possible can fully share the data accesses to the common graph. Meanwhile, it also efficiently synchronizes the communications launched by different CGP jobs based on the dependency graph to minimize the communication cost. We integrate it into four cutting-edge distributed graph processing systems and a popular out-of-core graph processing system to demonstrate the efficiency of GraphTune. Experimental results show that GraphTune improves the throughput of CGP jobs by 3.1∼6.2, 3.8∼8.5, 3.5∼10.8, 4.3∼12.4, and 3.8∼6.9 times over D-Galois, Gemini, PowerGraph, Chaos, and GraphChi, respectively.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"44 1","pages":"1 - 24"},"PeriodicalIF":1.6,"publicationDate":"2023-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79304491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gokul Subramanian Ravi, T. Krishna, Mikko H. Lipasti
The ideal latency for on-chip network traversal would be the delay incurred from wire traversal alone. Unfortunately, in a realistic modular network, the latency for a packet to traverse the network is significantly higher than this wire delay. The main limiter to achieving lower latency is the modular quantization of network traversal into hops. Beyond this, the physical heterogeneity in real-world systems further complicate the ability to reach ideal wire-only delay. In this work, we propose TNT or Transparent Network Traversal. TNT targets ideal network latency by attempting source to destination network traversal as a single multi-cycle ‘long-hop’, bypassing the quantization effects of intermediate routers via transparent data/information flow. TNT is built in a modular tile-scalable manner via a novel control path performing neighbor-to-neighbor interactions but enabling end-to-end transparent flit traversal. Further, TNT’s fine grained on-the-fly delay tracking allows it to cope with physical NOC heterogeneity across the chip. Analysis on Ligra graph workloads shows that TNT can reduce NOC latency by as much as 43% compared to the state of the art and allows efficiency gains up to 38%. Further, it can achieve more than 3x the benefits of the best/closest alternative research proposal, SMART [43].
{"title":"TNT: A Modular Approach to Traversing Physically Heterogeneous NOCs at Bare-wire Latency","authors":"Gokul Subramanian Ravi, T. Krishna, Mikko H. Lipasti","doi":"10.1145/3597611","DOIUrl":"https://doi.org/10.1145/3597611","url":null,"abstract":"The ideal latency for on-chip network traversal would be the delay incurred from wire traversal alone. Unfortunately, in a realistic modular network, the latency for a packet to traverse the network is significantly higher than this wire delay. The main limiter to achieving lower latency is the modular quantization of network traversal into hops. Beyond this, the physical heterogeneity in real-world systems further complicate the ability to reach ideal wire-only delay. In this work, we propose TNT or Transparent Network Traversal. TNT targets ideal network latency by attempting source to destination network traversal as a single multi-cycle ‘long-hop’, bypassing the quantization effects of intermediate routers via transparent data/information flow. TNT is built in a modular tile-scalable manner via a novel control path performing neighbor-to-neighbor interactions but enabling end-to-end transparent flit traversal. Further, TNT’s fine grained on-the-fly delay tracking allows it to cope with physical NOC heterogeneity across the chip. Analysis on Ligra graph workloads shows that TNT can reduce NOC latency by as much as 43% compared to the state of the art and allows efficiency gains up to 38%. Further, it can achieve more than 3x the benefits of the best/closest alternative research proposal, SMART [43].","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"1 1","pages":"1 - 25"},"PeriodicalIF":1.6,"publicationDate":"2023-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83415426","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}