
ACM Transactions on Architecture and Code Optimization: Latest Publications

MPU: Memory-centric SIMT Processor via In-DRAM Near-bank Computing
IF 1.6 | Zone 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-07-19 | DOI: https://dl.acm.org/doi/10.1145/3603113
Xinfeng Xie, Peng Gu, Yufei Ding, Dimin Niu, Hongzhong Zheng, Yuan Xie

With the growing number of data-intensive workloads, GPU, which is the state-of-the-art single-instruction-multiple-thread (SIMT) processor, is hindered by the memory bandwidth wall. To alleviate this bottleneck, previously proposed 3D-stacking near-bank computing accelerators benefit from abundant bank-internal bandwidth by bringing computations closer to the DRAM banks. However, these accelerators are specialized for certain application domains with simple architecture data paths and customized software mapping schemes. For general-purpose scenarios, lightweight hardware designs for diverse data paths, architectural supports for the SIMT programming model, and end-to-end software optimizations remain challenging.

To address these issues, we propose the Memory-centric Processing Unit (MPU), the first SIMT processor based on a 3D-stacking near-bank computing architecture. First, to realize diverse data paths with small overheads, MPU adopts a hybrid pipeline with the capability of offloading instructions to near-bank compute logic. Second, we explore two architectural supports for the SIMT programming model, including a near-bank shared memory design and a multiple-activated-row-buffers enhancement. Third, we present an end-to-end compilation flow for MPU to support CUDA programs. To fully utilize MPU’s hybrid pipeline, we develop a backend optimization for the instruction offloading decision. The evaluation results of MPU demonstrate a 3.46× speedup and a 2.57× energy reduction compared with an NVIDIA Tesla V100 GPU on a set of representative data-intensive workloads.
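The instruction offloading decision can be pictured as a simple cost comparison. The sketch below is a hypothetical model, not MPU's actual backend optimization: the cycle counts and the split between bank-local and remote operands are illustrative assumptions.

```python
# Hypothetical cost model for deciding whether to offload an instruction to
# near-bank compute logic or keep it on the base pipeline.
# All latency numbers are illustrative assumptions, not taken from the paper.

BASE_ACCESS_CYCLES = 40      # operand fetch over the shared bus (assumed)
NEAR_BANK_ACCESS_CYCLES = 5  # operand fetch from the local bank (assumed)
OFFLOAD_OVERHEAD_CYCLES = 8  # cost of dispatching the instruction near-bank (assumed)

def should_offload(num_bank_local_operands: int, num_remote_operands: int) -> bool:
    """Offload when the saved operand-access cycles outweigh the dispatch overhead."""
    base_cost = (num_bank_local_operands + num_remote_operands) * BASE_ACCESS_CYCLES
    offload_cost = (OFFLOAD_OVERHEAD_CYCLES
                    + num_bank_local_operands * NEAR_BANK_ACCESS_CYCLES
                    + num_remote_operands * BASE_ACCESS_CYCLES)
    return offload_cost < base_cost
```

Under these assumed numbers, a two-operand instruction with both operands resident in the local bank is worth offloading, while one whose operands are all remote is not.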

Citations: 0
The Impact of Page Size and Microarchitecture on Instruction Address Translation Overhead
IF 1.6 | Zone 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-07-19 | DOI: https://dl.acm.org/doi/10.1145/3600089
Yufeng Zhou, Alan L. Cox, Sandhya Dwarkadas, Xiaowan Dong

As the volume of data processed by applications has increased, considerable attention has been paid to data address translation overheads, leading to the widespread use of larger page sizes (“superpages”) and multi-level translation lookaside buffers (TLBs). However, far less attention has been paid to instruction address translation and its relation to TLB and pipeline structure. In prior work, we quantified the impact of using code superpages on a variety of widely used applications, ranging from compilers to web user-interface frameworks, and the impact of sharing page table pages for executables and shared libraries. Within this article, we augment those results by first uncovering the effects that microarchitectural differences between Intel Skylake and AMD Zen+, particularly their different TLB organizations, have on instruction address translation overhead. This analysis provides some key insights into the microarchitectural design decisions that impact the cost of instruction address translation. First, a lower-level (level 2) TLB that has both instruction and data mappings competing for space within the same structure allows better overall performance and utilization when using code superpages. Code superpages not only reduce instruction address translation overhead but also indirectly reduce data address translation overhead. In fact, for a few applications, the use of just a few code superpages has a larger impact on overall performance than the use of a much larger number of data superpages. Second, a level 1 (L1) TLB with separate structures for different page sizes may require careful tuning of the superpage promotion policy for code, and can lead to correspondingly suboptimal utilization of the level 2 TLB. In particular, increasing the number of superpages when the size of the L1 superpage structure is small may result in more L1 TLB misses for some applications.
Moreover, on some microarchitectures, the cost of these misses can be highly variable, because replacement is delayed until all of the in-flight instructions mapped by the victim entry are retired. Hence, more superpage promotions can result in a performance regression. Finally, our findings also make a case for first-class OS support for superpages on ordinary files containing executables and shared libraries, as well as a more aggressive superpage policy for code.
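The leverage that a few code superpages provide comes down to TLB reach arithmetic. The sketch below uses assumed entry counts for a split L1 ITLB (they are not the actual Skylake or Zen+ parameters) to show why a handful of 2 MB entries can map far more code than the entire small-page structure.

```python
def tlb_reach_bytes(entries: int, page_size: int) -> int:
    """Total address range a fully populated TLB structure can map."""
    return entries * page_size

KB, MB = 1024, 1024 * 1024

# Illustrative split-L1-ITLB sizes; entry counts are assumptions for this sketch.
reach_4k = tlb_reach_bytes(64, 4 * KB)  # 64 small-page entries -> 256 KB of code
reach_2m = tlb_reach_bytes(8, 2 * MB)   # 8 superpage entries  -> 16 MB of code
```

With these assumed sizes, eight superpage entries cover 64 times the code footprint of the whole small-page structure, which is why promoting even a few code regions to superpages can dominate the benefit of many data superpages.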

Citations: 0
Turn-based Spatiotemporal Coherence for GPUs
IF 1.6 | Zone 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-07-19 | DOI: https://dl.acm.org/doi/10.1145/3593054
Sooraj Puthoor, Mikko H. Lipasti

This article introduces turn-based spatiotemporal coherence. Spatiotemporal coherence is a novel coherence implementation that assigns write permission to epochs (or turns) as opposed to a processor core. This paradigm shift in the assignment of write permissions satisfies all conditions of a coherence protocol with virtually no coherence overhead. We discuss the implementation of this coherence mechanism on a baseline GPU. The evaluation shows that spatiotemporal coherence achieves a speedup of 7.13% for workloads with read data reuse across kernels compared to the baseline software-managed GPU coherence implementation, while also providing write atomicity and avoiding the need for software-inserted acquire-release operations.
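The core idea of assigning write permission to turns rather than cores can be sketched in a few lines. The round-robin turn schedule and the per-cluster granularity below are illustrative assumptions, not the paper's hardware design.

```python
# Minimal sketch of turn-based write permission: a write is legal only while
# the writer's cluster owns the current turn (epoch), so no per-core
# invalidation traffic is needed. The round-robin schedule is an assumption.

class TurnBasedCoherence:
    def __init__(self, num_clusters: int):
        self.num_clusters = num_clusters
        self.epoch = 0  # logical turn counter

    def owner(self) -> int:
        """The cluster holding write permission in the current turn."""
        return self.epoch % self.num_clusters

    def try_write(self, cluster: int) -> bool:
        """A write succeeds only inside the cluster's own turn."""
        return cluster == self.owner()

    def advance_turn(self) -> None:
        self.epoch += 1
```

Because permission is a function of the epoch alone, no core ever has to ask another core to relinquish a line, which is the source of the near-zero coherence overhead the abstract describes.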

Citations: 0
TNT: A Modular Approach to Traversing Physically Heterogeneous NOCs at Bare-wire Latency
IF 1.6 | Zone 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-07-19 | DOI: https://dl.acm.org/doi/10.1145/3597611
Gokul Subramanian Ravi, Tushar Krishna, Mikko Lipasti

The ideal latency for on-chip network traversal would be the delay incurred from wire traversal alone. Unfortunately, in a realistic modular network, the latency for a packet to traverse the network is significantly higher than this wire delay. The main limiter to achieving lower latency is the modular quantization of network traversal into hops. Beyond this, the physical heterogeneity of real-world systems further complicates the ability to reach the ideal wire-only delay.

In this work, we propose TNT, or Transparent Network Traversal. TNT targets ideal network latency by attempting source-to-destination network traversal as a single multi-cycle ‘long-hop’, bypassing the quantization effects of intermediate routers via transparent data/information flow. TNT is built in a modular, tile-scalable manner via a novel control path that performs neighbor-to-neighbor interactions while enabling end-to-end transparent flit traversal. Further, TNT’s fine-grained on-the-fly delay tracking allows it to cope with physical NOC heterogeneity across the chip.

Analysis on Ligra graph workloads shows that TNT can reduce NOC latency by as much as 43% compared to the state of the art and allows efficiency gains of up to 38%. Further, it can achieve more than 3× the benefits of the best/closest alternative research proposal, SMART [43].
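The latency gap between hop-quantized traversal and a TNT-style single long-hop can be illustrated with a toy model. The per-hop router, wire, and setup cycle counts below are assumptions chosen only to show the shape of the saving, not measured values.

```python
# Toy latency models: a conventional NOC pays router latency at every hop,
# while a "long-hop" pays a one-time control setup and then mostly wire delay.
# All cycle counts are illustrative assumptions.

def hop_quantized_latency(hops: int, router_cycles: int = 2, wire_cycles: int = 1) -> int:
    """Conventional traversal: every hop pays router plus wire delay."""
    return hops * (router_cycles + wire_cycles)

def long_hop_latency(hops: int, wire_cycles: int = 1, setup_cycles: int = 1) -> int:
    """Transparent traversal: one-time setup, then pure wire delay per hop."""
    return setup_cycles + hops * wire_cycles
```

For an assumed 8-hop path this gives 24 versus 9 cycles: the router component, being paid at every intermediate tile, dominates the conventional path and is exactly what bypassing intermediate routers removes.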

Citations: 0
Jointly Optimizing Job Assignment and Resource Partitioning for Improving System Throughput in Cloud Datacenters
IF 1.6 | Zone 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-07-19 | DOI: https://dl.acm.org/doi/10.1145/3593055
Ruobing Chen, Haosen Shi, Jinping Wu, Yusen Li, Xiaoguang Liu, Gang Wang

Colocating multiple jobs on the same server is widely used to improve resource utilization in cloud datacenters. However, colocated jobs contend for shared resources, which can lead to significant performance degradation. An efficient approach to eliminating performance interference is to partition the shared resources among the colocated jobs; however, this makes resource management in datacenters very challenging. In this paper, we propose JointOPT, the first resource management framework that jointly optimizes job assignment and resource partitioning to improve the throughput of cloud datacenters. JointOPT uses a local-search-based algorithm to find a near-optimal job assignment configuration, and a deep reinforcement learning (DRL) based approach to dynamically partition the shared resources among the colocated jobs. To reduce the interaction overhead with real systems, it leverages deep learning to estimate job performance without running jobs on real servers. We conduct extensive experiments to evaluate JointOPT, and the results show that JointOPT significantly outperforms the state-of-the-art baselines, with an advantage ranging from 13.3% to 47.7%.
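The job-assignment half of such a design can be sketched as a plain local search over single-job moves. The interference model below is a synthetic stand-in for JointOPT's learned performance predictor, and the move neighborhood is an assumption; the point is only the keep-any-improving-move loop.

```python
# Local search over job-to-server assignment: repeatedly try moving one job
# to another server and keep any move that improves an estimated throughput.
# The degradation-with-load model is an illustrative assumption.
import itertools

def estimated_throughput(assignment, demand):
    total = 0.0
    for server_jobs in assignment:
        load = sum(demand[j] for j in server_jobs)
        # Each job's throughput shrinks as total colocated demand grows (assumed).
        total += sum(demand[j] / (1.0 + load) for j in server_jobs)
    return total

def local_search(num_servers, demand):
    jobs = list(range(len(demand)))
    assignment = [set() for _ in range(num_servers)]
    for j in jobs:                       # naive round-robin initial placement
        assignment[j % num_servers].add(j)
    improved = True
    while improved:
        improved = False
        best = estimated_throughput(assignment, demand)
        for j, (src, dst) in itertools.product(jobs, itertools.permutations(range(num_servers), 2)):
            if j in assignment[src]:
                assignment[src].remove(j); assignment[dst].add(j)
                score = estimated_throughput(assignment, demand)
                if score > best + 1e-9:
                    best, improved = score, True  # keep the improving move
                else:
                    assignment[dst].remove(j); assignment[src].add(j)  # undo
    return assignment
```

Each accepted move strictly increases the bounded score, so the loop terminates at a local optimum; JointOPT's contribution is pairing such an assignment search with a DRL partitioner and a learned performance estimator instead of this toy model.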

Citations: 0
Cache Programming for Scientific Loops Using Leases
IF 1.6 | Zone 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-07-19 | DOI: https://dl.acm.org/doi/10.1145/3600090
Benjamin Reber, Matthew Gould, Alexander H. Kneipp, Fangzhou Liu, Ian Prechtl, Chen Ding, Linlin Chen, Dorin Patru

Cache management is important in exploiting locality and reducing data movement. This article studies a new type of programmable cache called the lease cache. By assigning leases, software exerts the primary control on when and how long data stays in the cache. Previous work has shown an optimal solution for an ideal lease cache.

This article develops and evaluates a set of practical solutions for a physical lease cache emulated in FPGA with the full suite of PolyBench benchmarks. Compared to automatic caching, lease programming can further reduce data movement by 10% to over 60% when the data size is 16 times to 3,000 times the cache size, and the techniques in this article realize over 80% of this potential. Moreover, lease programming can reduce data movement by another 0.8% to 20% after polyhedral locality optimization.
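The lease idea can be sketched as a cache whose residency is governed entirely by software-assigned leases rather than a hardware replacement policy such as LRU. The logical-clock bookkeeping and the hand-picked lease values in the usage below are illustrative assumptions, not the paper's optimal lease assignment.

```python
# Minimal sketch of a lease cache: each access carries a software-assigned
# lease (a duration in logical accesses), and a block stays cached exactly
# until its lease expires. Lease values here are illustrative assumptions.

class LeaseCache:
    def __init__(self):
        self.expiry = {}  # block -> logical time at which its lease ends
        self.clock = 0

    def access(self, block, lease: int) -> bool:
        """Return True on a hit; (re)assign the block's lease either way."""
        self.clock += 1
        hit = self.expiry.get(block, 0) > self.clock
        self.expiry[block] = self.clock + lease
        # Drop expired entries; in hardware this frees space for new blocks.
        self.expiry = {b: e for b, e in self.expiry.items() if e > self.clock}
        return hit
```

A block re-referenced within its lease hits; one whose lease has lapsed misses even if it was recently touched, which is how software (rather than an eviction heuristic) controls data movement.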

Citations: 0
ASM: An Adaptive Secure Multicore for Co-located Mutually Distrusting Processes
IF 1.6 | Zone 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-07-19 | DOI: https://dl.acm.org/doi/10.1145/3587480
Abdul Rasheed Sahni, Hamza Omar, Usman Ali, Omer Khan

With the ever-increasing virtualization of software and hardware, the privacy of user-sensitive data is a fundamental concern in computation outsourcing. Secure processors enable a trusted execution environment to guarantee security properties based on the principles of isolation, sealing, and integrity. However, the shared hardware resources within the microarchitecture are increasingly being used by co-located adversarial software to create timing-based side-channel attacks. State-of-the-art secure processors implement the strong isolation primitive to enable non-interference for shared hardware but suffer from frequent state purging and resource utilization overheads, leading to degraded performance. This article proposes ASM, an adaptive secure multicore architecture that enables a reconfigurable, yet strongly isolated execution environment. For outsourced security-critical processes, the proposed security kernel and hardware extensions allow either a given process to execute using all available cores or co-execute multiple processes on strongly isolated clusters of cores. This spatio-temporal execution environment is configured based on resource demands of processes, such that the secure processor mitigates state purging overheads and maximizes hardware resource utilization.
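The configuration choice described above, all cores to one process versus strongly isolated clusters sized per process, can be sketched with a toy sizing heuristic. The proportional rule below is an assumption standing in for the security kernel's actual policy, and it presumes at least as many cores as processes.

```python
# Illustrative cluster sizing: one security-critical process gets every core;
# multiple mutually distrusting processes get disjoint clusters sized roughly
# in proportion to their demands. The rule is an assumed heuristic, and it
# assumes total_cores >= len(demands).

def configure_clusters(total_cores: int, demands: list[int]) -> list[int]:
    if len(demands) == 1:
        return [total_cores]               # single process: use all cores
    scale = total_cores / sum(demands)     # shrink demands to fit the chip
    cores = [max(1, int(d * scale)) for d in demands]
    cores[0] += total_cores - sum(cores)   # hand any rounding leftover to one cluster
    return cores
```

Because the clusters are disjoint, no two distrusting processes share microarchitectural state, while sizing by demand keeps hardware utilization high, the trade-off the abstract targets.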

Citations: 0
GraphTune: An Efficient Dependency-Aware Substrate to Alleviate Irregularity in Concurrent Graph Processing
IF 1.6 | Zone 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-07-19 | DOI: https://dl.acm.org/doi/10.1145/3600091
Jin Zhao, Yu Zhang, Ligang He, Qikun Li, Xiang Zhang, Xinyu Jiang, Hui Yu, Xiaofei Liao, Hai Jin, Lin Gu, Haikun Liu, Bingsheng He, Ji Zhang, Xianzheng Song, Lin Wang, Jun Zhou

With the increasing need for graph analysis, massive Concurrent iterative Graph Processing (CGP) jobs are usually performed on a common large-scale real-world graph. Although several solutions have been proposed, these CGP jobs are not coordinated with consideration of the inherent dependencies in graph data driven by graph topology. As a result, they suffer from redundant and fragmented accesses to the same underlying graph dispersed over the distributed platform, because the same graph is typically traversed irregularly by these jobs along different paths at the same time.

In this work, we develop GraphTune, which can be integrated into existing distributed graph processing systems, such as D-Galois, Gemini, PowerGraph, and Chaos, to efficiently perform CGP jobs and enhance system throughput. The key component of GraphTune is a dependency-aware synchronous execution engine in conjunction with several optimization strategies based on the constructed cross-iteration dependency graph of chunks. Specifically, GraphTune transparently regularizes the processing behavior of the CGP jobs in a novel synchronous way and assigns the chunks of graph data to be handled by them based on the topological order of the dependency graph so as to maximize the performance. In this way, it can transform the irregular accesses of the chunks into more regular ones so that as many CGP jobs as possible can fully share the data accesses to the common graph. Meanwhile, it also efficiently synchronizes the communications launched by different CGP jobs based on the dependency graph to minimize the communication cost. We integrate it into four cutting-edge distributed graph processing systems and a popular out-of-core graph processing system to demonstrate the efficiency of GraphTune. Experimental results show that GraphTune improves the throughput of CGP jobs by 3.1-6.2×, 3.8-8.5×, 3.5-10.8×, 4.3-12.4×, and 3.8-6.9× over D-Galois, Gemini, PowerGraph, Chaos, and GraphChi, respectively.
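Scheduling chunks by the topological order of a dependency graph, so that concurrent jobs touch each chunk in the same regularized order and can share its accesses, can be sketched with Kahn's algorithm. The chunk dependency edges in the test are illustrative; GraphTune's actual cross-iteration dependency graph is constructed from graph topology.

```python
# Topological ordering of graph-data chunks (Kahn's algorithm): every chunk
# is scheduled only after all chunks it depends on, yielding one shared,
# regular processing order for all concurrent jobs.
from collections import deque

def chunk_schedule(num_chunks: int, deps: list[tuple[int, int]]) -> list[int]:
    """Return a processing order in which every chunk follows its prerequisites.

    deps contains (src, dst) pairs meaning dst depends on src.
    """
    succs = [[] for _ in range(num_chunks)]
    indeg = [0] * num_chunks
    for src, dst in deps:
        succs[src].append(dst)
        indeg[dst] += 1
    ready = deque(c for c in range(num_chunks) if indeg[c] == 0)
    order = []
    while ready:
        c = ready.popleft()
        order.append(c)
        for nxt in succs[c]:
            indeg[nxt] -= 1
            if indeg[nxt] == 0:
                ready.append(nxt)
    return order
```

Because all jobs consume chunks in this one order, accesses to the same chunk cluster together in time instead of being scattered along each job's own traversal path.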

Citations: 0
Hierarchical Model Parallelism for Optimizing Inference on Many-core Processor via Decoupled 3D-CNN Structure
IF 1.6 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2023-07-19 · DOI: https://dl.acm.org/doi/10.1145/3605149
Jiazhi Jiang, Zijian Huang, Dan Huang, Jiangsu Du, Lin Chen, Ziguan Chen, Yutong Lu

The tremendous success of the convolutional neural network (CNN) has made it ubiquitous in many fields of human endeavor. Many applications, such as biomedical analysis and scientific data analysis, involve analyzing volumetric data, which spawns huge demand for 3D-CNNs. Although accelerators such as GPUs may provide higher throughput for deep learning applications, they may not be available in all scenarios. The CPU, especially the many-core CPU with a non-uniform memory access (NUMA) architecture, remains an attractive choice for deep learning inference in many scenarios.

In this article, we propose a distributed inference solution for 3D-CNNs that targets the emerging ARM many-core CPU platform. A hierarchical partition approach accelerates 3D-CNN inference by exploiting the characteristics of memory and cache on the ARM many-core CPU. On top of the hierarchical model partition approach, further optimization techniques, such as NUMA-aware thread scheduling and an optimized 3D-img2row convolution, are designed to exploit the potential of the ARM many-core CPU for 3D-CNNs. We evaluate the proposed inference solution with several classic 3D-CNNs: C3D, 3D-resnet34, 3D-resnet50, 3D-vgg11, and P3D. Experimental results show that our solution boosts the performance of 3D-CNN inference and achieves much better scalability, with negligible fluctuation in accuracy. Built on the ACL library, it outperforms naive ACL implementations by 11× to 50× on an ARM many-core processor; built on the NCNN library, it outperforms naive NCNN implementations by 5.2× to 14.2×.
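The 3D-img2row convolution mentioned above is the 3D analogue of the classic im2row lowering: each kd×kh×kw patch of the input volume is flattened into one row of a matrix, so the whole convolution collapses into a single GEMM that dense linear-algebra libraries already optimize well. The paper's tuned implementation is not public; below is a minimal single-channel NumPy sketch, with unit stride and no padding as illustrative assumptions (as in CNN frameworks, "convolution" here means cross-correlation):

```python
import numpy as np

def img2row_3d(volume, kd, kh, kw, stride=1):
    """Unfold a single-channel 3D volume (D, H, W) into a matrix whose
    rows are flattened kd*kh*kw patches, so the 3D convolution reduces
    to one GEMM: rows @ flattened_kernel."""
    D, H, W = volume.shape
    od = (D - kd) // stride + 1
    oh = (H - kh) // stride + 1
    ow = (W - kw) // stride + 1
    rows = np.empty((od * oh * ow, kd * kh * kw), dtype=volume.dtype)
    r = 0
    for z in range(0, D - kd + 1, stride):
        for y in range(0, H - kh + 1, stride):
            for x in range(0, W - kw + 1, stride):
                rows[r] = volume[z:z+kd, y:y+kh, x:x+kw].ravel()
                r += 1
    return rows, (od, oh, ow)

def conv3d_via_img2row(volume, kernel, stride=1):
    """3D convolution (cross-correlation) of a volume with one kernel,
    computed through the img2row lowering above."""
    rows, out_shape = img2row_3d(volume, *kernel.shape, stride)
    return (rows @ kernel.ravel()).reshape(out_shape)
```

For multi-channel inputs the per-channel patches would be concatenated along each row and the kernels stacked into a matrix; that detail is omitted here for brevity.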

Citations: 0
SplitZNS: Towards an Efficient LSM-Tree on Zoned Namespace SSDs
IF 1.6 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2023-07-10 · DOI: 10.1145/3608476
Dong Huang, D. Feng, Qian-Qian Liu, Bo Ding, Wei Zhao, Xueliang Wei, Wei Tong
The Zoned Namespace (ZNS) Solid State Drive (SSD) is a nascent form of storage device that offers novel prospects for the Log-Structured Merge tree (LSM-tree). ZNS exposes the erase blocks in an SSD as append-only zones, enabling the LSM-tree to gain awareness of the physical layout of data. Nevertheless, an LSM-tree on ZNS SSDs necessitates Garbage Collection (GC) owing to the mismatch between the gigantic zones and the relatively small Sorted String Tables (SSTables). Through extensive experiments, we observe that a smaller zone size can reduce data migration in GC, at the cost of a significant performance decline caused by inadequate parallelism exploitation. In this article, we present SplitZNS, which introduces small zones by tweaking the zone-to-chip mapping to maximize GC efficiency for LSM-trees on ZNS SSDs. Following the multi-level peculiarity of the LSM-tree and the inherent parallel architecture of ZNS SSDs, we propose a number of techniques to leverage and accelerate small zones and alleviate the performance impact of underutilized parallelism: (1) we use small zones selectively, to prevent their suboptimal performance from exacerbating write slowdowns and stalls; (2) to enhance parallelism utilization, we propose the SubZone Ring, which employs a per-chip FIFO buffer to imitate a large-zone write style; (3) the Read Prefetcher, which prefetches data concurrently through multiple chips during compactions; and (4) the Read Scheduler, which assigns query requests the highest priority. We build a prototype integrated with SplitZNS to validate its efficiency and efficacy. Experimental results demonstrate that SplitZNS achieves up to 2.77× performance and reduces data migration considerably compared to lifetime-based data placement.
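SplitZNS's key move — pinning each small zone to a single flash chip, while a conventional zone stripes across every chip, and then restoring write parallelism with a per-chip round-robin writer (the SubZone Ring) — can be modeled in a few lines. This is a toy functional model, not the paper's firmware-level design; the mapping function, the `small_zone_start` threshold, and the list-backed zones are all illustrative assumptions:

```python
from collections import deque

def zone_to_chips(zone_id, num_chips, small_zone_start):
    """Toy zone-to-chip mapping: zones below small_zone_start are
    conventional zones striped over all chips; the rest are small
    zones, each pinned to one chip, so GC of a small zone disturbs
    only that chip."""
    if zone_id < small_zone_start:
        return list(range(num_chips))                  # full stripe
    return [(zone_id - small_zone_start) % num_chips]  # single chip

class SubZoneRing:
    """Round-robin writer over one small zone per chip: consecutive
    appends land on consecutive chips, imitating the striped write
    pattern (and hence the write parallelism) of a large zone."""

    def __init__(self, chip_zones):
        self.ring = deque(chip_zones)  # one append-only buffer per chip

    def append(self, block):
        self.ring[0].append(block)  # write to the current chip's zone
        self.ring.rotate(-1)        # next append targets the next chip
```

With four chips, eight consecutive appends leave blocks 0 and 4 on chip 0, 1 and 5 on chip 1, and so on — the same placement a four-way striped large zone would produce, while each small zone remains independently collectable.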
Citations: 0