Improving Utilization of Dataflow Unit for Multi-Batch Processing
Zhihua Fan, Wenming Li, Zhen Wang, Yu Yang, Xiaochun Ye, Dongrui Fan, Ninghui Sun, Xuejun An
Dataflow architectures can achieve much better performance and higher efficiency than general-purpose cores, approaching the performance of a specialized design while retaining programmability. However, advanced application scenarios place higher demands on the hardware in terms of cross-domain and multi-batch processing. In this paper, we propose a unified scale-vector architecture that can work in multiple modes and adapt efficiently to diverse algorithms and requirements. First, we propose a novel reconfigurable interconnection structure that can organize execution units into different cluster topologies to accommodate different degrees of data-level parallelism. Second, we decouple the threads within each DFG (dataflow graph) node into consecutive pipeline stages and provide architectural support for this decoupling. By time-multiplexing across these stages, the dataflow hardware achieves much higher utilization and performance. In addition, the task-based programming model can exploit multi-level parallelism and deploy applications efficiently. Evaluated on a wide range of benchmarks, including digital signal processing algorithms, CNNs, and scientific computing algorithms, our design attains up to 11.95× better energy efficiency (performance per watt) than a GPU (V100) and 2.01× better energy efficiency than state-of-the-art dataflow architectures.
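As a concrete illustration of the stage-level time-multiplexing described above, the following minimal C++ sketch models a software pipeline in which consecutive batches occupy consecutive DFG-node stages each cycle, so every stage unit stays busy once the pipeline fills. The three-stage load/compute/store split is an illustrative assumption, not the paper's exact decomposition.

```cpp
// Minimal host-side model of stage-level time-multiplexing across batches.
#include <cstdio>

enum Stage { LOAD = 0, COMPUTE = 1, STORE = 2, NUM_STAGES = 3 };

int main() {
    const int kBatches = 4;
    // At cycle t, batch b occupies stage (t - b) if 0 <= t - b < NUM_STAGES,
    // so different batches keep all stage units of a DFG node busy at once.
    for (int t = 0; t < kBatches + NUM_STAGES - 1; ++t) {
        printf("cycle %d:", t);
        for (int b = 0; b < kBatches; ++b) {
            int s = t - b;
            if (s >= 0 && s < NUM_STAGES)
                printf("  batch%d->stage%d", b, s);
        }
        printf("\n");
    }
    return 0;
}
```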
{"title":"Improving Utilization of Dataflow Unit for Multi-Batch Processing.","authors":"Zhihua Fan, Wenming Li, Zhen Wang, Yu Yang, Xiaochun Ye, Dongrui Fan, Ninghui Sun, Xuejun An","doi":"10.1145/3637906","DOIUrl":"https://doi.org/10.1145/3637906","url":null,"abstract":"<p>Dataflow architectures can achieve much better performance and higher efficiency than general-purpose core, approaching the performance of a specialized design while retaining programmability. However, advanced application scenarios place higher demands on the hardware in terms of cross-domain and multi-batch processing. In this paper, we propose a unified scale-vector architecture that can work in multiple modes and adapt to diverse algorithms and requirements efficiently. First, a novel reconfigurable interconnection structure is proposed, which can organize execution units into different cluster typologies as a way to accommodate different data-level parallelism. Second, we decouple threads within each DFG node into consecutive pipeline stages and provide architectural support. By time-multiplexing during these stages, dataflow hardware can achieve much higher utilization and performance. In addition, the task-based program model can also exploit multi-level parallelism and deploy applications efficiently. Evaluated in a wide range of benchmarks, including digital signal processing algorithms, CNNs, and scientific computing algorithms, our design attains up to 11.95 × energy efficiency (performance-per-watt) improvement over GPU (V100), and 2.01 × energy efficiency improvement over state-of-the-art dataflow architectures.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"4 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138716786","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
WA-Zone: Wear-Aware Zone Management Optimization for LSM-Tree on ZNS SSDs
Linbo Long, Shuiyong He, Jingcheng Shen, Renping Liu, Zhenhua Tan, Congming Gao, Duo Liu, Kan Zhong, Yi Jiang
ZNS SSDs divide the storage space into sequential-write zones, reducing the costs of DRAM utilization, garbage collection, and over-provisioning. The sequential-write feature of zones is well-suited for LSM-based databases, where random writes are organized into sequential writes to improve performance. However, the current compaction mechanism of LSM-trees results in widely varying access frequencies (i.e., hotness) of data and thus an extreme imbalance in the distribution of erase counts across zones. This imbalance significantly limits the lifetime of SSDs. Moreover, the current zone-reset method performs a large number of unnecessary erase operations on unused blocks, further shortening SSD lifetime.
Considering the access pattern of the LSM-tree, this paper proposes a wear-aware zone-management technique, termed WA-Zone, to effectively balance inter- and intra-zone wear in ZNS SSDs. In WA-Zone, a wear-aware zone allocator is first proposed to dynamically allocate data of different hotness to zones with corresponding remaining lifetimes, enabling an even distribution of erase counts across zones. Then, a partial-erase-based zone-reset method is presented to avoid unnecessary erase operations. Furthermore, because the novel zone-reset method might lead to an unbalanced distribution of erase counts across the blocks within a zone, a wear-aware block allocator is proposed. Experimental results based on the FEMU emulator demonstrate that WA-Zone extends ZNS-SSD lifetime by 5.23× compared with the baseline scheme.
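A minimal sketch of the wear-aware allocation idea, under the assumption that hotness is available as a normalized score and that hot (soon-to-be-invalidated) data should go to the least-worn zones; the Zone fields and pick_zone interface are hypothetical, not WA-Zone's actual implementation.

```cpp
// Hedged sketch of a wear-aware zone allocator in the spirit of WA-Zone:
// hot data lands in zones with the most remaining lifetime, cold data in
// the most worn zones, evening out erase counts over time.
#include <algorithm>
#include <cstdint>
#include <vector>

struct Zone {
    uint32_t id;
    uint32_t erase_count;  // erasures performed on this zone so far
};

// Pick a zone for data whose hotness is normalized to [0, 1]:
// hotness 1.0 -> least-worn zone, hotness 0.0 -> most-worn zone.
// Precondition: free_zones is non-empty.
uint32_t pick_zone(std::vector<Zone>& free_zones, double hotness) {
    std::sort(free_zones.begin(), free_zones.end(),
              [](const Zone& a, const Zone& b) {
                  return a.erase_count < b.erase_count;
              });
    size_t idx = static_cast<size_t>((1.0 - hotness) * (free_zones.size() - 1));
    return free_zones[idx].id;
}
```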
{"title":"WA-Zone: Wear-Aware Zone Management Optimization for LSM-Tree on ZNS SSDs","authors":"Linbo Long, Shuiyong He, Jingcheng Shen, Renping Liu, Zhenhua Tan, Congming Gao, Duo Liu, Kan Zhong, Yi Jiang","doi":"10.1145/3637488","DOIUrl":"https://doi.org/10.1145/3637488","url":null,"abstract":"<p>ZNS SSDs divide the storage space into sequential-write zones, reducing costs of DRAM utilization, garbage collection, and over-provisioning. The sequential-write feature of zones is well-suited for LSM-based databases, where random writes are organized into sequential writes to improve performance. However, the current compaction mechanism of LSM-tree results in widely varying access frequencies (i.e., hotness) of data and thus incurs an extreme imbalance in the distribution of erasure counts across zones. The imbalance significantly limits the lifetime of SSDs. Moreover, the current zone-reset method involves a large number of unnecessary erase operations on unused blocks, further shortening the SSD lifetime. </p><p>Considering the access pattern of LSM-tree, this paper proposes a wear-aware zone-management technique, termed <i>WA-Zone</i>, to effectively balance inter- and intra-zone wear in ZNS SSDs. In WA-Zone, a wear-aware zone allocator is first proposed to dynamically allocate data with different hotness to zones with corresponding lifetimes, enabling an even distribution of the erasure counts across zones. Then, a partial-erase-based zone-reset method is presented to avoid unnecessary erase operations. Furthermore, because the novel zone-reset method might lead to an unbalanced distribution of erasure counts across blocks in a zone, a wear-aware block allocator is proposed. Experimental results based on the <i>FEMU</i> emulator demonstrate the proposed WA-Zone enhances the ZNS-SSD lifetime by 5.23 ×, compared with the baseline scheme.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"104 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138631935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
WIPE: a Write-Optimized Learned Index for Persistent Memory
Zhonghua Wang, Chen Ding, Fengguang Song, Kai Lu, Jiguang Wan, Zhihu Tan, Changsheng Xie, Guokuan Li
Learned indexes, which utilize machine learning models to accelerate locating positions in sorted data, have gained increasing attention in many big-data scenarios. Using efficient learned models, learned indexes build large nodes and flat structures, thereby greatly improving performance. However, most state-of-the-art learned indexes are designed for DRAM, and there is hence an urgent need for high-performance learned indexes on emerging Non-Volatile Memory (NVM). In this paper, we first evaluate and analyze the performance of existing learned indexes on NVM. We discover that these learned indexes encounter severe write amplification and write-performance degradation due to the requirement of maintaining large sorted/semi-sorted data nodes. To tackle these problems, we propose WIPE, a write-optimized persistent learned index with a novel three-tiered architecture that adopts unsorted fine-granularity data nodes to achieve high write performance on NVM. Within this design, we devise a new root-node construction algorithm to accelerate searching the numerous small data nodes. The algorithm ensures a stable flat structure and high read performance on large datasets by introducing an intermediate layer (i.e., index nodes) and achieving accurate prediction of index-node positions from the root node. Our extensive experiments on Intel DCPMM show that WIPE improves write throughput and read throughput by up to 3.9× and 7×, respectively, compared to state-of-the-art learned indexes. Also, WIPE can recover from a system crash in ∼18 ms. WIPE is freely available as an open-source software package.
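To make the learned-index mechanism concrete, here is a minimal sketch of model-predicted lookup with a bounded local search; the linear model form and fixed error bound are illustrative assumptions, and WIPE's three-tiered layout and NVM-specific details are omitted.

```cpp
// Core learned-index idea: a model predicts a key's position in a sorted
// array, then a short local search around the prediction corrects it.
#include <algorithm>
#include <vector>

struct LinearModel {
    double slope, intercept;  // trained offline so that pos ~ slope*key + intercept
    long predict(double key) const {
        return static_cast<long>(slope * key + intercept);
    }
};

// Look up 'key' in sorted 'keys', scanning at most ±err around the guess.
long lookup(const std::vector<double>& keys, const LinearModel& m,
            double key, long err) {
    const long n = static_cast<long>(keys.size());
    if (n == 0) return -1;
    long guess = std::min(std::max(m.predict(key), 0L), n - 1);
    for (long i = std::max(0L, guess - err);
         i <= std::min(n - 1, guess + err); ++i)
        if (keys[i] == key) return i;
    return -1;  // not present within the model's error bound
}
```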
{"title":"WIPE: a Write-Optimized Learned Index for Persistent Memory","authors":"Zhonghua Wang, Chen Ding, Fengguang Song, Kai Lu, Jiguang Wan, Zhihu Tan, Changsheng Xie, Guokuan Li","doi":"10.1145/3634915","DOIUrl":"https://doi.org/10.1145/3634915","url":null,"abstract":"<p>Learned Index, which utilizes effective machine learning models to accelerate locating sorted data positions, has gained increasing attention in many big data scenarios. Using efficient learned models, the learned indexes build large nodes and flat structures, thereby greatly improving the performance. However, most of the state-of-the-art learned indexes are designed for DRAM, and there is hence an urgent need to enable high-performance learned indexes for emerging Non-Volatile Memory (NVM). In this paper, we first evaluate and analyze the performance of the existing learned indexes on NVM. We discover that these learned indexes encounter severe write amplification and write performance degradation due to the requirements of maintaining large sorted/semi-sorted data nodes. To tackle the problems, we propose a novel three-tiered architecture of write-optimized persistent learned index, which is named <i>WIPE</i>, by adopting unsorted fine-granularity data nodes to achieve high write performance on NVM. Thereinto, we devise a new root node construction algorithm to accelerate searching numerous small data nodes. The algorithm ensures stable flat structure and high read performance in large-size datasets by introducing an intermediate layer (i.e., index nodes) and achieving accurate prediction of index node positions from the root node. Our extensive experiments on Intel DCPMM show that WIPE can improve write throughput and read throughput by up to 3.9 × and 7 ×, respectively, compared to the state-of-the-art learned indexes. Also, WIPE can recover from a system crash in ∼ 18<i>ms</i>. WIPE is free as an open-source software package<sup>1</sup>.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"77 1 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138529935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Highly Efficient Self-Checking Matrix Multiplication on Tiled AMX Accelerators
Chandra Sekhar Mummidi, Victor C. Ferreira, Sudarshan Srinivasan, Sandip Kundu
General Matrix Multiplication (GEMM) is a computationally expensive operation used in many applications such as machine learning. Hardware accelerators are increasingly popular for speeding up GEMM computation, with Tiled Matrix Multiplication (TMUL) in recent Intel processors being an example. Unfortunately, the TMUL hardware is susceptible to errors, necessitating online error detection. Algorithm-Based Error Detection (ABED) is a powerful technique for detecting errors in matrix multiplication. In this paper, we consider an implementation of ABED that integrates seamlessly with the TMUL hardware to minimize performance overhead. Unfortunately, rounding errors introduced by floating-point operations do not allow a straightforward implementation of ABED in TMUL. Previous work addressed rounding errors in ABED with a fixed error bound: if the error-detection threshold is set too low, it triggers false alarms, while a loose bound allows errors to escape detection. In this paper, we propose an adaptive error threshold that takes the TMUL input values into account to address the problems of false triggers and error escapes, and we provide a taxonomy of the various error classes. This threshold is obtained from theoretical error analysis but is not easy to implement in hardware; consequently, we relax the threshold so that it can be computed easily in hardware. While ABED ensures error-free computation, it does not guarantee full coverage of all hardware faults. To address this problem, we propose an algorithmic pattern-generation technique to ensure full coverage of all hardware faults. To evaluate the benefits of our proposed solution, we conducted fault-injection experiments and show that our approach produces no false alarms and no detection escapes for observable errors. Additional fault-injection experiments on a Deep Neural Network (DNN) model show that when a fault is not detected, it does not cause any misclassification.
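The checksum invariant that ABED exploits is that the column sums of C = A×B must equal colsum(A)·B. The sketch below checks this invariant with a relative tolerance; the tolerance formula is a simple stand-in for the paper's adaptive, input-dependent threshold, and the scalar GEMM stands in for the TMUL tile computation.

```cpp
// Checksum-verified GEMM: verify sum_i C[i][j] == sum_k colsum_k(A) * B[k][j]
// for every column j, allowing for floating-point rounding.
#include <cmath>
#include <vector>

bool checked_gemm(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, int n, float rel_tol) {
    // Plain n x n GEMM (stands in for the accelerated tile computation).
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float acc = 0.f;
            for (int k = 0; k < n; ++k) acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
    // Checksum test, column by column.
    for (int j = 0; j < n; ++j) {
        float direct = 0.f, via_checksum = 0.f, magnitude = 0.f;
        for (int i = 0; i < n; ++i) direct += C[i * n + j];
        for (int k = 0; k < n; ++k) {
            float colA = 0.f;
            for (int i = 0; i < n; ++i) colA += A[i * n + k];
            float term = colA * B[k * n + j];
            via_checksum += term;
            magnitude += std::fabs(term);  // scale for the relative tolerance
        }
        if (std::fabs(direct - via_checksum) > rel_tol * magnitude)
            return false;  // fault detected in column j
    }
    return true;
}
```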
{"title":"Highly Efficient Self-Checking Matrix Multiplication on Tiled AMX Accelerators","authors":"Chandra Sekhar Mummidi, Victor C. Ferreira, Sudarshan Srinivasan, Sandip Kundu","doi":"10.1145/3633332","DOIUrl":"https://doi.org/10.1145/3633332","url":null,"abstract":"<p>General Matrix Multiplication (GEMM) is a computationally expensive operation that is used in many applications such as machine-learning. Hardware accelerators are increasingly popular for speeding up GEMM computation, with Tiled Matrix Multiplication (TMUL) in recent Intel processors being an example. Unfortunately, the TMUL hardware is susceptible to errors necessitating online error detection. Algorithm-based Error Detection techniques (ABED) is a powerful technique to detect errors in matrix multiplications. In this paper, we consider implementation of ABED that integrates seamlessly with the TMUL hardware to minimize performance overhead. Unfortunately, rounding errors introduced by floating-point operations do not allow a straightforward implementation of ABED in TMUL. Previously an error bound was considered for addressing rounding errors in ABED. If the error detection threshold is set too low, it will trigger false alarm while a loose bound will allow errors to escape detection. In this paper, we propose an adaptive error threshold that takes into account the TMUL input values to address the problem of false triggers and error escapes, and provide a taxonomy of various error classes. This threshold is obtained from theoretical error analysis but is not easy to implement in hardware. Consequently, we relax the threshold such that it can be easily computed in hardware. While ABED ensures error free computation it does not guarantee full coverage of all hardware faults. To address this problem, we propose an algorithmic pattern-generation technique to ensure full coverage for all hardware faults. To evaluate the benefits of our proposed solution, we conducted fault injection experiments and show that our approach does not produce any false alarms or detection escapes for observable errors. We conducted additional fault injection experiments on a Deep Neural Network (DNN) model and find that if a fault is not detected, it does not cause any misclassification.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"14 2 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138529824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abakus: Accelerating k-mer Counting With Storage Technology
Lingxi Wu, Minxuan Zhou, Weihong Xu, Ashish Venkat, Tajana Rosing, Kevin Skadron
This work seeks to leverage Processing-with-Storage Technology (PWST) to accelerate a key bioinformatics kernel called k-mer counting, which involves processing large files of sequence data on disk to build a histogram of fixed-size genome-sequence substrings and thereby entails prohibitively high I/O overhead. In particular, this work proposes a set of accelerator designs called Abakus that offer varying tradeoffs in terms of performance, efficiency, and hardware implementation complexity. The key to these designs is a set of domain-specific hardware extensions that accelerate the key operations of k-mer counting at various levels of the SSD hierarchy, with the goal of enhancing the limited computing capabilities of conventional SSDs while exploiting the parallelism of multi-channel, multi-way SSDs. Our evaluation suggests that Abakus achieves 8.42×, 6.91×, and 2.32× speedups over CPU-, GPU-, and near-data-processing solutions, respectively.
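For reference, the core hash-and-increment loop that Abakus offloads into the storage hierarchy looks like the host-side sketch below; the std::unordered_map is an illustrative stand-in for its in-storage counting structures.

```cpp
// Baseline k-mer counting: slide a window of length k over each read and
// build a histogram of the substrings.
#include <string>
#include <unordered_map>
#include <vector>

std::unordered_map<std::string, long>
count_kmers(const std::vector<std::string>& reads, size_t k) {
    std::unordered_map<std::string, long> hist;
    for (const std::string& r : reads)
        for (size_t i = 0; i + k <= r.size(); ++i)
            ++hist[r.substr(i, k)];  // the hot loop Abakus accelerates in-SSD
    return hist;
}
```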
{"title":"Abakus: Accelerating k-mer Counting With Storage Technology","authors":"Lingxi Wu, Minxuan Zhou, Weihong Xu, Ashish Venkat, Tajana Rosing, Kevin Skadron","doi":"10.1145/3632952","DOIUrl":"https://doi.org/10.1145/3632952","url":null,"abstract":"<p>This work seeks to leverage Processing-with-storage-technology (PWST) to accelerate a key bioinformatics kernel called <i>k</i>-mer counting, which involves processing large files of sequence data on the disk to build a histogram of fixed-size genome sequence substrings and thereby entails prohibitively high I/O overhead. In particular, this work proposes a set of accelerator designs called Abakus that offer varying degrees of tradeoffs in terms of performance, efficiency, and hardware implementation complexity. The key to these designs is a set of domain-specific hardware extensions to accelerate the key operations for <i>k</i>-mer counting at various levels of the SSD hierarchy, with the goal of enhancing the limited computing capabilities of conventional SSDs, while exploiting the parallelism of the multi-channel, multi-way SSDs. Our evaluation suggests that Abakus can achieve 8.42 ×, 6.91 ×, and 2.32 × speedup over the CPU-, GPU-, and near-data processing solutions.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"55 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138529837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Coherence Attacks and Countermeasures in Interposer-Based Chiplet Systems
Gino A. Chacon, Charles Williams, Johann Knechtel, Ozgur Sinanoglu, Paul V. Gratz, Vassos Soteriou
Industry is moving towards large-scale hardware systems that bundle processor cores, memories, accelerators, etc. via 2.5D integration. These components are fabricated separately as chiplets and then integrated using an interposer as an interconnect carrier. This new design style is beneficial in terms of yield and economies of scale, as chiplets may come from various vendors and are relatively easy to integrate into one larger, sophisticated system. However, the benefits of this approach come at the cost of new security challenges, especially when integrating chiplets from untrusted or not fully trusted third-party vendors.
In this work, we explore these challenges for modern interposer-based systems of cache-coherent, multi-core chiplets. First, we present basic coherence-oriented hardware Trojan attacks and demonstrate how they can be orchestrated to pose a significant threat to interposer-based systems. Second, we propose a novel scheme using an active interposer as a generic, secure-by-construction platform that forms a physical root of trust for modern 2.5D systems. The implementation of our scheme is confined to the interposer, resulting in little cost and leaving the chiplets and the coherence system untouched. We show that our scheme prevents a range of coherence attacks with low overheads on system performance (∼4%). Further, we demonstrate that our scheme scales efficiently as system size and memory capacity increase, resulting in reduced performance overheads.
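A heavily simplified sketch of the root-of-trust idea: the interposer checks each coherence request passing through it against a per-chiplet table of permitted address ranges and drops anything out of range. The single-range-per-chiplet table and interposer_allows interface are hypothetical simplifications, not the paper's actual mechanism.

```cpp
// Interposer-side filtering of coherence requests, by chiplet id.
#include <cstdint>
#include <vector>

struct Range { uint64_t base, limit; };  // [base, limit) permitted for a chiplet

// One permitted physical range per chiplet id (index into the table).
// A request outside its issuer's range is treated as malicious and dropped.
bool interposer_allows(const std::vector<Range>& table,
                       unsigned chiplet_id, uint64_t addr) {
    if (chiplet_id >= table.size()) return false;  // unknown chiplet
    const Range& r = table[chiplet_id];
    return addr >= r.base && addr < r.limit;
}
```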
{"title":"Coherence Attacks and Countermeasures in Interposer-Based Chiplet Systems","authors":"Gino A. Chacon, Charles Williams, Johann Knechtel, Ozgur Sinanoglu, Paul V. Gratz, Vassos Soteriou","doi":"10.1145/3633461","DOIUrl":"https://doi.org/10.1145/3633461","url":null,"abstract":"<p>Industry is moving towards large-scale hardware systems which bundle processor cores, memories, accelerators, etc. via 2.5D integration. These components are fabricated separately as chiplets and then integrated using an interposer as an interconnect carrier. This new design style is beneficial in terms of yield and economies of scale, as chiplets may come from various vendors and are relatively easy to integrate into one larger sophisticated system. However, the benefits of this approach come at the cost of new security challenges, especially when integrating chiplets that come from untrusted or not fully trusted, third- party vendors. </p><p>In this work, we explore these challenges for modern interposer-based systems of cache-coherent, multi-core chiplets. First, we present basic coherence-oriented hardware Trojan attacks that pose a significant threat to chiplet-based designs and demonstrate how these basic attacks can be orchestrated to pose a significant threat to interposer-based systems. Second, we propose a novel scheme using an active interposer as a generic, secure-by-construction platform that forms a physical root of trust for modern 2.5D systems. The implementation of our scheme is confined to the interposer, resulting in little cost and leaving the chiplets and coherence system untouched. We show that our scheme prevents a range of coherence attacks with low overheads on system performance, ∼ 4%. Further, we demonstrate that our scheme scales efficiently as system size and memory capacities increase, resulting in reduced performance overheads.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"75 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138529850","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
COWS for High Performance: Cost Aware Work Stealing for Irregular Parallel Loops
Prasoon Mishra, V. Krishna Nandivada
Parallel libraries such as OpenMP distribute the iterations of parallel-for loops among the threads using a programmer-specified scheduling policy. While the existing scheduling policies perform reasonably well in the context of balanced workloads, in computations that involve highly imbalanced workloads it is extremely non-trivial to obtain an efficient distribution of work (even using non-static scheduling methods like dynamic and guided). In this paper, we present a scheme called COst aware Work Stealing (COWS) to efficiently extend the idea of work-stealing to OpenMP.
In contrast to traditional work-stealing schedulers, COWS takes into consideration that (i) not all iterations of a parallel-for loop take the same amount of time; (ii) identifying a suitable victim for stealing is important for load balancing; and (iii) queues lead to significant overheads in traditional work-stealing and should be avoided. We present two variations of COWS: WSRI (a naive work-stealing scheme based on the number of remaining iterations) and WSRW (a work-stealing scheme based on the amount of remaining workload). Since in irregular loops, like those found in graph analytics, it is not possible to statically compute the cost of the iterations of a parallel-for loop, we use a combined compile-time + runtime approach, in which the remaining workload of a loop is computed efficiently at runtime using code generated by our compile-time component. We evaluated seven benchmark programs with five input datasets on two hardware platforms, across a varying number of threads, for a total of 275 configurations. In 225 of the 275 configurations, our approach achieves clear performance gains over the best OpenMP scheduling scheme for that configuration.
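A minimal sketch of WSRW-style victim selection, assuming per-thread counters that approximate remaining workload (standing in for the compiler-assisted workload estimates COWS computes at runtime): the thief steals from the thread with the most remaining work rather than from a random victim.

```cpp
// Workload-aware victim selection for work stealing.
#include <array>
#include <atomic>

constexpr int kThreads = 8;
// Updated by each worker as it retires iterations; approximate is fine.
std::array<std::atomic<long>, kThreads> remaining;

int pick_victim(int thief) {
    int victim = -1;
    long best = 0;
    for (int t = 0; t < kThreads; ++t) {
        if (t == thief) continue;
        long w = remaining[t].load(std::memory_order_relaxed);
        if (w > best) { best = w; victim = t; }  // largest remaining workload
    }
    return victim;  // -1 means nothing worth stealing
}
```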
{"title":"COWS for High Performance: Cost Aware Work Stealing for Irregular Parallel Loops: ACM Transactions on Architecture and Code Optimization: Vol 0, No ja","authors":"Prasoon Mishra, V. Krishna Nandivada","doi":"10.1145/3633331","DOIUrl":"https://doi.org/10.1145/3633331","url":null,"abstract":"<p>Parallel libraries such as OpenMP distribute the iterations of parallel-for-loops among the threads, using a programmer-specified scheduling policy. While the existing scheduling policies perform reasonably well in the context of balanced workloads, in computations that involve highly imbalanced workloads it is extremely non-trivial to obtain an efficient distribution of work (even using non-static scheduling methods like dynamic and guided). In this paper, we present a scheme called COst aware Work Stealing (COWS) to efficiently extend the idea of work-stealing to OpenMP. </p><p>In contrast to the traditional work-stealing schedulers, COWS takes into consideration that (i) not all iterations of a parallel-for-loops may take the same amount of time. (ii) identifying a suitable victim for stealing is important for load-balancing, and (iii) queues lead to significant overheads in traditional work-stealing and should be avoided. We present two variations of COWS: WSRI (a naive work-stealing scheme based on the number of remaining iterations) and WSRW (work-stealing scheme based on the amount of remaining workload). Since in irregular loops like those found in graph analytics, it is not possible to statically compute the cost of the iterations of the parallel-for-loops, we use a combined compile-time + runtime approach, where the remaining workload of a loop is computed efficiently at runtime by utilizing the code generated by our compile-time component. We have performed an evaluation over seven different benchmark programs, using five different input datasets, on two different hardware across a varying number of threads; leading to a total of 275 number of configurations. We show that in 225 out of 275 configurations, compared to the best OpenMP scheduling scheme for that configuration, our approach achieves clear performance gains.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"59 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138529848","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast Convolution Meets Low Precision: Exploring Efficient Quantized Winograd Convolution on Modern CPUs
Xueying Wang, Guangli Li, Zhen Jia, Xiaobing Feng, Yida Wang
Low-precision computation has emerged as one of the most effective techniques for accelerating convolutional neural networks and has garnered widespread support on modern hardware. Despite this effectiveness, low-precision computation has not been commonly applied to fast convolution algorithms, such as the Winograd algorithm, due to numerical issues. In this paper, we propose an effective quantized Winograd convolution, named LoWino, which employs an in-side quantization method in the Winograd domain to reduce the precision loss caused by the transformations. Meanwhile, we present an efficient implementation that integrates well-designed optimization techniques, allowing us to fully exploit the capabilities of low-precision computation on modern CPUs. We evaluate LoWino on two Intel Xeon Scalable Processor platforms with representative convolutional layers and neural network models. The experimental results demonstrate that our approach achieves average operator speedups of 1.84× and 1.91× over state-of-the-art implementations in the vendor library while keeping accuracy loss at a reasonable level.
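As a worked example of the Winograd algorithm the paper builds on, the sketch below computes 1-D F(2,3): two outputs of a 3-tap filter with four multiplies instead of six. The element-wise products in the Winograd domain are where an in-side scheme like LoWino's would quantize; the float arithmetic here only illustrates the transforms, not the paper's quantization.

```cpp
// Winograd F(2,3): y = A^T [(G g) . (B^T d)] with the standard matrices,
// written out element-wise for clarity.
#include <cstdio>

int main() {
    float d[4] = {1.f, 2.f, 3.f, 4.f};    // input tile
    float g[3] = {0.5f, 0.25f, 0.125f};   // 3-tap filter
    // Input transform (B^T d) and filter transform (G g).
    float dt[4] = {d[0] - d[2], d[1] + d[2], d[2] - d[1], d[1] - d[3]};
    float gt[4] = {g[0], 0.5f * (g[0] + g[1] + g[2]),
                   0.5f * (g[0] - g[1] + g[2]), g[2]};
    // Element-wise products in the Winograd domain (quantization point).
    float m[4];
    for (int i = 0; i < 4; ++i) m[i] = dt[i] * gt[i];
    // Output transform (A^T m).
    float y0 = m[0] + m[1] + m[2];
    float y1 = m[1] - m[2] - m[3];
    printf("winograd: %f %f\n", y0, y1);
    printf("direct:   %f %f\n",                    // same result, 6 multiplies
           d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
           d[1]*g[0] + d[2]*g[1] + d[3]*g[2]);
    return 0;
}
```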
{"title":"Fast Convolution Meets Low Precision: Exploring Efficient Quantized Winograd Convolution on Modern CPUs","authors":"Xueying Wang, Guangli Li, Zhen Jia, Xiaobing Feng, Yida Wang","doi":"10.1145/3632956","DOIUrl":"https://doi.org/10.1145/3632956","url":null,"abstract":"<p>Low-precision computation has emerged as one of the most effective techniques for accelerating convolutional neural networks and has garnered widespread support on modern hardware. Despite its effectiveness in accelerating convolutional neural networks, low-precision computation has not been commonly applied to fast convolutions, such as the Winograd algorithm, due to numerical issues. In this paper, we propose an effective quantized Winograd convolution, named LoWino, which employs an in-side quantization method in the Winograd domain to reduce the precision loss caused by transformations. Meanwhile, we present an efficient implementation that integrates well-designed optimization techniques, allowing us to fully exploit the capabilities of low-precision computation on modern CPUs. We evaluate LoWino on two Intel Xeon Scalable Processor platforms with representative convolutional layers and neural network models. The experimental results demonstrate that our approach can achieve an average of 1.84 × and 1.91 × operator speedups over state-of-the-art implementations in the vendor library while preserving accuracy loss at a reasonable level.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"9 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138529933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
QoS-pro: A QoS-enhanced Transaction Processing Framework for Shared SSDs
Hao Fan, Yiliang Ye, Shadi Ibrahim, Zhuo Huang, Xingru Li, WeiBin Xue, Song Wu, Chen Yu, Xuanhua Shi, Hai Jin
Solid State Drives (SSDs) are widely used in data-intensive scenarios due to their high performance and decreasing cost. However, in shared environments, concurrent workloads can interfere with each other, leading to violations of Quality of Service (QoS). While QoS mechanisms such as fairness guarantees and latency constraints have been integrated into SSDs, existing transaction processing frameworks offer limited QoS guarantees and can significantly degrade overall performance in a shared environment. The reason is that the internal components of an SSD, originally designed to exploit parallelism, struggle to coordinate effectively when QoS mechanisms are applied to them. This paper proposes a novel QoS-enhanced transaction processing framework, called QoS-pro, which enhances QoS guarantees for concurrent workloads while maintaining high parallelism for SSDs. QoS-pro achieves this by redesigning transaction processing procedures to fully exploit the parallelism of shared SSDs and by enhancing QoS-oriented transaction translation and scheduling with parallelism features in mind. In terms of fairness guarantees, QoS-pro outperforms state-of-the-art methods, achieving a 96% fairness improvement and a 64% maximum-latency reduction. QoS-pro also shows almost no loss in throughput compared with parallelism-oriented methods. Additionally, QoS-pro triggers the fewest Garbage Collection (GC) operations and minimally affects concurrently running workloads during GC operations.
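For context, a common fairness metric in this literature (the paper's exact definition may differ) is the ratio of minimum to maximum per-flow slowdown, where a flow's slowdown is its shared-mode latency over its run-alone latency; 1.0 means perfectly fair.

```cpp
// Min/max-slowdown fairness across concurrent flows sharing an SSD.
#include <algorithm>
#include <vector>

double fairness(const std::vector<double>& shared_lat,
                const std::vector<double>& alone_lat) {
    double lo = 1e300, hi = 0.0;
    for (size_t i = 0; i < shared_lat.size(); ++i) {
        double slowdown = shared_lat[i] / alone_lat[i];  // >= 1.0 typically
        lo = std::min(lo, slowdown);
        hi = std::max(hi, slowdown);
    }
    return hi > 0.0 ? lo / hi : 1.0;  // 1.0 = all flows slowed equally
}
```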
{"title":"QoS-pro: A QoS-enhanced Transaction Processing Framework for Shared SSDs","authors":"Hao Fan, Yiliang Ye, Shadi Ibrahim, Zhuo Huang, Xingru Li, WeiBin Xue, Song Wu, Chen Yu, Xuanhua Shi, Hai Jin","doi":"10.1145/3632955","DOIUrl":"https://doi.org/10.1145/3632955","url":null,"abstract":"Solid State Drives (SSDs) are widely used in data-intensive scenarios due to their high performance and decreasing cost. However, in shared environments, concurrent workloads can interfere with each other, leading to a violation of Quality of Service (QoS). While QoS mechanisms like fairness guarantees and latency constraints have been integrated into SSDs, existing transaction processing frameworks offer limited QoS guarantees and can significantly degrade overall performance in a shared environment. The reason is that the internal components of an SSD, originally designed to exploit parallelism, struggle to coordinate effectively when QoS mechanisms are applied to them. This paper proposes a novel QoS -enhanced transaction pro cessing framework, called QoS-pro, which enhances QoS guarantees for concurrent workloads while maintaining high parallelism for SSDs. QoS-pro achieves this by redesigning transaction processing procedures to fully exploit the parallelism of shared SSDs and enhancing QoS-oriented transaction translation and scheduling with parallelism features in mind. In terms of fairness guarantees, QoS-pro outperforms state-of-the-art methods by achieving 96% fairness improvement and 64% maximum latency reduction. QoS-pro also shows almost no loss in throughput when compared with parallelism-oriented methods. Additionally, QoS-pro triggers the fewest Garbage Collection (GC) operations and minimally affects concurrently running workloads during GC operations.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"47 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134901890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fine-Grain Quantitative Analysis of Demand Paging in Unified Virtual Memory
Tyler Allen, Bennett Cooper, Rong Ge
The abstraction of a shared memory space over separate CPU and GPU memory domains has eased the burden of portability for many HPC codebases. However, users pay for the ease of use provided by system-managed memory with a moderate-to-high performance overhead. NVIDIA Unified Virtual Memory (UVM) is currently the primary real-world implementation of such an abstraction and offers a functionally equivalent testbed for in-depth performance studies of both UVM and future Linux Heterogeneous Memory Management (HMM)-compatible systems. The continued advocacy for UVM and HMM motivates improvement of the underlying system. We focus on UVM-based systems and investigate the root causes of UVM overhead, a non-trivial task due to the complex interactions of multiple hardware and software constituents and the desired cost granularity. In our prior work, we delved deeply into the UVM system architecture and showed the internal behavior of page-fault servicing in batches. We provided a quantitative evaluation of batch handling for various applications under different scenarios, including prefetching and oversubscription, and revealed that the driver workload depends on the interactions among application access patterns, GPU hardware constraints, and host OS components. Host OS components have significant overhead across implementations, warranting close attention. This extension furthers our prior study in three aspects: fine-grain cost analysis and breakdown, extension to multiple GPUs, and investigation of platforms with different GPU-GPU interconnects. We take a top-down approach to quantitative batch analysis and uncover how constituent component costs accumulate and overlap, governed by synchronous and asynchronous operations. Our multi-GPU analysis shows a reduced cost for GPU-GPU batch workloads compared to CPU-GPU workloads. We further demonstrate that while specialized interconnects such as NVLink can improve batch cost, their benefits are limited by host OS software overhead and GPU oversubscription. This study serves as a proxy for future shared memory systems, such as those interfacing with HMM, and for the development of interconnects.
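A minimal CUDA C++ example of the mechanisms under study: the kernel's first touches of managed memory trigger the demand paging and batched fault servicing analyzed in the paper, while cudaMemPrefetchAsync migrates pages up front and avoids those faults (error handling elided for brevity).

```cpp
// Build with nvcc. Demonstrates UVM demand paging vs. explicit prefetch.
#include <cuda_runtime.h>

__global__ void scale(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;  // first GPU touch faults unless prefetched
}

int main() {
    const int n = 1 << 24;
    float* x = nullptr;
    cudaMallocManaged(&x, n * sizeof(float));  // one shared address space
    for (int i = 0; i < n; ++i) x[i] = 1.0f;   // pages populated on the CPU

    int dev = 0;
    cudaGetDevice(&dev);
    // Optional: migrate pages to the GPU ahead of time instead of
    // fault-on-touch; comment this out to exercise demand paging.
    cudaMemPrefetchAsync(x, n * sizeof(float), dev);

    scale<<<(n + 255) / 256, 256>>>(x, n);
    cudaDeviceSynchronize();  // also makes x safe to touch on the CPU again
    cudaFree(x);
    return 0;
}
```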
{"title":"Fine-Grain Quantitative Analysis of Demand Paging in Unified Virtual Memory","authors":"Tyler Allen, Bennett Cooper, Rong Ge","doi":"10.1145/3632953","DOIUrl":"https://doi.org/10.1145/3632953","url":null,"abstract":"The abstraction of a shared memory space over separate CPU and GPU memory domains has eased the burden of portability for many HPC codebases. However, users pay for ease of use provided by system-managed memory with a moderate-to-high performance overhead. NVIDIA Unified Virtual Memory (UVM) is currently the primary real-world implementation of such abstraction and offers a functionally equivalent testbed for in-depth performance study for both UVM and future Linux Heterogeneous Memory Management (HMM) compatible systems. The continued advocacy for UVM and HMM motivates improvement of the underlying system. We focus on UVM-based systems and investigate root causes of UVM overhead, a non-trivial task due to complex interactions of multiple hardware and software constituents and the desired cost granularity. In our prior work, we delved deeply into UVM system architecture and showed internal behaviors of page fault servicing in batches. We provided quantitative evaluation of batch handling for various applications under different scenarios, including prefetching and oversubscription. We revealed the driver workload depends on the interactions among application access patterns, GPU hardware constraints, and host OS components. Host OS components have significant overhead present across implementations, warranting close attention. This extension furthers our prior study in three aspects: fine-grain cost analysis and breakdown, extension to multiple GPUs, and investigation of platforms with different GPU-GPU interconnects. We take a top-down approach to quantitative batch analysis and uncover how constituent component costs accumulate and overlap governed by synchronous and asynchronous operations. Our multi-GPU analysis shows reduced cost of GPU-GPU batch workloads compared to CPU-GPU workloads. We further demonstrate that while specialized interconnects, NVLink, can improve batch cost, their benefits are limited by host OS software overhead and GPU oversubscription. This study serves as a proxy for future shared memory systems, such as those that interface with HMM, and the development of interconnects.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"11 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134991126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}