
Latest Publications: ACM Transactions on Architecture and Code Optimization

Tyche: An Efficient and General Prefetcher for Indirect Memory Accesses
IF 1.6 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-01-22 | DOI: 10.1145/3641853
Feng Xue, Chenji Han, Xinyu Li, Junliang Wu, Tingting Zhang, Tianyi Liu, Yifan Hao, Zidong Du, Qi Guo, Fuxin Zhang

Indirect memory accesses (IMAs, i.e., A[f(B[i])]) are typical memory access patterns in applications such as graph analysis, machine learning, and databases. IMAs are composed of producer-consumer pairs, where the consumers’ memory addresses are derived from the producers’ memory data. Because of this inherent value dependence, IMAs exhibit poor locality, making conventional prefetching ineffective. Hindered by the difficulty of recording the potentially complex graphs of instruction dependencies among IMA producers and consumers, current state-of-the-art hardware prefetchers either (a) exhibit inadequate IMA identification abilities or (b) rely on a run-ahead mechanism that prefetches IMAs only intermittently and insufficiently.

To solve this problem, we propose Tyche, an efficient and general hardware prefetcher that improves IMA performance. Tyche adopts a bilateral propagation mechanism to precisely extract instruction dependencies along simple chains of moderate length (rather than complex graphs). Based on these exact instruction dependencies, Tyche can accurately identify various IMA patterns, including nonlinear ones, and continuously generate accurate prefetching requests. Evaluated on a broad set of benchmarks, Tyche achieves an average performance speedup of 16.2% over the state-of-the-art spatial prefetcher Berti. More importantly, Tyche outperforms the state-of-the-art IMA prefetchers IMP, Gretch, and Vector Runahead by 15.9%, 12.8%, and 10.7%, respectively, with a storage overhead of only 0.57 KB.
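To make the access pattern concrete, here is a minimal C++ sketch (not from the paper) of the A[f(B[i])] pattern the abstract describes: the address of each access to A depends on a value just loaded from B, which is why stride-based spatial prefetchers cannot predict it.

```cpp
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> B = {3, 1, 4, 1, 5, 9, 2, 6};  // producer: loaded values become indices
    std::vector<double> A(32, 1.0);                 // consumer: addressed indirectly
    double sum = 0.0;
    for (size_t i = 0; i < B.size(); ++i) {
        // The address of A[...] is only known after B[i] is loaded,
        // so the consumer access has no regular stride to exploit.
        sum += A[2 * B[i]];                         // f(x) = 2 * x
    }
    std::printf("sum = %f\n", sum);
    return 0;
}
```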

Rcmp: Reconstructing RDMA-Based Memory Disaggregation via CXL
IF 1.6 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-01-19 | DOI: 10.1145/3634916
Zhonghua Wang, Yixing Guo, Kai Lu, Jiguang Wan, Daohui Wang, Ting Yao, Huatao Wu

Memory disaggregation is a promising architecture for modern datacenters that separates compute and memory resources into independent pools connected by ultra-fast networks, which can improve memory utilization, reduce cost, and enable elastic scaling of compute and memory resources. However, existing memory disaggregation solutions based on remote direct memory access (RDMA) suffer from high latency and additional overheads, including page faults and code refactoring. Emerging cache-coherent interconnects such as CXL offer opportunities to reconstruct high-performance memory disaggregation. However, existing CXL-based approaches have a physical distance limitation and cannot be deployed across racks.

In this article, we propose Rcmp, a novel low-latency and highly scalable memory disaggregation system based on RDMA and CXL. Its key feature is that Rcmp improves the performance of RDMA-based systems via CXL and leverages RDMA to overcome CXL’s distance limitation. To address the mismatch between RDMA and CXL in terms of granularity, communication, and performance, Rcmp (1) provides global page-based memory space management that enables fine-grained data access, (2) designs an efficient communication mechanism to avoid communication blocking issues, (3) proposes a hot-page identification and swapping strategy to reduce RDMA communication, and (4) designs an RDMA-optimized RPC framework to accelerate RDMA transfers. We implement a prototype of Rcmp and evaluate its performance using micro-benchmarks and a key-value store running YCSB benchmarks. The results show that Rcmp achieves 5.2× lower latency and 3.8× higher throughput than RDMA-based systems. We also demonstrate that Rcmp scales well with an increasing number of nodes without compromising performance.
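As an illustration of the hot-page idea in point (3), here is a toy C++ sketch (hypothetical threshold and policy, not Rcmp's actual algorithm) that counts accesses per page and flags a page as a migration candidate once its counter crosses a threshold.

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>

// Toy hot-page detector: a page that is accessed often enough becomes a
// candidate to swap from RDMA-attached remote memory to CXL-attached local memory.
class HotPageTracker {
public:
    explicit HotPageTracker(uint64_t threshold) : threshold_(threshold) {}

    // Returns true exactly once per page, at the access that makes it "hot".
    bool record_access(uint64_t page_id) {
        return ++counters_[page_id] == threshold_;
    }

private:
    uint64_t threshold_;
    std::unordered_map<uint64_t, uint64_t> counters_;
};

int main() {
    HotPageTracker tracker(3);                       // hypothetical threshold
    const uint64_t accesses[] = {7, 7, 9, 7, 9, 9};  // page ids of simulated accesses
    for (uint64_t page : accesses) {
        if (tracker.record_access(page)) {
            std::printf("page %llu became hot: swap it into local CXL memory\n",
                        static_cast<unsigned long long>(page));
        }
    }
    return 0;
}
```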

Dedicated Hardware Accelerators for Processing of Sparse Matrices and Vectors: A Survey
IF 1.6 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-01-17 | DOI: 10.1145/3640542
Valentin Isaac–Chassande, Adrian Evans, Yves Durand, Frédéric Rousseau

Performance in scientific and engineering applications such as computational physics, algebraic graph problems, or Convolutional Neural Networks (CNNs) is dominated by the manipulation of large sparse matrices – matrices with a large number of zero elements. Specialized software using dedicated data formats for sparse matrices has been optimized for the main kernels of interest, the SpMV and SpMSpM matrix multiplications, but due to the indirect memory accesses, performance is still limited by the memory hierarchy of conventional computers. Recent work shows that specific hardware accelerators can reduce memory traffic and improve the execution time of sparse matrix multiplication compared to the best software implementations. The performance of these sparse hardware accelerators depends on the choice of sparse format (COO, CSR, etc.), the algorithm (inner-product, outer-product, Gustavson), and many hardware design choices. In this article, we propose a systematic survey which identifies the design choices of state-of-the-art accelerators for sparse matrix multiplication kernels. We introduce the necessary concepts and then present, compare, and classify the main sparse accelerators in the literature, using consistent notations. Finally, we propose a taxonomy for these accelerators to help future designers make the best choices depending on their objectives.
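As a reference point for the formats and kernels the survey covers, here is a minimal, generic C++ implementation of SpMV in the CSR format (a textbook sketch, not tied to any surveyed accelerator). The indirect access into the dense vector through the column-index array is the pattern that limits performance on conventional memory hierarchies.

```cpp
#include <cstdio>
#include <vector>

// y = A * x for a sparse matrix A stored in CSR (row_ptr, col_idx, vals).
void spmv_csr(const std::vector<int>& row_ptr, const std::vector<int>& col_idx,
              const std::vector<double>& vals, const std::vector<double>& x,
              std::vector<double>& y) {
    for (size_t row = 0; row + 1 < row_ptr.size(); ++row) {
        double acc = 0.0;
        for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k) {
            acc += vals[k] * x[col_idx[k]];  // indirect access into x via col_idx
        }
        y[row] = acc;
    }
}

int main() {
    // 3x3 matrix: [[10, 0, 0], [0, 20, 30], [0, 0, 40]]
    std::vector<int> row_ptr = {0, 1, 3, 4};
    std::vector<int> col_idx = {0, 1, 2, 2};
    std::vector<double> vals = {10, 20, 30, 40};
    std::vector<double> x = {1, 2, 3}, y(3, 0.0);
    spmv_csr(row_ptr, col_idx, vals, x, y);
    std::printf("%g %g %g\n", y[0], y[1], y[2]);  // prints: 10 130 120
    return 0;
}
```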

Cost-aware service placement and scheduling in the Edge-Cloud Continuum
IF 1.6 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-01-16 | DOI: 10.1145/3640823
Samuel Rac, Mats Brorsson

The edge-to-data-center computing continuum is the aggregation of computing resources located anywhere between the network edge (e.g., close to 5G antennas) and servers in traditional data centers. Kubernetes is the de facto standard for the orchestration of services in data center environments, where it is very efficient; however, it fails to deliver the same performance when edge resources are included. At the edge, resources are more limited and networking conditions change over time. In this paper, we present a methodology that lowers the cost of running applications in the edge-to-cloud computing continuum. This methodology can adapt to changing environments, e.g., moving end-users. We also monitor key performance indicators (KPIs) of the applications to ensure that cost optimizations do not negatively impact their Quality of Service. In addition, to ensure that performance remains optimal even when users are moving, we introduce a background process that periodically checks whether a better location is available for the service and, if so, moves the service. To demonstrate the performance of our scheduling approach, we evaluate it using a vehicle cooperative perception use case, a representative 5G application. With this use case, we demonstrate that our scheduling approach can robustly lower cost in different scenarios, whereas other available approaches either fail to adapt to changing environments or have poor cost-effectiveness in some scenarios.
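A toy C++ sketch of one iteration of the periodic relocation check described above (all site names, costs, and the latency budget are hypothetical; the paper's cost model and KPIs are richer): pick the cheapest candidate location that still meets a latency KPI and relocate the service if it differs from the current placement.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Candidate placement with a (hypothetical) hourly cost and expected latency.
struct Location {
    std::string name;
    double cost_per_hour;
    double expected_latency_ms;
};

// Cheapest candidate that still satisfies the latency KPI; empty if none does.
std::string best_location(const std::vector<Location>& candidates, double latency_budget_ms) {
    std::string best;
    double best_cost = 1e30;
    for (const auto& loc : candidates) {
        if (loc.expected_latency_ms <= latency_budget_ms && loc.cost_per_hour < best_cost) {
            best_cost = loc.cost_per_hour;
            best = loc.name;
        }
    }
    return best;
}

int main() {
    // One iteration of the background check.
    std::vector<Location> candidates = {
        {"edge-site-a", 0.12, 8.0},
        {"edge-site-b", 0.09, 14.0},
        {"cloud-region", 0.05, 35.0},
    };
    std::string current = "edge-site-a";
    std::string target = best_location(candidates, 20.0);  // 20 ms latency KPI
    if (!target.empty() && target != current) {
        std::printf("relocate service: %s -> %s\n", current.c_str(), target.c_str());
    }
    return 0;
}
```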

An Instruction Inflation Analyzing Framework for Dynamic Binary Translators
IF 1.6 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-01-15 | DOI: 10.1145/3640813
Benyi Xie, Yue Yan, Chenghao Yan, Sicheng Tao, Zhuangzhuang Zhang, Xinyu Li, Yanzhi Lan, Xiang Wu, Tianyi Liu, Tingting Zhang, Fuxin Zhang

Dynamic binary translators (DBTs) are widely used to migrate applications between different instruction set architectures (ISAs). Despite extensive research on improving DBT performance, noticeable overhead remains, preventing near-native performance, especially when translating from a complex instruction set computer (CISC) to a reduced instruction set computer (RISC). For computational workloads, the main overhead stems from translated code quality. Experimental data show that state-of-the-art DBT products have a dynamic code inflation of at least 1.46, meaning that on average more than 1.46 host instructions are needed to emulate one guest instruction. Worse, inflation closely correlates with translated code quality. However, the detailed sources of instruction inflation remain unclear.

To understand the sources of inflation, we present Deflater, an instruction inflation analysis framework comprising a mathematical model, a collection of black-box unit tests called BenchMIAOes, and a trace-based simulator called InflatSim. The mathematical model calculates overall inflation based on the inflation of individual instructions and translation block (TB) optimizations. BenchMIAOes extract model parameters from DBTs without accessing DBT source code. InflatSim implements the model and uses the parameters extracted by BenchMIAOes to simulate a given DBT’s behavior. Deflater is a valuable tool to guide DBT analysis and improvement. Using Deflater, we simulated inflation for three state-of-the-art CISC-to-RISC DBTs, ExaGear, Rosetta2, and LATX, with inflation errors of 5.63%, 5.15%, and 3.44%, respectively, on SPEC CPU 2017, gaining insights into these commercial DBTs. Deflater also efficiently models inflation for the open-source DBT QEMU and suggests optimizations that can substantially reduce inflation. Implementing the suggested optimizations confirms Deflater’s effective guidance, with a 4.65% inflation error and a 5.47x performance improvement.
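To make the inflation metric concrete, here is a simplified C++ sketch (the opcode mix and per-opcode inflation values are hypothetical, and Deflater's actual model additionally accounts for translation-block optimizations) that computes overall inflation as the dynamic-count-weighted average of per-instruction inflation.

```cpp
#include <cstdio>
#include <vector>

// Dynamic code inflation = host instructions executed / guest instructions emulated.
struct OpcodeStat {
    const char* name;
    double guest_count;     // dynamic guest instructions of this kind
    double host_per_guest;  // host instructions needed per guest instruction
};

int main() {
    std::vector<OpcodeStat> mix = {
        {"simple ALU", 6e9, 1.0},
        {"memory access", 3e9, 2.0},
        {"flag-setting compare", 1e9, 3.5},
    };
    double guest = 0.0, host = 0.0;
    for (const auto& s : mix) {
        guest += s.guest_count;
        host += s.guest_count * s.host_per_guest;
    }
    std::printf("overall inflation: %.2f host instructions per guest instruction\n",
                host / guest);  // 1.55 for this hypothetical mix
    return 0;
}
```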

Assessing the Impact of Compiler Optimizations on GPUs Reliability
IF 1.6 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-01-12 | DOI: 10.1145/3638249
Fernando Fernandes dos Santos, Luigi Carro, Flavio Vella, Paolo Rech

Graphics Processing Unit (GPU) compilers have evolved to support general-purpose programming languages on multiple architectures. The NVIDIA CUDA Compiler (NVCC) passes through many compilation levels before generating machine code and applies complex optimizations to improve performance. These optimizations modify how the software is mapped onto the underlying hardware; thus, as we show in this paper, they can also affect GPU reliability. We evaluate the effects on the GPU error rate of the optimization flags applied at the NVCC Parallel Thread Execution (PTX) compilation phase by analyzing two NVIDIA GPU architectures (Kepler and Volta) and two compiler versions (NVCC 10.2 and 11.3). We compare and combine fault propagation analysis based on software fault injection, hardware utilization distributions obtained with application-level profiling, and machine-instruction radiation-induced error rates measured with beam experiments. We consider eight different workloads and 144 combinations of compilation flags, and we show that optimizations can change the GPUs’ error rate by up to an order of magnitude. Additionally, through accelerated neutron beam experiments on an NVIDIA Kepler GPU, we show that the error rate of the unoptimized GEMM (-O0 flag) is lower than that of the optimized GEMM (-O3 flag). When performance is evaluated together with the error rate, we show that the most optimized versions (-O1 and -O3) always produce a higher amount of correct data than the unoptimized code (-O0).

An Efficient Hybrid Deep Learning Accelerator for Compact and Heterogeneous CNNs
IF 1.6 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-01-08 | DOI: 10.1145/3639823
Fareed Qararyah, Muhammad Waqar Azhar, Pedro Trancoso

Resource-efficient Convolutional Neural Networks (CNNs) are gaining more attention. These CNNs have relatively low computational and memory requirements. A common denominator among such CNNs is having more heterogeneity than traditional CNNs. This heterogeneity is present at two levels: intra-layer-type and inter-layer-type. Generic accelerators do not capture these levels of heterogeneity, which harms their efficiency. Consequently, researchers have proposed model-specific accelerators with dedicated engines. When designing an accelerator with dedicated engines, one option is to dedicate an engine per CNN layer. We refer to accelerators designed with this approach as single-engine single-layer (SESL). This approach enables optimizing each engine for its specific layer. However, such accelerators are resource-demanding and unscalable. Another option is to design a minimal number of dedicated engines such that each engine handles all layers of one type. We refer to these accelerators as single-engine multiple-layer (SEML). SEML accelerators capture the inter-layer-type heterogeneity, but not the intra-layer-type heterogeneity.

We propose FiBHA (Fixed Budget Hybrid CNN Accelerator), a hybrid accelerator composed of an SESL part and an SEML part, each processing a subset of CNN layers. FiBHA captures more heterogeneity than SEML accelerators while being more resource-aware and scalable than SESL accelerators. Moreover, we propose a novel module, the Fused Inverted Residual Bottleneck (FIRB), a fine-grained and memory-light SESL architecture building block. The proposed architecture is implemented and evaluated using high-level synthesis (HLS) on different FPGAs representing various resource budgets. Our evaluation shows that FiBHA improves throughput by up to 4x and 2.5x compared to state-of-the-art SESL and SEML accelerators, respectively. Moreover, FiBHA reduces memory and energy consumption compared to an SEML accelerator. The evaluation also shows that FIRB reduces the required memory by up to 54% and the energy requirements by up to 35% compared to traditional pipelining.

ApHMM: Accelerating Profile Hidden Markov Models for Fast and Energy-Efficient Genome Analysis
IF 1.6 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-12-28 | DOI: 10.1145/3632950
Can Firtina, Kamlesh Pillai, Gurpreet S. Kalsi, Bharathwaj Suresh, Damla Senol Cali, Jeremie S. Kim, Taha Shahroodi, Meryem Banu Cavlak, Joël Lindegger, Mohammed Alser, Juan Gómez Luna, Sreenivas Subramoney, Onur Mutlu

Profile hidden Markov models (pHMMs) are widely employed in various bioinformatics applications to identify similarities between biological sequences, such as DNA or protein sequences. In pHMMs, sequences are represented as graph structures, where states and edges capture modifications (i.e., insertions, deletions, and substitutions) by assigning probabilities to them. These probabilities are subsequently used to compute the similarity score between a sequence and a pHMM graph. The Baum-Welch algorithm, a prevalent and highly accurate method, utilizes these probabilities to optimize and compute similarity scores. Accurate computation of these probabilities is essential for the correct identification of sequence similarities. However, the Baum-Welch algorithm is computationally intensive, and existing solutions offer either software-only or hardware-only approaches with fixed pHMM designs. When we analyze state-of-the-art works, we identify an urgent need for a flexible, high-performance, and energy-efficient hardware-software co-design to address the major inefficiencies in the Baum-Welch algorithm for pHMMs.

We introduce ApHMM, the first flexible acceleration framework designed to significantly reduce both computational and energy overheads associated with the Baum-Welch algorithm for pHMMs. ApHMM employs hardware-software co-design to tackle the major inefficiencies in the Baum-Welch algorithm by 1) designing flexible hardware to accommodate various pHMM designs, 2) exploiting predictable data dependency patterns through on-chip memory with memoization techniques, 3) rapidly filtering out unnecessary computations using a hardware-based filter, and 4) minimizing redundant computations.

ApHMM achieves substantial speedups of 15.55×–260.03×, 1.83×–5.34×, and 27.97× when compared to CPU, GPU, and FPGA implementations of the Baum-Welch algorithm, respectively. ApHMM outperforms state-of-the-art CPU implementations in three key bioinformatics applications: 1) error correction, 2) protein family search, and 3) multiple sequence alignment, by 1.29×–59.94×, 1.03×–1.75×, and 1.03×–1.95×, respectively, while improving their energy efficiency by 64.24×–115.46×, 1.75×, and 1.96×.
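For context on the probability computations ApHMM accelerates, here is a minimal C++ implementation of the forward recurrence for a small discrete HMM, the building block that the Baum-Welch algorithm iterates over (a generic textbook sketch with made-up parameters, not ApHMM's pHMM design or hardware).

```cpp
#include <cstdio>
#include <vector>

// Forward recurrence: alpha[t][j] = emit[j][obs[t]] * sum_i alpha[t-1][i] * trans[i][j].
// Returns the likelihood of the observation sequence under the model.
double forward_likelihood(const std::vector<std::vector<double>>& trans,
                          const std::vector<std::vector<double>>& emit,
                          const std::vector<double>& init,
                          const std::vector<int>& obs) {
    const size_t n = init.size();
    std::vector<double> alpha(n), next(n);
    for (size_t j = 0; j < n; ++j) alpha[j] = init[j] * emit[j][obs[0]];
    for (size_t t = 1; t < obs.size(); ++t) {
        for (size_t j = 0; j < n; ++j) {
            double s = 0.0;
            for (size_t i = 0; i < n; ++i) s += alpha[i] * trans[i][j];
            next[j] = s * emit[j][obs[t]];
        }
        alpha.swap(next);
    }
    double like = 0.0;
    for (double a : alpha) like += a;
    return like;
}

int main() {
    // Two hidden states, two observation symbols (all parameters made up).
    std::vector<std::vector<double>> trans = {{0.7, 0.3}, {0.4, 0.6}};
    std::vector<std::vector<double>> emit  = {{0.9, 0.1}, {0.2, 0.8}};
    std::vector<double> init = {0.5, 0.5};
    std::vector<int> obs = {0, 1, 0};
    std::printf("P(obs) = %f\n", forward_likelihood(trans, emit, init, obs));
    return 0;
}
```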

Exploring Data Layout for Sparse Tensor Times Dense Matrix on GPUs
IF 1.6 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-12-28 | DOI: 10.1145/3633462
Khalid Ahmad, Cris Cecka, Michael Garland, Mary Hall

An important sparse tensor computation is sparse-tensor-dense-matrix multiplication (SpTM), which is used in tensor decomposition and its applications. SpTM is a multi-dimensional analog to sparse-matrix-dense-matrix multiplication (SpMM). In this paper, we employ a hierarchical tensor data layout that can unfold a multidimensional tensor to derive a 2D matrix, making it possible to compute SpTM using SpMM kernel implementations for GPUs. We compare two SpMM implementations to the state-of-the-art PASTA sparse tensor contraction implementation: (1) SpMM with the hierarchical tensor data layout, and (2) unfolding followed by an invocation of cuSPARSE’s SpMM. Results show that SpMM can outperform PASTA 70.9% of the time, but none of the three approaches is best overall. Therefore, we use a decision tree classifier to identify the best performing sparse tensor contraction kernel based on precomputed properties of the sparse tensor.
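To illustrate the unfolding step, here is a minimal C++ sketch of mode-1 matricization of a dense I×J×K tensor (a generic dense, row-major example for clarity; the paper's hierarchical layout and sparse formats are more involved). Element (i, j, k) maps to row i and column j + J*k of an I×(J*K) matrix.

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int I = 2, J = 3, K = 2;
    // Dense tensor stored row-major: t(i, j, k) lives at index i*J*K + j*K + k.
    std::vector<double> tensor(I * J * K);
    for (int idx = 0; idx < I * J * K; ++idx) tensor[idx] = idx;

    // Mode-1 unfolding: row = i, column = j + J * k.
    std::vector<double> unfolded(I * (J * K));
    for (int i = 0; i < I; ++i)
        for (int j = 0; j < J; ++j)
            for (int k = 0; k < K; ++k)
                unfolded[i * (J * K) + (j + J * k)] = tensor[i * J * K + j * K + k];

    for (int i = 0; i < I; ++i) {
        for (int c = 0; c < J * K; ++c) std::printf("%4.0f", unfolded[i * (J * K) + c]);
        std::printf("\n");
    }
    return 0;
}
```

Once the tensor is expressed as this 2D matrix, a standard SpMM kernel (e.g., cuSPARSE's, as in option (2) above) can be applied directly.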

A Concise Concurrent B+-Tree for Persistent Memory
IF 1.6 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-12-25 | DOI: 10.1145/3638717
Yan Wei, Zhang Xingjun

Persistent memory (PM) presents a unique opportunity for designing data management systems that offer improved performance, scalability, and instant restart capability. As a widely used data structure for managing data in such systems, the B+-Tree must address the challenges PM presents in both data consistency and device performance. However, existing designs suffer from significant performance degradation when maintaining data consistency on PM. To address this problem, we propose a new concurrent B+-Tree, CC-Tree, optimized for PM. CC-Tree ensures data consistency while providing high concurrent performance, thanks to several techniques, including partitioned metadata, log-free splits, and lock-free reads. We conducted experiments against state-of-the-art indices, and the results demonstrate significant performance improvements: approximately 1.2-1.6x for search, 1.5-1.7x for insertion, 1.5-2.8x for update, 1.9-4x for deletion, 0.9-10x for range scan, and up to 1.55-1.82x for hybrid workloads.
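The abstract does not detail how the lock-free reads are implemented; as a generic illustration of one common approach (version-based optimistic reads, an assumption on our part rather than CC-Tree's documented mechanism), here is a minimal C++ sketch in which a writer bumps a version counter around each update and readers retry when they observe a concurrent update.

```cpp
#include <atomic>
#include <cstdio>

// Writers make the version odd while updating and even when done; readers retry
// if they see an odd version or a version change around their read.
// Assumes a single writer at a time (e.g., writers hold a per-node write lock).
struct VersionedRecord {
    std::atomic<unsigned> version{0};
    std::atomic<int> payload{0};

    void write(int value) {
        version.fetch_add(1, std::memory_order_relaxed);  // now odd: update in progress
        payload.store(value, std::memory_order_release);
        version.fetch_add(1, std::memory_order_release);  // now even: update published
    }

    int read() const {
        for (;;) {
            unsigned v1 = version.load(std::memory_order_acquire);
            if (v1 & 1u) continue;                         // writer active, retry
            int value = payload.load(std::memory_order_acquire);
            unsigned v2 = version.load(std::memory_order_relaxed);
            if (v1 == v2) return value;                    // consistent snapshot
        }
    }
};

int main() {
    VersionedRecord r;
    r.write(42);
    std::printf("read -> %d\n", r.read());
    return 0;
}
```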
