Arun Thangamani, Vincent Loechner, Stéphane Genaud
Since the 1990s, many implementations of polyhedral compilers have been written and distributed, either as source-to-source translators or integrated into general-purpose compilers. This paper surveys the implementations available as of 2024.
We list and describe the most commonly available polyhedral schedulers and compiler implementations. We then compare the general-purpose polyhedral compilers on two main criteria, robustness and performance, using the PolyBench/C benchmark suite.
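To make the scope concrete, the loop nests these compilers target are affine kernels of the PolyBench/C kind. The sketch below is illustrative only, not taken from the survey (the kernel choice and tile size are arbitrary): a gemm-style nest and the tiled schedule a polyhedral scheduler such as Pluto might produce.

```python
# A PolyBench-style affine loop nest (gemm) and an equivalent tiled
# schedule of the kind a polyhedral scheduler can derive automatically.

def gemm(alpha, beta, C, A, B, n):
    # Original nest: every array access is an affine function of i, j, k.
    for i in range(n):
        for j in range(n):
            C[i][j] *= beta
            for k in range(n):
                C[i][j] += alpha * A[i][k] * B[k][j]

def gemm_tiled(alpha, beta, C, A, B, n, T=32):
    # Same iteration set, reordered into tiles for cache locality.
    for i in range(n):
        for j in range(n):
            C[i][j] *= beta
    for it in range(0, n, T):
        for jt in range(0, n, T):
            for kt in range(0, n, T):
                for i in range(it, min(it + T, n)):
                    for j in range(jt, min(jt + T, n)):
                        for k in range(kt, min(kt + T, n)):
                            C[i][j] += alpha * A[i][k] * B[k][j]

if __name__ == "__main__":
    import copy, random
    n = 8
    A = [[random.random() for _ in range(n)] for _ in range(n)]
    B = [[random.random() for _ in range(n)] for _ in range(n)]
    C1 = [[random.random() for _ in range(n)] for _ in range(n)]
    C2 = copy.deepcopy(C1)
    gemm(1.5, 0.5, C1, A, B, n)
    gemm_tiled(1.5, 0.5, C2, A, B, n, T=4)
    # Per-(i, j) accumulation order over k is unchanged, so results match.
    assert all(abs(C1[i][j] - C2[i][j]) < 1e-9
               for i in range(n) for j in range(n))
```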
{"title":"A Survey of General-purpose Polyhedral Compilers","authors":"Arun Thangamani, Vincent Loechner, Stéphane Genaud","doi":"10.1145/3674735","DOIUrl":"https://doi.org/10.1145/3674735","url":null,"abstract":"<p>Since the 1990’s many implementations of polyhedral compilers have been written and distributed, either as source-to-source translating compilers or integrated into wider purpose compilers. This paper provides a survey on those various available implementations as of today, 2024. </p><p>We list and describe most commonly available polyhedral schedulers and compiler implementations. Then, we compare the general-purpose polyhedral compilers using two main criteria, robustness and performance, on the PolyBench/C set of benchmarks.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"18 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141508981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ataberk Olgun, Fatma Bostanci, Geraldo Francisco de Oliveira Junior, Yahya Can Tugrul, Rahul Bera, Abdullah Giray Yaglikci, Hasan Hassan, Oguz Ergin, Onur Mutlu
Modern computing systems access data in main memory at coarse granularity (e.g., at 512-bit cache block granularity). Coarse-grained access leads to wasted energy because the system does not use all individually accessed small portions (e.g., words, each of which typically is 64 bits) of a cache block. In modern DRAM-based computing systems, two key coarse-grained access mechanisms lead to wasted energy: large and fixed-size (i) data transfers between DRAM and the memory controller and (ii) DRAM row activations.
We propose Sectored DRAM, a new, low-overhead DRAM substrate that reduces wasted energy by enabling fine-grained DRAM data transfer and DRAM row activation. To retrieve only useful data from DRAM, Sectored DRAM exploits the observation that many cache blocks are not fully utilized in many workloads due to poor spatial locality. Sectored DRAM predicts the words in a cache block that will likely be accessed during the cache block’s residency in cache and: (i) transfers only the predicted words on the memory channel by dynamically tailoring the DRAM data transfer size for the workload and (ii) activates a smaller set of cells that contain the predicted words by carefully operating physically isolated portions of DRAM rows (i.e., mats). Activating a smaller set of cells on each access relaxes DRAM power delivery constraints and allows the memory controller to schedule DRAM accesses faster.
We evaluate Sectored DRAM using 41 workloads from widely-used benchmark suites. Compared to a system with coarse-grained DRAM, Sectored DRAM reduces the DRAM energy consumption of highly-memory-intensive workloads by up to (on average) 33% (20%) while improving their performance by up to (on average) 36% (17%). Sectored DRAM’s DRAM energy savings, combined with its system performance improvement, allows system-wide energy savings of up to 23%. Sectored DRAM’s DRAM chip area overhead is 1.7% of the area of a modern DDR4 chip. Compared to state-of-the-art fine-grained DRAM architectures, Sectored DRAM greatly reduces DRAM energy consumption, does not reduce DRAM bandwidth, and can be implemented with low hardware cost. Sectored DRAM provides 89% of the performance benefits of, consumes 12% less DRAM energy than, and takes up 34% less DRAM chip area than a high-performance state-of-the-art fine-grained DRAM architecture (Half-DRAM). We hope and believe that Sectored DRAM’s ideas and results will help to enable more efficient and high-performance memory systems. To this end, we open source Sectored DRAM at https://github.com/CMU-SAFARI/Sectored-DRAM.
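As an illustration of the prediction step, the following minimal sketch is our assumption of a plausible predictor organization, not the paper's exact design; the PC-indexed table and the 8-bit sector mask are hypothetical. It tracks which 64-bit words of a 512-bit block were used during residency and predicts the words to transfer on the next miss.

```python
# Minimal sketch: word-granularity usage prediction for 512-bit cache
# blocks made of eight 64-bit words (table organization is an assumption).

WORDS_PER_BLOCK = 8  # 512-bit block / 64-bit words

class SectorPredictor:
    def __init__(self):
        self.table = {}  # load PC -> last observed word-usage bitmask

    def predict(self, pc):
        # Fetch only the predicted words; default to the full block.
        return self.table.get(pc, (1 << WORDS_PER_BLOCK) - 1)

    def train(self, pc, used_mask):
        # On eviction, record which words were touched during residency.
        self.table[pc] = used_mask

pred = SectorPredictor()
mask = pred.predict(pc=0x401A2C)          # first access: fetch all 8 words
pred.train(pc=0x401A2C, used_mask=0b11)   # only words 0 and 1 were used
assert pred.predict(pc=0x401A2C) == 0b11  # next miss transfers 2 words
```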
{"title":"Sectored DRAM: A Practical Energy-Efficient and High-Performance Fine-Grained DRAM Architecture","authors":"Ataberk Olgun, Fatma Bostanci, Geraldo Francisco de Oliveira Junior, Yahya Can Tugrul, Rahul Bera, Abdullah Giray Yaglikci, Hasan Hassan, Oguz Ergin, Onur Mutlu","doi":"10.1145/3673653","DOIUrl":"https://doi.org/10.1145/3673653","url":null,"abstract":"<p>Modern computing systems access data in main memory at <i>coarse granularity</i> (e.g., at 512-bit cache block granularity). Coarse-grained access leads to wasted energy because the system does <i>not</i> use all individually accessed small portions (e.g., <i>words</i>, each of which typically is 64 bits) of a cache block. In modern DRAM-based computing systems, two key coarse-grained access mechanisms lead to wasted energy: large and fixed-size (i) data transfers between DRAM and the memory controller and (ii) DRAM row activations. </p><p>We propose Sectored DRAM, a new, low-overhead DRAM substrate that reduces wasted energy by enabling <i>fine-grained</i> DRAM data transfer and DRAM row activation. To retrieve only useful data from DRAM, Sectored DRAM exploits the observation that many cache blocks are not fully utilized in many workloads due to poor spatial locality. Sectored DRAM predicts the words in a cache block that will likely be accessed during the cache block’s residency in cache and: (i) transfers only the predicted words on the memory channel by dynamically tailoring the DRAM data transfer size for the workload and (ii) activates a smaller set of cells that contain the predicted words by carefully operating physically isolated portions of DRAM rows (i.e., mats). Activating a smaller set of cells on each access relaxes DRAM power delivery constraints and allows the memory controller to schedule DRAM accesses faster. </p><p>We evaluate Sectored DRAM using 41 workloads from widely-used benchmark suites. Compared to a system with coarse-grained DRAM, Sectored DRAM reduces the DRAM energy consumption of highly-memory-intensive workloads by up to (on average) 33% (20%) while improving their performance by up to (on average) 36% (17%). Sectored DRAM’s DRAM energy savings, combined with its system performance improvement, allows system-wide energy savings of up to 23%. Sectored DRAM’s DRAM chip area overhead is 1.7% of the area of a modern DDR4 chip. Compared to state-of-the-art fine-grained DRAM architectures, Sectored DRAM greatly reduces DRAM energy consumption, does <i>not</i> reduce DRAM bandwidth, and can be implemented with low hardware cost. Sectored DRAM provides 89% of the performance benefits of, consumes 12% less DRAM energy than, and takes up 34% less DRAM chip area than a high-performance state-of-the-art fine-grained DRAM architecture (Half-DRAM). We hope and believe that Sectored DRAM’s ideas and results will help to enable more efficient and high-performance memory systems. 
To this end, we open source Sectored DRAM at https://github.com/CMU-SAFARI/Sectored-DRAM.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"94 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141508982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kai Lu, Siqi Zhao, Haikang Shan, Qiang Wei, Guokuan Li, Jiguang Wan, Ting Yao, Huatao Wu, Daohui Wang
Disaggregated memory separates compute and memory resources into independent pools connected by RDMA (Remote Direct Memory Access) networks, which can improve memory utilization, reduce cost, and enable elastic scaling of compute and memory resources. However, existing RDMA-based distributed transactions on disaggregated memory suffer from severe long-tail latency under high-contention workloads.
In this paper, we propose Scythe, a novel low-latency RDMA-enabled distributed transaction system for disaggregated memory. Scythe optimizes the latency of high-contention transactions in three ways: (1) a hot-aware concurrency control policy that uses optimistic concurrency control (OCC) to improve transaction processing efficiency in low-conflict scenarios and, under high conflict, a timestamp-ordered OCC (TOCC) strategy based on fair locking to reduce the number of retries and the cross-node communication overhead; (2) an RDMA-friendly timestamp service for improved timestamp management; and (3) an RDMA-optimized RPC framework that improves RDMA bandwidth utilization. The evaluation results show that, compared to state-of-the-art distributed transaction systems, Scythe achieves more than 2.5× lower latency and 1.8× higher throughput under high-contention workloads.
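A minimal, single-machine sketch of the hot-aware policy switch follows; the abort threshold, names, and in-memory store are our illustrative assumptions (Scythe itself operates over RDMA with fair, timestamp-ordered locks).

```python
# Sketch: keys that abort often under OCC are treated as hot and handled
# on a pessimistic locked path instead of retrying optimistically.

import threading
from collections import defaultdict

HOT_ABORT_THRESHOLD = 3               # assumed tuning knob

store = {}                            # key -> (value, version)
abort_count = defaultdict(int)        # key -> OCC aborts seen so far
locks = defaultdict(threading.Lock)   # per-key lock for the hot path

def occ_update(key, fn):
    # Optimistic path: read a version, compute, validate at commit time.
    value, version = store.get(key, (0, 0))
    new_value = fn(value)
    if store.get(key, (0, 0))[1] != version:   # concurrent commit happened
        abort_count[key] += 1
        return False
    store[key] = (new_value, version + 1)
    return True

def update(key, fn):
    if abort_count[key] < HOT_ABORT_THRESHOLD:
        return occ_update(key, fn)        # low conflict: stay optimistic
    with locks[key]:                      # hot key: lock, no retries
        value, version = store.get(key, (0, 0))
        store[key] = (fn(value), version + 1)
        return True
```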
{"title":"Scythe: A Low-latency RDMA-enabled Distributed Transaction System for Disaggregated Memory","authors":"Kai Lu, Siqi Zhao, Haikang Shan, Qiang Wei, Guokuan Li, Jiguang Wan, Ting Yao, Huatao Wu, Daohui Wang","doi":"10.1145/3666004","DOIUrl":"https://doi.org/10.1145/3666004","url":null,"abstract":"<p>Disaggregated memory separates compute and memory resources into independent pools connected by RDMA (Remote Direct Memory Access) networks, which can improve memory utilization, reduce cost, and enable elastic scaling of compute and memory resources. However, existing RDMA-based distributed transactions on disaggregated memory suffer from severe long-tail latency under high-contention workloads. </p><p>In this paper, we propose Scythe, a novel low-latency RDMA-enabled distributed transaction system for disaggregated memory. Scythe optimizes the latency of high-contention transactions in three approaches: 1) Scythe proposes a hot-aware concurrency control policy that uses optimistic concurrency control (OCC) to improve transaction processing efficiency in low-conflict scenarios. Under high conflicts, Scythe designs a timestamp-ordered OCC (TOCC) strategy based on fair locking to reduce the number of retries and cross-node communication overhead. 2) Scythe presents an RDMA-friendly timestamp service for improved timestamp management. 3) Scythe designs an RDMA-optimized RPC framework to improve RDMA bandwidth utilization. The evaluation results show that, compared to state-of-the-art distributed transaction systems, Scythe achieves more than 2.5 × lower latency with 1.8 × higher throughput under high-contention workloads.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"21 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141171974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DRAM memory is a performance bottleneck for many applications due to its high access latency. Previous work has mainly focused on data locality, introducing small but fast regions that cache frequently accessed data to reduce the average latency. However, these locality-based designs face three challenges in modern multi-core systems: (1) inter-application interference leads to random memory access traffic, (2) fairness issues prevent the memory controller from over-prioritizing data locality, and (3) write-intensive applications have much lower locality and evict substantial numbers of dirty entries. With frequent data movement between the fast in-DRAM cache and the slow regular arrays, the overhead of moving data may even offset the performance and energy benefits of in-DRAM caching.
In this article, we decouple the data movement process into two distinct phases. The first phase is Load-Reduced Destructive Activation (LRDA), which destructively promotes data into the in-DRAM cache. The second phase is Delayed Cycle-Stealing Restoration (DCSR), which restores the original data when the DRAM bank is idle. LRDA decouples the most time-consuming restoration phase from activation, and DCSR hides the restoration latency through prevalent bank-level parallelism. We propose FASA-DRAM, incorporating destructive activation and delayed restoration techniques to enable both in-DRAM caching and proactive latency-hiding mechanisms. Our evaluation shows that FASA-DRAM improves the average performance by 19.9% and reduces average DRAM energy consumption by 18.1% over DDR4 DRAM for four-core workloads, with less than 3.4% extra area overhead. Furthermore, FASA-DRAM outperforms state-of-the-art designs in both performance and energy efficiency.
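The decoupling can be pictured with a toy single-bank model; this is an illustrative analogy under our own assumptions, not the authors' hardware. Demand activations skip the restore phase (LRDA), and the deferred restore is replayed when the bank goes idle (DCSR).

```python
# Toy bank timeline: LRDA defers restoration; DCSR steals idle cycles.

import collections

class Bank:
    def __init__(self):
        self.demand = collections.deque()  # queued row activations
        self.pending_restore = None        # row awaiting restoration

    def tick(self):
        if self.demand:
            row = self.demand.popleft()
            if self.pending_restore not in (None, row):
                # Cannot destroy a second row before restoring the first.
                self.demand.appendleft(row)
                row, self.pending_restore = self.pending_restore, None
                return f"restore row {row}"
            self.pending_restore = row     # LRDA: skip the restore phase
            return f"activate row {row} (no restore)"
        if self.pending_restore is not None:
            row, self.pending_restore = self.pending_restore, None
            return f"restore row {row} (idle cycle)"   # DCSR
        return "idle"

b = Bank()
b.demand.extend([3, 3, 7])
for _ in range(5):
    print(b.tick())
# activate row 3 (no restore) / activate row 3 (no restore) /
# restore row 3 / activate row 7 (no restore) / restore row 7 (idle cycle)
```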
{"title":"FASA-DRAM: Reducing DRAM Latency with Destructive Activation and Delayed Restoration","authors":"Haitao Du, Yuhan Qin, Song Chen, Yi Kang","doi":"10.1145/3649455","DOIUrl":"https://doi.org/10.1145/3649455","url":null,"abstract":"<p>DRAM memory is a performance bottleneck for many applications, due to its high access latency. Previous work has mainly focused on data locality, introducing small but fast regions to cache frequently accessed data, thereby reducing the average latency. However, these locality-based designs have three challenges in modern multi-core systems: (1) inter-application interference leads to random memory access traffic, (2) fairness issues prevent the memory controller from over-prioritizing data locality, and (3) write-intensive applications have much lower locality and evict substantial dirty entries. With frequent data movement between the fast in-DRAM cache and slow regular arrays, the overhead induced by moving data may even offset the performance and energy benefits of in-DRAM caching.</p><p>In this article, we decouple the data movement process into two distinct phases. The first phase is Load-Reduced Destructive Activation (LRDA), which destructively promotes data into the in-DRAM cache. The second phase is Delayed Cycle-Stealing Restoration (DCSR), which restores the original data when the DRAM bank is idle. LRDA decouples the most time-consuming restoration phase from activation, and DCSR hides the restoration latency through prevalent bank-level parallelism. We propose FASA-DRAM, incorporating destructive activation and delayed restoration techniques to enable both in-DRAM caching and proactive latency-hiding mechanisms. Our evaluation shows that FASA-DRAM improves the average performance by 19.9% and reduces average DRAM energy consumption by 18.1% over DDR4 DRAM for four-core workloads, with less than 3.4% extra area overhead. Furthermore, FASA-DRAM outperforms state-of-the-art designs in both performance and energy efficiency.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"35 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141147023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bobin Deng, Bhargava Nadendla, Kun Suo, Yixin Xie, Dan Chia-Tien Lo
Residue Number Systems (RNS) show fascinating potential for integer addition/multiplication-intensive applications. The complexity of Artificial Intelligence (AI) models has grown enormously in recent years, and from a computer system's perspective, training these large-scale AI models within acceptable time and energy budgets has become a major concern. Matrix multiplication is a dominant subroutine in many prevailing AI models, with an addition/multiplication-intensive profile. However, matrix multiplication in machine-learning training typically operates on real numbers, so the RNS benefits for integer applications cannot be directly obtained in AI training. State-of-the-art RNS real-number encodings, both floating-point and fixed-point, have defects and can be further enhanced. To translate the inherent RNS benefits into efficient large-scale AI training, we propose a low-cost, high-accuracy RNS fixed-point representation: Single RNS Logical Partition (S-RNS-Logic-P) representation with Scaling Down Postprocessing Multiplication (SD-Post-Mul). Moreover, we detail the implementation of two other RNS fixed-point methods: Double RNS Concatenation (D-RNS-Concat) and S-RNS-Logic-P representation with Scaling Down Preprocessing Multiplication (SD-Pre-Mul). We also design the architectures of these three fixed-point multipliers. In empirical experiments, our S-RNS-Logic-P representation with SD-Post-Mul achieves lower latency and energy overhead while maintaining good accuracy. Furthermore, this method extends easily to the Redundant Residue Number System (RRNS), raising efficiency in error-tolerant domains, such as improving the error-correction efficiency of quantum computing.
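For intuition, here is a worked sketch of RNS fixed-point multiplication with scaling applied after the product is formed, in the spirit of SD-Post-Mul. It is greatly simplified: scaling happens via CRT reconstruction in software rather than the paper's hardware datapath, the moduli are arbitrary, and negative values are not handled.

```python
# RNS fixed-point multiply: encode to residues, multiply channel-wise,
# reconstruct via CRT, then scale down by the fixed-point factor.

from math import prod

MODULI = (251, 253, 255, 256)          # pairwise coprime
M = prod(MODULI)                       # dynamic range [0, M)
FRAC_BITS = 8                          # fixed-point scale factor 2^8

def encode(x: int):                    # integer -> residue vector
    return tuple(x % m for m in MODULI)

def crt_decode(res):                   # residue vector -> integer in [0, M)
    total = 0
    for r, m in zip(res, MODULI):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)   # modular inverse (Python 3.8+)
    return total % M

def fixmul(a: float, b: float) -> float:
    xa, xb = int(a * 2**FRAC_BITS), int(b * 2**FRAC_BITS)
    res = tuple((ra * rb) % m for ra, rb in zip(encode(xa), encode(xb)))
    # Scale the double-width product down by 2^FRAC_BITS *after* multiply.
    return (crt_decode(res) >> FRAC_BITS) / 2**FRAC_BITS

assert abs(fixmul(1.5, 2.25) - 3.375) < 2**-FRAC_BITS
```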
{"title":"Fixed-point Encoding and Architecture Exploration for Residue Number Systems","authors":"Bobin Deng, Bhargava Nadendla, Kun Suo, Yixin Xie, Dan Chia-Tien Lo","doi":"10.1145/3664923","DOIUrl":"https://doi.org/10.1145/3664923","url":null,"abstract":"<p>Residue Number Systems (RNS) demonstrate the fascinating potential to serve integer addition/multiplication-intensive applications. The complexity of Artificial Intelligence (AI) models has grown enormously in recent years. From a computer system’s perspective, ensuring the training of these large-scale AI models within an adequate time and energy consumption has become a big concern. Matrix multiplication is a dominant subroutine in many prevailing AI models, with an addition/multiplication-intensive attribute. However, the data type of matrix multiplication within machine learning training typically requires real numbers, which indicates that RNS benefits for integer applications cannot be directly gained by AI training. The state-of-the-art RNS real number encodings, including floating-point and fixed-point, have defects and can be further enhanced. To transform default RNS benefits to the efficiency of large-scale AI training, we propose a low-cost and high-accuracy RNS fixed-point representation: <i>Single RNS Logical Partition (S-RNS-Logic-P) representation with Scaling Down Postprocessing Multiplication (SD-Post-Mul)</i>. Moreover, we extend the implementation details of the other two RNS fixed-point methods: <i>Double RNS Concatenation (D-RNS-Concat)</i> and <i>Single RNS Logical Partition (S-RNS-Logic-P) representation with Scaling Down Preprocessing Multiplication (SD-Pre-Mul)</i>. We also design the architectures of these three fixed-point multipliers. In empirical experiments, our <i>S-RNS-Logic-P representation with SD-Post-Mul</i> method achieves less latency and energy overhead while maintaining good accuracy. Furthermore, this method can easily extend to the Redundant Residue Number System (RRNS) to raise the efficiency of error-tolerant domains, such as improving the error correction efficiency of quantum computing.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"147 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140931088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dongmoon Min, Ilkwon Byun, Gyu-hyeon Lee, Jangwoo Kim
For datacenter architects, the most important goal is to minimize the datacenter's total cost of ownership for a target performance (i.e., TCO/performance). As the major component of a datacenter is its server farm, the most effective way of reducing TCO/performance is to improve server performance and power efficiency. To achieve this goal, we claim that reducing each server's temperature to its most cost-effective point (temperature scaling) is highly promising.
In this paper, we propose CoolDC, a novel, immediately applicable low-temperature cooling method to minimize the datacenter's TCO. The key idea is to find and apply the most cost-effective sub-freezing temperature to target servers and workloads. For that purpose, we first apply immersion cooling to entire servers to maintain a stable low temperature with little extra cooling or maintenance cost. Second, we define the TCO-optimal temperature for datacenter operation (e.g., 248 K to 273 K (-25°C to 0°C)) by carefully estimating all the costs and benefits at low temperatures. Finally, we propose CoolDC, our immersion-cooling datacenter architecture that runs every workload at its own TCO-optimal temperature. By incorporating our low-temperature, workload-aware temperature scaling, CoolDC achieves 12.7% and 13.4% lower TCO/performance than conventional air-cooled and immersion-cooled datacenters, respectively, without any modification to existing computers.
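The temperature-scaling decision can be sketched as a one-dimensional search over candidate temperatures under assumed cost models; both curves below are stand-ins of our own, not CoolDC's calibrated models.

```python
# Per-workload TCO/performance minimization over candidate temperatures.

from dataclasses import dataclass

@dataclass
class Workload:
    server_cost: float           # normalized server TCO at ambient (298 K)
    cooling_per_kelvin: float    # extra cooling cost per kelvin below 298 K
    perf_gain_per_kelvin: float  # fractional speedup per kelvin of cooling

    def perf_at(self, temp_k):
        # Assumed linear speedup as temperature drops (faster transistors).
        return 1.0 + self.perf_gain_per_kelvin * (298.0 - temp_k)

def tco_per_perf(w, temp_k):
    cooling = w.cooling_per_kelvin * max(0.0, 298.0 - temp_k)
    return (w.server_cost + cooling) / w.perf_at(temp_k)

def optimal_temperature(w, candidates=range(248, 299)):
    # Scan the sub-freezing range the paper cites (248 K-273 K) to ambient.
    return min(candidates, key=lambda t: tco_per_perf(w, t))

cpu_bound = Workload(server_cost=1.0, cooling_per_kelvin=0.002,
                     perf_gain_per_kelvin=0.01)
print(optimal_temperature(cpu_bound))   # colder pays off for this workload
```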
{"title":"CoolDC: A Cost-Effective Immersion-Cooled Datacenter with Workload-Aware Temperature Scaling","authors":"Dongmoon Min, Ilkwon Byun, Gyu-hyeon Lee, Jangwoo Kim","doi":"10.1145/3664925","DOIUrl":"https://doi.org/10.1145/3664925","url":null,"abstract":"<p>For datacenter architects, it is the most important goal to minimize <i>the datacenter’s total cost of ownership for the target performance</i> (i.e., TCO/performance). As the major component of a datacenter is a server farm, the most effective way of reducing TCO/performance is to improve the server’s performance and power efficiency. To achieve the goal, we claim that it is highly promising to reduce each server’s temperature to its most cost-effective point (or temperature scaling). </p><p>In this paper, we propose <i>CoolDC</i>, a novel and immediately-applicable low-temperature cooling method to minimize the datacenter’s TCO. The key idea is to find and apply the most cost-effective sub-freezing temperature to target servers and workloads. For that purpose, we first apply the immersion cooling method to the entire servers to maintain a stable low temperature with little extra cooling and maintenance costs. Second, we define the TCO-optimal temperature for datacenter operation (e.g., 248K~273K (-25℃~0℃)) by carefully estimating all the costs and benefits at low temperatures. Finally, we propose CoolDC, our immersion-cooling datacenter architecture to run every workload at its own TCO-optimal temperature. By incorporating our low-temperature workload-aware temperature scaling, CoolDC achieves 12.7% and 13.4% lower TCO/performance than the conventional air-cooled and immersion-cooled datacenters, respectively, without any modification to existing computers.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"54 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140931024","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
More and more storage systems use erasure codes to tolerate faults. An erasure code takes a set of data blocks as input and encodes a small number of parity blocks as output; together, these blocks form a stripe. When reconsidering the recovery problem at the multi-stripe level and in heterogeneous network clusters, quickly generating an efficient multi-stripe recovery solution that reduces recovery time remains challenging and time-consuming. Previous works either use a greedy algorithm, which may fall into a local optimum and yield low recovery performance, or a meta-heuristic algorithm, which has a long running time and low solution-generation efficiency.
In this paper, we propose a Stripe-schedule Aware Repair (SARepair) technique for multi-stripe recovery in heterogeneous erasure-coded clusters based on RS codes. By carefully examining block metadata, SARepair intelligently adjusts the recovery solution for each stripe and obtains another multi-stripe solution with less recovery time in a computationally efficient manner. It then tolerates worse solutions to escape local optima and uses a rollback mechanism to adjust search regions and further reduce recovery time. Moreover, instead of reading blocks sequentially from each node, SARepair selectively schedules the reading order for each block to reduce memory overhead. We extend SARepair to address full-node recovery and adapt it to the LRC code. We prototype SARepair and show, via both simulations and Amazon EC2 experiments, that recovery performance improves by up to 59.97% over a state-of-the-art recovery approach while keeping running time and memory overhead low.
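The search skeleton, tolerating slightly worse neighbors while remembering the best solution seen, might look like the following sketch. The toy cost model, the single-stripe move, and all parameters are our assumptions, not SARepair's exact algorithm.

```python
# Local search over per-stripe helper choices: recovery time is modeled as
# the max over nodes of blocks read divided by that node's bandwidth.

import random

def recovery_time(choice, options, bandwidth):
    load = {n: 0.0 for n in bandwidth}
    for s, idx in enumerate(choice):
        for node in options[s][idx]:      # helper nodes read for stripe s
            load[node] += 1.0             # one block read from that node
    return max(reads / bandwidth[n] for n, reads in load.items())

def sarepair_search(options, bandwidth, steps=500, tol=1.05, seed=0):
    rng = random.Random(seed)
    cur = [0] * len(options)              # naive starting solution
    best, best_t = list(cur), recovery_time(cur, options, bandwidth)
    for _ in range(steps):
        s = rng.randrange(len(options))   # re-pick one stripe's helpers
        cand = list(cur)
        cand[s] = rng.randrange(len(options[s]))
        if (recovery_time(cand, options, bandwidth)
                <= tol * recovery_time(cur, options, bandwidth)):
            cur = cand                    # tolerate slightly worse moves
        t = recovery_time(cur, options, bandwidth)
        if t < best_t:
            best, best_t = list(cur), t   # remember the best-seen solution
    return best                           # "rollback": return best, not cur
```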
{"title":"Stripe-schedule Aware Repair in Erasure-coded Clusters with Heterogeneous Star Networks","authors":"Hai Zhou, Dan Feng","doi":"10.1145/3664926","DOIUrl":"https://doi.org/10.1145/3664926","url":null,"abstract":"<p>More and more storage systems use erasure code to tolerate faults. It takes pieces of data blocks as input and encodes a small number of parity blocks as output, where these blocks form a stripe. When reconsidering the recovery problem in the multi-stripe level and heterogeneous network clusters, quickly generating an efficient multi-stripe recovery solution that reduces recovery time remains a challenging and time-consuming task. Previous works either use a greedy algorithm that may fall into the local optimal and have low recovery performance or a meta-heuristic algorithm with a long running time and low solution generation efficiency. </p><p>In this paper, we propose a <i>Stripe-schedule Aware Repair</i> (SARepair) technique for multi-stripe recovery in heterogeneous erasure-coded clusters based on RS code. By carefully examining the metadata of blocks, SARepair intelligently adjusts the recovery solution for each stripe and obtains another multi-stripe solution with less recovery time in a computationally efficient manner. It then tolerates worse solutions to overcome the local optimal and uses a rollback mechanism to adjust search regions to reduce recovery time further. Moreover, instead of reading blocks sequentially from each node, SARepair also selectively schedules the reading order for each block to reduce the memory overhead. We extend SARepair to address the full-node recovery and adapt to the LRC code. We prototype SARepair and show via both simulations and Amazon EC2 experiments that the recovery performance can be improved by up to 59.97% over a state-of-the-art recovery approach while keeping running time and memory overhead low.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"42 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140931025","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The growing memory demands of modern applications have driven the adoption of far memory technologies in data centers to provide cost-effective, high-capacity memory solutions. However, far memory presents new performance challenges because its access latencies are significantly longer and more variable than local DRAM. For applications to achieve acceptable performance on far memory, a high degree of memory-level parallelism (MLP) is needed to tolerate the long access latency.
While modern out-of-order processors are capable of exploiting a certain degree of MLP, they are constrained by resource limitations and hardware complexity. The key obstacle is the synchronous memory access semantics of traditional load/store instructions, which occupy critical hardware resources for a long time. The longer far memory latencies exacerbate this limitation.
This paper proposes a set of Asynchronous Memory Access Instructions (AMI) and a supporting functional unit, the Asynchronous Memory Access Unit (AMU), inside a contemporary out-of-order core. AMI separates memory request issuing from response handling to reduce resource occupation. Additionally, the AMU architecture supports up to several hundred asynchronous memory requests by re-purposing a portion of the L2 cache as scratchpad memory (SPM) to provide sufficient temporary storage. Together with a coroutine-based programming framework, this scheme achieves significantly higher MLP for hiding far memory latencies.
Evaluation with a cycle-accurate simulation shows AMI achieves a 2.42× speedup on average for memory-bound benchmarks with 1 μs of additional far memory latency. Over 130 outstanding requests are supported, with a 26.86× speedup for GUPS (random access) at 5 μs latency. These results demonstrate how the proposed techniques mitigate the performance impact of far memory through explicit MLP expression and latency adaptation.
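A software analogy of the decoupled issue/response style (not the AMI ISA itself, and with a purely illustrative latency model) is a coroutine program that issues all far-memory loads before waiting on any response, overlapping their latencies.

```python
# Coroutine-style MLP: many outstanding far-memory loads overlap a single
# microsecond-scale latency instead of paying it serially per access.

import asyncio
import random

FAR_LATENCY_S = 1e-6                 # modeled 1 microsecond far-memory access

async def far_load(addr):
    await asyncio.sleep(FAR_LATENCY_S)   # request in flight, core stays free
    return addr * 2                      # stand-in for returned data

async def gups(table_size=1024, updates=256):
    addrs = [random.randrange(table_size) for _ in range(updates)]
    # Issue all requests before waiting on any response: high MLP.
    values = await asyncio.gather(*(far_load(a) for a in addrs))
    return sum(values)

print(asyncio.run(gups()))
```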
{"title":"Asynchronous Memory Access Unit: Exploiting Massive Parallelism for Far Memory Access","authors":"Luming Wang, Xu Zhang, Songyue Wang, Zhuolun Jiang, Tianyue Lu, Mingyu Chen, Siwei Luo, Keji Huang","doi":"10.1145/3663479","DOIUrl":"https://doi.org/10.1145/3663479","url":null,"abstract":"<p>The growing memory demands of modern applications have driven the adoption of far memory technologies in data centers to provide cost-effective, high-capacity memory solutions. However, far memory presents new performance challenges because its access latencies are significantly longer and more variable than local DRAM. For applications to achieve acceptable performance on far memory, a high degree of memory-level parallelism (MLP) is needed to tolerate the long access latency. </p><p>While modern out-of-order processors are capable of exploiting a certain degree of MLP, they are constrained by resource limitations and hardware complexity. The key obstacle is the synchronous memory access semantics of traditional load/store instructions, which occupy critical hardware resources for a long time. The longer far memory latencies exacerbate this limitation. </p><p>This paper proposes a set of Asynchronous Memory Access Instructions (AMI) and its supporting function unit, Asynchronous Memory Access Unit (AMU), inside contemporary Out-of-Order Core. AMI separates memory request issuing from response handling to reduce resource occupation. Additionally, AMU architecture supports up to several hundreds of asynchronous memory requests through re-purposing a portion of L2 Cache as scratchpad memory (SPM) to provide sufficient temporal storage. Together with a coroutine-based programming framework, this scheme can achieve significantly higher MLP for hiding far memory latencies. </p><p>Evaluation with a cycle-accurate simulation shows AMI achieves 2.42 × speedup on average for memory-bound benchmarks with 1<i>μ</i>s additional far memory latency. Over 130 outstanding requests are supported with 26.86 × speedup for GUPS (random access) with 5 <i>μ</i>s latency. These demonstrate how the techniques tackle far memory performance impacts through explicit MLP expression and latency adaptation.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"6 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140936310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qiao Li, Yu Chen, Guanyu Wu, Yajuan Du, Min Ye, Xinbiao Gan, Jie Zhang, Zhirong Shen, Jiwu Shu, Chun Xue
As NAND flash memories' bit density and stacking technologies develop, storage capacity keeps increasing while reliability becomes an increasingly prominent issue. Low-density parity check (LDPC) codes, as robust error-correcting codes, are extensively employed in flash memory. However, when the raw bit error rate (RBER) is prohibitively high, LDPC decoding introduces long latency. To study how LDPC performs on the latest 3D NAND flash memory, we conduct a comprehensive analysis of LDPC decoding performance using both the threshold voltage distribution derived theoretically through modeling (the Modeling-based method) and the actual voltage distribution collected from on-chip data through testing (the Ideal case). Based on LDPC decoding results under various interference conditions, we summarize four findings that help build a better understanding of the characteristics of LDPC decoding in 3D NAND flash memory. Following our characterization, we identify differences in LDPC decoding performance between the Modeling-based method and the Ideal case. The threshold voltage distribution derived through modeling deviates to some degree from the actual threshold voltage distribution, which degrades the accuracy of the decoder's initial probability information and leads to a performance gap between using the Modeling-based distribution and the actual distribution. Observing the abnormal decoding behaviors of the Modeling-based method, we introduce an Offsetted Read Voltage (ΔRV) method that optimizes LDPC decoding performance by offsetting the read voltage in each layer of a flash block. The evaluation results show that our ΔRV method enhances the LDPC decoding performance of the Modeling-based method, reducing the total number of sensing levels needed for decoding by 0.67% to 18.92% on average across different interference conditions, under P/E cycles from 3000 to 7000.
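The per-layer offset selection can be illustrated with a toy two-state model. This is an assumption for exposition only: Gaussian threshold-voltage states and an integer-step sweep, whereas the real method works on the modeled multi-level distributions of each 3D NAND layer.

```python
# Choose a per-layer read-voltage offset that minimizes expected raw bit
# errors between two Gaussian threshold-voltage states.

from statistics import NormalDist

def expected_bit_errors(v_read, lo, hi):
    # Errors: lo-state cells read above v_read plus hi-state cells below it.
    return (1 - lo.cdf(v_read)) + hi.cdf(v_read)

def best_offset(v_nominal, lo, hi, sweep=range(-40, 41)):
    # Sweep candidate offsets around the nominal read voltage and keep
    # the one minimizing the expected raw bit errors.
    return min(sweep, key=lambda d: expected_bit_errors(v_nominal + d, lo, hi))

# A layer whose distributions drifted upward: a positive offset is chosen.
layer_lo = NormalDist(mu=110, sigma=12)   # erased-state Vth, arbitrary units
layer_hi = NormalDist(mu=210, sigma=15)   # programmed-state Vth
print(best_offset(v_nominal=150, lo=layer_lo, hi=layer_hi))
```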
{"title":"Characterizing and Optimizing LDPC Performance on 3D NAND Flash Memories","authors":"Qiao Li, Yu Chen, Guanyu Wu, Yajuan Du, Min Ye, Xinbiao Gan, jie zhang, Zhirong Shen, Jiwu Shu, Chun Xue","doi":"10.1145/3663478","DOIUrl":"https://doi.org/10.1145/3663478","url":null,"abstract":"<p>With the development of NAND flash memories’ bit density and stacking technologies, while storage capacity keeps increasing, the issue of reliability becomes increasingly prominent. Low-density parity check (LDPC) code, as a robust error-correcting code, is extensively employed in flash memory. However, when the RBER is prohibitively high, LDPC decoding would introduce long latency. To study how LDPC performs on the latest 3D NAND flash memory, we conduct a comprehensive analysis of LDPC decoding performance using both the theoretically derived threshold voltage distribution model obtained through modeling (Modeling-based method) and the actual voltage distribution collected from on-chip data through testing (Ideal case). Based on LDPC decoding results under various interference conditions, we summarize four findings that can help us gain a better understanding of the characteristics of LDPC decoding in 3D NAND flash memory. Following our characterization, we identify the differences in LDPC decoding performance between the Modeling-based method and the Ideal case. Due to the accuracy of initial probability information, the threshold voltage distribution derived through modeling deviates by certain degrees from the actual threshold voltage distribution. This leads to a performance gap between using the threshold voltage distribution derived from the Modeling-based method and the actual distribution. By observing the abnormal behaviors in the decoding with the Modeling-based method, we introduce an Offsetted Read Voltage (<i>Δ</i>RV) method, for optimizing LDPC decoding performance by offsetting the reading voltage in each layer of a flash block. The evaluation results show that our <i>Δ</i>RV method enhances the decoding performance of LDPC on the Modeling-based method by reducing the total number of sensing levels needed for LDPC decoding by 0.67% to 18.92% for different interference conditions on average, under the P/E cycles from 3000 to 7000.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"4 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140828697","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the explosive growth of graph data, distributed graph processing has become popular, and many graph hardware accelerators use distributed frameworks. Graph partitioning is the foundation of distributed graph processing. However, dynamic changes to the graph shift an existing partitioning away from its optimized point and degrade system performance. Therefore, more efficient dynamic graph partitioning methods are needed.
In this work, we propose GraphSER, a dynamic graph partitioning method for many-core systems. To improve cross-node spatial locality and reduce the overhead of repartitioning, we propose a stream-based edge repartition in which each computing node sequentially traverses its local edge list in parallel and migrates edges based on distance and replica degree (see the sketch below). GraphSER needs no costly searching and prioritizes nodes by distance, avoiding poor cross-node spatial locality.
Our evaluation shows that, compared to state-of-the-art edge repartitioning software methods, GraphSER achieves an average speedup of 1.52x, with a maximum of 2x. Compared to the previous many-core hardware repartitioning method, GraphSER improves performance by 40% on average, with a maximum of 117%.
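As referenced above, here is a sketch of the per-edge migration decision; the 2D-mesh distance function, the cost weights, and the replica bookkeeping are illustrative assumptions of ours, not GraphSER's exact formulation.

```python
# Stream step: for edge (u, v), pick a target node that is close to the
# current owner and would create few new vertex replicas.

def manhattan(a, b):
    # Hop distance between two cores on a 2D mesh; a, b are (x, y) tuples.
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def migrate_target(edge, owner, replicas, coords, alpha=1.0, beta=2.0):
    """replicas: vertex -> set of nodes holding a replica;
    coords: node -> (x, y) mesh position. Returns None to keep the edge."""
    u, v = edge

    def cost(n):
        # New replicas this move would create raise the replica degree.
        new_replicas = (n not in replicas[u]) + (n not in replicas[v])
        return alpha * manhattan(coords[owner], coords[n]) + beta * new_replicas

    best = min(coords, key=cost)
    return None if best == owner else best
```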
{"title":"GraphSER: Distance-Aware Stream-Based Edge Repartition for Many-Core Systems","authors":"Junkaixuan Li, Yi Kang","doi":"10.1145/3661998","DOIUrl":"https://doi.org/10.1145/3661998","url":null,"abstract":"<p>With the explosive growth of graph data, distributed graph processing becomes popular and many graph hardware accelerators use distributed frameworks. Graph partitioning is foundation in distributed graph processing. However, dynamic changes in graph make existing partitioning shifted from its optimized points and cause system performance degraded. Therefore, more efficient dynamic graph partition methods are needed. </p><p>In this work, we propose GraphSER, a dynamic graph partition method for many-core systems. In order to improve the cross-node spatial locality and reduce the overhead of repartition, we propose a stream-based edge repartition, in which each computing node sequentially traverses its local edge list in parallel, then migrating edges based on distance and replica degree. GraphSER does not need costly searching and prioritizes nodes so it can avoid poor cross-node spatial locality. </p><p>Our evaluation shows that compared to state-of-the-art edge repartition software methods, GraphSER has an average speedup 1.52x, with the maximum up to 2x. Compared to the previous many-core hardware repartition method, GraphSER performance has an average of 40% improvement, with the maximum to 117%.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"9 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140799625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}