Arun Thangamani, Vincent Loechner, Stéphane Genaud
Since the 1990s, many implementations of polyhedral compilers have been written and distributed, either as source-to-source translators or integrated into general-purpose compilers. This paper surveys the implementations available as of 2024.
We list and describe the most commonly available polyhedral schedulers and compiler implementations. We then compare the general-purpose polyhedral compilers on two main criteria, robustness and performance, using the PolyBench/C benchmark suite.
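To make the scope concrete, the loop nests these compilers target are affine kernels of the PolyBench/C kind. The sketch below is illustrative only, not taken from the survey (the kernel choice and tile size are arbitrary): a gemm-style nest and the tiled schedule a polyhedral scheduler such as Pluto might produce.

```python
# A PolyBench-style affine loop nest (gemm) and an equivalent tiled
# schedule of the kind a polyhedral scheduler can derive automatically.

def gemm(alpha, beta, C, A, B, n):
    # Original nest: every array access is an affine function of i, j, k.
    for i in range(n):
        for j in range(n):
            C[i][j] *= beta
            for k in range(n):
                C[i][j] += alpha * A[i][k] * B[k][j]

def gemm_tiled(alpha, beta, C, A, B, n, T=32):
    # Same iteration set, reordered into tiles for cache locality.
    for i in range(n):
        for j in range(n):
            C[i][j] *= beta
    for it in range(0, n, T):
        for jt in range(0, n, T):
            for kt in range(0, n, T):
                for i in range(it, min(it + T, n)):
                    for j in range(jt, min(jt + T, n)):
                        for k in range(kt, min(kt + T, n)):
                            C[i][j] += alpha * A[i][k] * B[k][j]

if __name__ == "__main__":
    import copy, random
    n = 8
    A = [[random.random() for _ in range(n)] for _ in range(n)]
    B = [[random.random() for _ in range(n)] for _ in range(n)]
    C1 = [[random.random() for _ in range(n)] for _ in range(n)]
    C2 = copy.deepcopy(C1)
    gemm(1.5, 0.5, C1, A, B, n)
    gemm_tiled(1.5, 0.5, C2, A, B, n, T=4)
    # Per-(i, j) accumulation order over k is unchanged, so results match.
    assert all(abs(C1[i][j] - C2[i][j]) < 1e-9
               for i in range(n) for j in range(n))
```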
{"title":"A Survey of General-purpose Polyhedral Compilers","authors":"Arun Thangamani, Vincent Loechner, Stéphane Genaud","doi":"10.1145/3674735","DOIUrl":"https://doi.org/10.1145/3674735","url":null,"abstract":"<p>Since the 1990’s many implementations of polyhedral compilers have been written and distributed, either as source-to-source translating compilers or integrated into wider purpose compilers. This paper provides a survey on those various available implementations as of today, 2024. </p><p>We list and describe most commonly available polyhedral schedulers and compiler implementations. Then, we compare the general-purpose polyhedral compilers using two main criteria, robustness and performance, on the PolyBench/C set of benchmarks.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"18 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141508981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ataberk Olgun, Fatma Bostanci, Geraldo Francisco de Oliveira Junior, Yahya Can Tugrul, Rahul Bera, Abdullah Giray Yaglikci, Hasan Hassan, Oguz Ergin, Onur Mutlu
Modern computing systems access data in main memory at coarse granularity (e.g., at 512-bit cache block granularity). Coarse-grained access leads to wasted energy because the system does not use all individually accessed small portions (e.g., words, each of which typically is 64 bits) of a cache block. In modern DRAM-based computing systems, two key coarse-grained access mechanisms lead to wasted energy: large and fixed-size (i) data transfers between DRAM and the memory controller and (ii) DRAM row activations.
We propose Sectored DRAM, a new, low-overhead DRAM substrate that reduces wasted energy by enabling fine-grained DRAM data transfer and DRAM row activation. To retrieve only useful data from DRAM, Sectored DRAM exploits the observation that many cache blocks are not fully utilized in many workloads due to poor spatial locality. Sectored DRAM predicts the words in a cache block that will likely be accessed during the cache block’s residency in cache and: (i) transfers only the predicted words on the memory channel by dynamically tailoring the DRAM data transfer size for the workload and (ii) activates a smaller set of cells that contain the predicted words by carefully operating physically isolated portions of DRAM rows (i.e., mats). Activating a smaller set of cells on each access relaxes DRAM power delivery constraints and allows the memory controller to schedule DRAM accesses faster.
We evaluate Sectored DRAM using 41 workloads from widely-used benchmark suites. Compared to a system with coarse-grained DRAM, Sectored DRAM reduces the DRAM energy consumption of highly-memory-intensive workloads by up to (on average) 33% (20%) while improving their performance by up to (on average) 36% (17%). Sectored DRAM’s DRAM energy savings, combined with its system performance improvement, allows system-wide energy savings of up to 23%. Sectored DRAM’s DRAM chip area overhead is 1.7% of the area of a modern DDR4 chip. Compared to state-of-the-art fine-grained DRAM architectures, Sectored DRAM greatly reduces DRAM energy consumption, does not reduce DRAM bandwidth, and can be implemented with low hardware cost. Sectored DRAM provides 89% of the performance benefits of, consumes 12% less DRAM energy than, and takes up 34% less DRAM chip area than a high-performance state-of-the-art fine-grained DRAM architecture (Half-DRAM). We hope and believe that Sectored DRAM’s ideas and results will help to enable more efficient and high-performance memory systems. To this end, we open source Sectored DRAM at https://github.com/CMU-SAFARI/Sectored-DRAM.
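As an illustration of the prediction step, the following minimal sketch is our assumption of a plausible predictor organization, not the paper's exact design; the PC-indexed table and the 8-bit sector mask are hypothetical. It tracks which 64-bit words of a 512-bit block were used during residency and predicts the words to transfer on the next miss.

```python
# Minimal sketch: word-granularity usage prediction for 512-bit cache
# blocks made of eight 64-bit words (table organization is an assumption).

WORDS_PER_BLOCK = 8  # 512-bit block / 64-bit words

class SectorPredictor:
    def __init__(self):
        self.table = {}  # load PC -> last observed word-usage bitmask

    def predict(self, pc):
        # Fetch only the predicted words; default to the full block.
        return self.table.get(pc, (1 << WORDS_PER_BLOCK) - 1)

    def train(self, pc, used_mask):
        # On eviction, record which words were touched during residency.
        self.table[pc] = used_mask

pred = SectorPredictor()
mask = pred.predict(pc=0x401A2C)          # first access: fetch all 8 words
pred.train(pc=0x401A2C, used_mask=0b11)   # only words 0 and 1 were used
assert pred.predict(pc=0x401A2C) == 0b11  # next miss transfers 2 words
```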
{"title":"Sectored DRAM: A Practical Energy-Efficient and High-Performance Fine-Grained DRAM Architecture","authors":"Ataberk Olgun, Fatma Bostanci, Geraldo Francisco de Oliveira Junior, Yahya Can Tugrul, Rahul Bera, Abdullah Giray Yaglikci, Hasan Hassan, Oguz Ergin, Onur Mutlu","doi":"10.1145/3673653","DOIUrl":"https://doi.org/10.1145/3673653","url":null,"abstract":"<p>Modern computing systems access data in main memory at <i>coarse granularity</i> (e.g., at 512-bit cache block granularity). Coarse-grained access leads to wasted energy because the system does <i>not</i> use all individually accessed small portions (e.g., <i>words</i>, each of which typically is 64 bits) of a cache block. In modern DRAM-based computing systems, two key coarse-grained access mechanisms lead to wasted energy: large and fixed-size (i) data transfers between DRAM and the memory controller and (ii) DRAM row activations. </p><p>We propose Sectored DRAM, a new, low-overhead DRAM substrate that reduces wasted energy by enabling <i>fine-grained</i> DRAM data transfer and DRAM row activation. To retrieve only useful data from DRAM, Sectored DRAM exploits the observation that many cache blocks are not fully utilized in many workloads due to poor spatial locality. Sectored DRAM predicts the words in a cache block that will likely be accessed during the cache block’s residency in cache and: (i) transfers only the predicted words on the memory channel by dynamically tailoring the DRAM data transfer size for the workload and (ii) activates a smaller set of cells that contain the predicted words by carefully operating physically isolated portions of DRAM rows (i.e., mats). Activating a smaller set of cells on each access relaxes DRAM power delivery constraints and allows the memory controller to schedule DRAM accesses faster. </p><p>We evaluate Sectored DRAM using 41 workloads from widely-used benchmark suites. Compared to a system with coarse-grained DRAM, Sectored DRAM reduces the DRAM energy consumption of highly-memory-intensive workloads by up to (on average) 33% (20%) while improving their performance by up to (on average) 36% (17%). Sectored DRAM’s DRAM energy savings, combined with its system performance improvement, allows system-wide energy savings of up to 23%. Sectored DRAM’s DRAM chip area overhead is 1.7% of the area of a modern DDR4 chip. Compared to state-of-the-art fine-grained DRAM architectures, Sectored DRAM greatly reduces DRAM energy consumption, does <i>not</i> reduce DRAM bandwidth, and can be implemented with low hardware cost. Sectored DRAM provides 89% of the performance benefits of, consumes 12% less DRAM energy than, and takes up 34% less DRAM chip area than a high-performance state-of-the-art fine-grained DRAM architecture (Half-DRAM). We hope and believe that Sectored DRAM’s ideas and results will help to enable more efficient and high-performance memory systems. 
To this end, we open source Sectored DRAM at https://github.com/CMU-SAFARI/Sectored-DRAM.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"94 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141508982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kai Lu, Siqi Zhao, Haikang Shan, Qiang Wei, Guokuan Li, Jiguang Wan, Ting Yao, Huatao Wu, Daohui Wang
Disaggregated memory separates compute and memory resources into independent pools connected by RDMA (Remote Direct Memory Access) networks, which can improve memory utilization, reduce cost, and enable elastic scaling of compute and memory resources. However, existing RDMA-based distributed transactions on disaggregated memory suffer from severe long-tail latency under high-contention workloads.
In this paper, we propose Scythe, a novel low-latency RDMA-enabled distributed transaction system for disaggregated memory. Scythe optimizes the latency of high-contention transactions in three ways: (1) a hot-aware concurrency control policy that uses optimistic concurrency control (OCC) to improve transaction processing efficiency in low-conflict scenarios and, under high conflict, a timestamp-ordered OCC (TOCC) strategy based on fair locking to reduce the number of retries and the cross-node communication overhead; (2) an RDMA-friendly timestamp service for improved timestamp management; and (3) an RDMA-optimized RPC framework that improves RDMA bandwidth utilization. The evaluation results show that, compared to state-of-the-art distributed transaction systems, Scythe achieves more than 2.5× lower latency and 1.8× higher throughput under high-contention workloads.
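A minimal, single-machine sketch of the hot-aware policy switch follows; the abort threshold, names, and in-memory store are our illustrative assumptions (Scythe itself operates over RDMA with fair, timestamp-ordered locks).

```python
# Sketch: keys that abort often under OCC are treated as hot and handled
# on a pessimistic locked path instead of retrying optimistically.

import threading
from collections import defaultdict

HOT_ABORT_THRESHOLD = 3               # assumed tuning knob

store = {}                            # key -> (value, version)
abort_count = defaultdict(int)        # key -> OCC aborts seen so far
locks = defaultdict(threading.Lock)   # per-key lock for the hot path

def occ_update(key, fn):
    # Optimistic path: read a version, compute, validate at commit time.
    value, version = store.get(key, (0, 0))
    new_value = fn(value)
    if store.get(key, (0, 0))[1] != version:   # concurrent commit happened
        abort_count[key] += 1
        return False
    store[key] = (new_value, version + 1)
    return True

def update(key, fn):
    if abort_count[key] < HOT_ABORT_THRESHOLD:
        return occ_update(key, fn)        # low conflict: stay optimistic
    with locks[key]:                      # hot key: lock, no retries
        value, version = store.get(key, (0, 0))
        store[key] = (fn(value), version + 1)
        return True
```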
{"title":"Scythe: A Low-latency RDMA-enabled Distributed Transaction System for Disaggregated Memory","authors":"Kai Lu, Siqi Zhao, Haikang Shan, Qiang Wei, Guokuan Li, Jiguang Wan, Ting Yao, Huatao Wu, Daohui Wang","doi":"10.1145/3666004","DOIUrl":"https://doi.org/10.1145/3666004","url":null,"abstract":"<p>Disaggregated memory separates compute and memory resources into independent pools connected by RDMA (Remote Direct Memory Access) networks, which can improve memory utilization, reduce cost, and enable elastic scaling of compute and memory resources. However, existing RDMA-based distributed transactions on disaggregated memory suffer from severe long-tail latency under high-contention workloads. </p><p>In this paper, we propose Scythe, a novel low-latency RDMA-enabled distributed transaction system for disaggregated memory. Scythe optimizes the latency of high-contention transactions in three approaches: 1) Scythe proposes a hot-aware concurrency control policy that uses optimistic concurrency control (OCC) to improve transaction processing efficiency in low-conflict scenarios. Under high conflicts, Scythe designs a timestamp-ordered OCC (TOCC) strategy based on fair locking to reduce the number of retries and cross-node communication overhead. 2) Scythe presents an RDMA-friendly timestamp service for improved timestamp management. 3) Scythe designs an RDMA-optimized RPC framework to improve RDMA bandwidth utilization. The evaluation results show that, compared to state-of-the-art distributed transaction systems, Scythe achieves more than 2.5 × lower latency with 1.8 × higher throughput under high-contention workloads.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"21 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141171974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DRAM memory is a performance bottleneck for many applications due to its high access latency. Previous work has mainly focused on data locality, introducing small but fast regions that cache frequently accessed data to reduce the average latency. However, these locality-based designs face three challenges in modern multi-core systems: (1) inter-application interference leads to random memory access traffic, (2) fairness issues prevent the memory controller from over-prioritizing data locality, and (3) write-intensive applications have much lower locality and evict substantial numbers of dirty entries. With frequent data movement between the fast in-DRAM cache and the slow regular arrays, the overhead of moving data may even offset the performance and energy benefits of in-DRAM caching.
In this article, we decouple the data movement process into two distinct phases. The first phase is Load-Reduced Destructive Activation (LRDA), which destructively promotes data into the in-DRAM cache. The second phase is Delayed Cycle-Stealing Restoration (DCSR), which restores the original data when the DRAM bank is idle. LRDA decouples the most time-consuming restoration phase from activation, and DCSR hides the restoration latency through prevalent bank-level parallelism. We propose FASA-DRAM, incorporating destructive activation and delayed restoration techniques to enable both in-DRAM caching and proactive latency-hiding mechanisms. Our evaluation shows that FASA-DRAM improves the average performance by 19.9% and reduces average DRAM energy consumption by 18.1% over DDR4 DRAM for four-core workloads, with less than 3.4% extra area overhead. Furthermore, FASA-DRAM outperforms state-of-the-art designs in both performance and energy efficiency.
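The decoupling can be pictured with a toy single-bank model; this is an illustrative analogy under our own assumptions, not the authors' hardware. Demand activations skip the restore phase (LRDA), and the deferred restore is replayed when the bank goes idle (DCSR).

```python
# Toy bank timeline: LRDA defers restoration; DCSR steals idle cycles.

import collections

class Bank:
    def __init__(self):
        self.demand = collections.deque()  # queued row activations
        self.pending_restore = None        # row awaiting restoration

    def tick(self):
        if self.demand:
            row = self.demand.popleft()
            if self.pending_restore not in (None, row):
                # Cannot destroy a second row before restoring the first.
                self.demand.appendleft(row)
                row, self.pending_restore = self.pending_restore, None
                return f"restore row {row}"
            self.pending_restore = row     # LRDA: skip the restore phase
            return f"activate row {row} (no restore)"
        if self.pending_restore is not None:
            row, self.pending_restore = self.pending_restore, None
            return f"restore row {row} (idle cycle)"   # DCSR
        return "idle"

b = Bank()
b.demand.extend([3, 3, 7])
for _ in range(5):
    print(b.tick())
# activate row 3 (no restore) / activate row 3 (no restore) /
# restore row 3 / activate row 7 (no restore) / restore row 7 (idle cycle)
```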
{"title":"FASA-DRAM: Reducing DRAM Latency with Destructive Activation and Delayed Restoration","authors":"Haitao Du, Yuhan Qin, Song Chen, Yi Kang","doi":"10.1145/3649455","DOIUrl":"https://doi.org/10.1145/3649455","url":null,"abstract":"<p>DRAM memory is a performance bottleneck for many applications, due to its high access latency. Previous work has mainly focused on data locality, introducing small but fast regions to cache frequently accessed data, thereby reducing the average latency. However, these locality-based designs have three challenges in modern multi-core systems: (1) inter-application interference leads to random memory access traffic, (2) fairness issues prevent the memory controller from over-prioritizing data locality, and (3) write-intensive applications have much lower locality and evict substantial dirty entries. With frequent data movement between the fast in-DRAM cache and slow regular arrays, the overhead induced by moving data may even offset the performance and energy benefits of in-DRAM caching.</p><p>In this article, we decouple the data movement process into two distinct phases. The first phase is Load-Reduced Destructive Activation (LRDA), which destructively promotes data into the in-DRAM cache. The second phase is Delayed Cycle-Stealing Restoration (DCSR), which restores the original data when the DRAM bank is idle. LRDA decouples the most time-consuming restoration phase from activation, and DCSR hides the restoration latency through prevalent bank-level parallelism. We propose FASA-DRAM, incorporating destructive activation and delayed restoration techniques to enable both in-DRAM caching and proactive latency-hiding mechanisms. Our evaluation shows that FASA-DRAM improves the average performance by 19.9% and reduces average DRAM energy consumption by 18.1% over DDR4 DRAM for four-core workloads, with less than 3.4% extra area overhead. Furthermore, FASA-DRAM outperforms state-of-the-art designs in both performance and energy efficiency.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"35 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141147023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bobin Deng, Bhargava Nadendla, Kun Suo, Yixin Xie, Dan Chia-Tien Lo
Residue Number Systems (RNS) show fascinating potential for integer addition/multiplication-intensive applications. The complexity of Artificial Intelligence (AI) models has grown enormously in recent years, and from a computer system's perspective, training these large-scale AI models within acceptable time and energy budgets has become a major concern. Matrix multiplication is a dominant subroutine in many prevailing AI models, with an addition/multiplication-intensive profile. However, matrix multiplication in machine-learning training typically operates on real numbers, so the RNS benefits for integer applications cannot be directly obtained in AI training. State-of-the-art RNS real-number encodings, both floating-point and fixed-point, have defects and can be further enhanced. To translate the inherent RNS benefits into efficient large-scale AI training, we propose a low-cost, high-accuracy RNS fixed-point representation: Single RNS Logical Partition (S-RNS-Logic-P) representation with Scaling Down Postprocessing Multiplication (SD-Post-Mul). Moreover, we detail the implementation of two other RNS fixed-point methods: Double RNS Concatenation (D-RNS-Concat) and S-RNS-Logic-P representation with Scaling Down Preprocessing Multiplication (SD-Pre-Mul). We also design the architectures of these three fixed-point multipliers. In empirical experiments, our S-RNS-Logic-P representation with SD-Post-Mul achieves lower latency and energy overhead while maintaining good accuracy. Furthermore, this method extends easily to the Redundant Residue Number System (RRNS), raising efficiency in error-tolerant domains, such as improving the error-correction efficiency of quantum computing.
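For intuition, here is a worked sketch of RNS fixed-point multiplication with scaling applied after the product is formed, in the spirit of SD-Post-Mul. It is greatly simplified: scaling happens via CRT reconstruction in software rather than the paper's hardware datapath, the moduli are arbitrary, and negative values are not handled.

```python
# RNS fixed-point multiply: encode to residues, multiply channel-wise,
# reconstruct via CRT, then scale down by the fixed-point factor.

from math import prod

MODULI = (251, 253, 255, 256)          # pairwise coprime
M = prod(MODULI)                       # dynamic range [0, M)
FRAC_BITS = 8                          # fixed-point scale factor 2^8

def encode(x: int):                    # integer -> residue vector
    return tuple(x % m for m in MODULI)

def crt_decode(res):                   # residue vector -> integer in [0, M)
    total = 0
    for r, m in zip(res, MODULI):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)   # modular inverse (Python 3.8+)
    return total % M

def fixmul(a: float, b: float) -> float:
    xa, xb = int(a * 2**FRAC_BITS), int(b * 2**FRAC_BITS)
    res = tuple((ra * rb) % m for ra, rb in zip(encode(xa), encode(xb)))
    # Scale the double-width product down by 2^FRAC_BITS *after* multiply.
    return (crt_decode(res) >> FRAC_BITS) / 2**FRAC_BITS

assert abs(fixmul(1.5, 2.25) - 3.375) < 2**-FRAC_BITS
```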
{"title":"Fixed-point Encoding and Architecture Exploration for Residue Number Systems","authors":"Bobin Deng, Bhargava Nadendla, Kun Suo, Yixin Xie, Dan Chia-Tien Lo","doi":"10.1145/3664923","DOIUrl":"https://doi.org/10.1145/3664923","url":null,"abstract":"<p>Residue Number Systems (RNS) demonstrate the fascinating potential to serve integer addition/multiplication-intensive applications. The complexity of Artificial Intelligence (AI) models has grown enormously in recent years. From a computer system’s perspective, ensuring the training of these large-scale AI models within an adequate time and energy consumption has become a big concern. Matrix multiplication is a dominant subroutine in many prevailing AI models, with an addition/multiplication-intensive attribute. However, the data type of matrix multiplication within machine learning training typically requires real numbers, which indicates that RNS benefits for integer applications cannot be directly gained by AI training. The state-of-the-art RNS real number encodings, including floating-point and fixed-point, have defects and can be further enhanced. To transform default RNS benefits to the efficiency of large-scale AI training, we propose a low-cost and high-accuracy RNS fixed-point representation: <i>Single RNS Logical Partition (S-RNS-Logic-P) representation with Scaling Down Postprocessing Multiplication (SD-Post-Mul)</i>. Moreover, we extend the implementation details of the other two RNS fixed-point methods: <i>Double RNS Concatenation (D-RNS-Concat)</i> and <i>Single RNS Logical Partition (S-RNS-Logic-P) representation with Scaling Down Preprocessing Multiplication (SD-Pre-Mul)</i>. We also design the architectures of these three fixed-point multipliers. In empirical experiments, our <i>S-RNS-Logic-P representation with SD-Post-Mul</i> method achieves less latency and energy overhead while maintaining good accuracy. Furthermore, this method can easily extend to the Redundant Residue Number System (RRNS) to raise the efficiency of error-tolerant domains, such as improving the error correction efficiency of quantum computing.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"147 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140931088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dongmoon Min, Ilkwon Byun, Gyu-hyeon Lee, Jangwoo Kim
For datacenter architects, the most important goal is to minimize the datacenter's total cost of ownership for a target performance (i.e., TCO/performance). As the major component of a datacenter is its server farm, the most effective way of reducing TCO/performance is to improve server performance and power efficiency. To achieve this goal, we claim that reducing each server's temperature to its most cost-effective point (temperature scaling) is highly promising.
In this paper, we propose CoolDC, a novel, immediately applicable low-temperature cooling method to minimize the datacenter's TCO. The key idea is to find and apply the most cost-effective sub-freezing temperature to target servers and workloads. For that purpose, we first apply immersion cooling to entire servers to maintain a stable low temperature with little extra cooling or maintenance cost. Second, we define the TCO-optimal temperature for datacenter operation (e.g., 248 K to 273 K (-25°C to 0°C)) by carefully estimating all the costs and benefits at low temperatures. Finally, we propose CoolDC, our immersion-cooling datacenter architecture that runs every workload at its own TCO-optimal temperature. By incorporating our low-temperature, workload-aware temperature scaling, CoolDC achieves 12.7% and 13.4% lower TCO/performance than conventional air-cooled and immersion-cooled datacenters, respectively, without any modification to existing computers.
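The temperature-scaling decision can be sketched as a one-dimensional search over candidate temperatures under assumed cost models; both curves below are stand-ins of our own, not CoolDC's calibrated models.

```python
# Per-workload TCO/performance minimization over candidate temperatures.

from dataclasses import dataclass

@dataclass
class Workload:
    server_cost: float           # normalized server TCO at ambient (298 K)
    cooling_per_kelvin: float    # extra cooling cost per kelvin below 298 K
    perf_gain_per_kelvin: float  # fractional speedup per kelvin of cooling

    def perf_at(self, temp_k):
        # Assumed linear speedup as temperature drops (faster transistors).
        return 1.0 + self.perf_gain_per_kelvin * (298.0 - temp_k)

def tco_per_perf(w, temp_k):
    cooling = w.cooling_per_kelvin * max(0.0, 298.0 - temp_k)
    return (w.server_cost + cooling) / w.perf_at(temp_k)

def optimal_temperature(w, candidates=range(248, 299)):
    # Scan the sub-freezing range the paper cites (248 K-273 K) to ambient.
    return min(candidates, key=lambda t: tco_per_perf(w, t))

cpu_bound = Workload(server_cost=1.0, cooling_per_kelvin=0.002,
                     perf_gain_per_kelvin=0.01)
print(optimal_temperature(cpu_bound))   # colder pays off for this workload
```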
{"title":"CoolDC: A Cost-Effective Immersion-Cooled Datacenter with Workload-Aware Temperature Scaling","authors":"Dongmoon Min, Ilkwon Byun, Gyu-hyeon Lee, Jangwoo Kim","doi":"10.1145/3664925","DOIUrl":"https://doi.org/10.1145/3664925","url":null,"abstract":"<p>For datacenter architects, it is the most important goal to minimize <i>the datacenter’s total cost of ownership for the target performance</i> (i.e., TCO/performance). As the major component of a datacenter is a server farm, the most effective way of reducing TCO/performance is to improve the server’s performance and power efficiency. To achieve the goal, we claim that it is highly promising to reduce each server’s temperature to its most cost-effective point (or temperature scaling). </p><p>In this paper, we propose <i>CoolDC</i>, a novel and immediately-applicable low-temperature cooling method to minimize the datacenter’s TCO. The key idea is to find and apply the most cost-effective sub-freezing temperature to target servers and workloads. For that purpose, we first apply the immersion cooling method to the entire servers to maintain a stable low temperature with little extra cooling and maintenance costs. Second, we define the TCO-optimal temperature for datacenter operation (e.g., 248K~273K (-25℃~0℃)) by carefully estimating all the costs and benefits at low temperatures. Finally, we propose CoolDC, our immersion-cooling datacenter architecture to run every workload at its own TCO-optimal temperature. By incorporating our low-temperature workload-aware temperature scaling, CoolDC achieves 12.7% and 13.4% lower TCO/performance than the conventional air-cooled and immersion-cooled datacenters, respectively, without any modification to existing computers.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"54 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140931024","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
More and more storage systems use erasure codes to tolerate faults. An erasure code takes a set of data blocks as input and encodes a small number of parity blocks as output; together, these blocks form a stripe. When reconsidering the recovery problem at the multi-stripe level and in heterogeneous network clusters, quickly generating an efficient multi-stripe recovery solution that reduces recovery time remains challenging and time-consuming. Previous works either use a greedy algorithm, which may fall into a local optimum and yield low recovery performance, or a meta-heuristic algorithm, which has a long running time and low solution-generation efficiency.
In this paper, we propose a Stripe-schedule Aware Repair (SARepair) technique for multi-stripe recovery in heterogeneous erasure-coded clusters based on RS codes. By carefully examining block metadata, SARepair intelligently adjusts the recovery solution for each stripe and obtains another multi-stripe solution with less recovery time in a computationally efficient manner. It then tolerates worse solutions to escape local optima and uses a rollback mechanism to adjust search regions and further reduce recovery time. Moreover, instead of reading blocks sequentially from each node, SARepair selectively schedules the reading order for each block to reduce memory overhead. We extend SARepair to address full-node recovery and adapt it to the LRC code. We prototype SARepair and show, via both simulations and Amazon EC2 experiments, that recovery performance improves by up to 59.97% over a state-of-the-art recovery approach while keeping running time and memory overhead low.
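The search skeleton, tolerating slightly worse neighbors while remembering the best solution seen, might look like the following sketch. The toy cost model, the single-stripe move, and all parameters are our assumptions, not SARepair's exact algorithm.

```python
# Local search over per-stripe helper choices: recovery time is modeled as
# the max over nodes of blocks read divided by that node's bandwidth.

import random

def recovery_time(choice, options, bandwidth):
    load = {n: 0.0 for n in bandwidth}
    for s, idx in enumerate(choice):
        for node in options[s][idx]:      # helper nodes read for stripe s
            load[node] += 1.0             # one block read from that node
    return max(reads / bandwidth[n] for n, reads in load.items())

def sarepair_search(options, bandwidth, steps=500, tol=1.05, seed=0):
    rng = random.Random(seed)
    cur = [0] * len(options)              # naive starting solution
    best, best_t = list(cur), recovery_time(cur, options, bandwidth)
    for _ in range(steps):
        s = rng.randrange(len(options))   # re-pick one stripe's helpers
        cand = list(cur)
        cand[s] = rng.randrange(len(options[s]))
        if (recovery_time(cand, options, bandwidth)
                <= tol * recovery_time(cur, options, bandwidth)):
            cur = cand                    # tolerate slightly worse moves
        t = recovery_time(cur, options, bandwidth)
        if t < best_t:
            best, best_t = list(cur), t   # remember the best-seen solution
    return best                           # "rollback": return best, not cur
```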
{"title":"Stripe-schedule Aware Repair in Erasure-coded Clusters with Heterogeneous Star Networks","authors":"Hai Zhou, Dan Feng","doi":"10.1145/3664926","DOIUrl":"https://doi.org/10.1145/3664926","url":null,"abstract":"<p>More and more storage systems use erasure code to tolerate faults. It takes pieces of data blocks as input and encodes a small number of parity blocks as output, where these blocks form a stripe. When reconsidering the recovery problem in the multi-stripe level and heterogeneous network clusters, quickly generating an efficient multi-stripe recovery solution that reduces recovery time remains a challenging and time-consuming task. Previous works either use a greedy algorithm that may fall into the local optimal and have low recovery performance or a meta-heuristic algorithm with a long running time and low solution generation efficiency. </p><p>In this paper, we propose a <i>Stripe-schedule Aware Repair</i> (SARepair) technique for multi-stripe recovery in heterogeneous erasure-coded clusters based on RS code. By carefully examining the metadata of blocks, SARepair intelligently adjusts the recovery solution for each stripe and obtains another multi-stripe solution with less recovery time in a computationally efficient manner. It then tolerates worse solutions to overcome the local optimal and uses a rollback mechanism to adjust search regions to reduce recovery time further. Moreover, instead of reading blocks sequentially from each node, SARepair also selectively schedules the reading order for each block to reduce the memory overhead. We extend SARepair to address the full-node recovery and adapt to the LRC code. We prototype SARepair and show via both simulations and Amazon EC2 experiments that the recovery performance can be improved by up to 59.97% over a state-of-the-art recovery approach while keeping running time and memory overhead low.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"42 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140931025","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The growing memory demands of modern applications have driven the adoption of far memory technologies in data centers to provide cost-effective, high-capacity memory solutions. However, far memory presents new performance challenges because its access latencies are significantly longer and more variable than local DRAM. For applications to achieve acceptable performance on far memory, a high degree of memory-level parallelism (MLP) is needed to tolerate the long access latency.
While modern out-of-order processors are capable of exploiting a certain degree of MLP, they are constrained by resource limitations and hardware complexity. The key obstacle is the synchronous memory access semantics of traditional load/store instructions, which occupy critical hardware resources for a long time. The longer far memory latencies exacerbate this limitation.
This paper proposes a set of Asynchronous Memory Access Instructions (AMI) and a supporting functional unit, the Asynchronous Memory Access Unit (AMU), inside a contemporary out-of-order core. AMI separates memory request issuing from response handling to reduce resource occupation. Additionally, the AMU architecture supports up to several hundred asynchronous memory requests by re-purposing a portion of the L2 cache as scratchpad memory (SPM) to provide sufficient temporary storage. Together with a coroutine-based programming framework, this scheme achieves significantly higher MLP for hiding far memory latencies.
Evaluation with a cycle-accurate simulation shows AMI achieves a 2.42× speedup on average for memory-bound benchmarks with 1 μs of additional far memory latency. Over 130 outstanding requests are supported, with a 26.86× speedup for GUPS (random access) at 5 μs latency. These results demonstrate how the proposed techniques mitigate the performance impact of far memory through explicit MLP expression and latency adaptation.
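A software analogy of the decoupled issue/response style (not the AMI ISA itself, and with a purely illustrative latency model) is a coroutine program that issues all far-memory loads before waiting on any response, overlapping their latencies.

```python
# Coroutine-style MLP: many outstanding far-memory loads overlap a single
# microsecond-scale latency instead of paying it serially per access.

import asyncio
import random

FAR_LATENCY_S = 1e-6                 # modeled 1 microsecond far-memory access

async def far_load(addr):
    await asyncio.sleep(FAR_LATENCY_S)   # request in flight, core stays free
    return addr * 2                      # stand-in for returned data

async def gups(table_size=1024, updates=256):
    addrs = [random.randrange(table_size) for _ in range(updates)]
    # Issue all requests before waiting on any response: high MLP.
    values = await asyncio.gather(*(far_load(a) for a in addrs))
    return sum(values)

print(asyncio.run(gups()))
```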
{"title":"Asynchronous Memory Access Unit: Exploiting Massive Parallelism for Far Memory Access","authors":"Luming Wang, Xu Zhang, Songyue Wang, Zhuolun Jiang, Tianyue Lu, Mingyu Chen, Siwei Luo, Keji Huang","doi":"10.1145/3663479","DOIUrl":"https://doi.org/10.1145/3663479","url":null,"abstract":"<p>The growing memory demands of modern applications have driven the adoption of far memory technologies in data centers to provide cost-effective, high-capacity memory solutions. However, far memory presents new performance challenges because its access latencies are significantly longer and more variable than local DRAM. For applications to achieve acceptable performance on far memory, a high degree of memory-level parallelism (MLP) is needed to tolerate the long access latency. </p><p>While modern out-of-order processors are capable of exploiting a certain degree of MLP, they are constrained by resource limitations and hardware complexity. The key obstacle is the synchronous memory access semantics of traditional load/store instructions, which occupy critical hardware resources for a long time. The longer far memory latencies exacerbate this limitation. </p><p>This paper proposes a set of Asynchronous Memory Access Instructions (AMI) and its supporting function unit, Asynchronous Memory Access Unit (AMU), inside contemporary Out-of-Order Core. AMI separates memory request issuing from response handling to reduce resource occupation. Additionally, AMU architecture supports up to several hundreds of asynchronous memory requests through re-purposing a portion of L2 Cache as scratchpad memory (SPM) to provide sufficient temporal storage. Together with a coroutine-based programming framework, this scheme can achieve significantly higher MLP for hiding far memory latencies. </p><p>Evaluation with a cycle-accurate simulation shows AMI achieves 2.42 × speedup on average for memory-bound benchmarks with 1<i>μ</i>s additional far memory latency. Over 130 outstanding requests are supported with 26.86 × speedup for GUPS (random access) with 5 <i>μ</i>s latency. These demonstrate how the techniques tackle far memory performance impacts through explicit MLP expression and latency adaptation.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"6 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140936310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qiao Li, Yu Chen, Guanyu Wu, Yajuan Du, Min Ye, Xinbiao Gan, Jie Zhang, Zhirong Shen, Jiwu Shu, Chun Xue
As NAND flash memories' bit density and stacking technologies develop, storage capacity keeps increasing while reliability becomes an increasingly prominent issue. Low-density parity check (LDPC) codes, as robust error-correcting codes, are extensively employed in flash memory. However, when the raw bit error rate (RBER) is prohibitively high, LDPC decoding introduces long latency. To study how LDPC performs on the latest 3D NAND flash memory, we conduct a comprehensive analysis of LDPC decoding performance using both the threshold voltage distribution derived theoretically through modeling (the Modeling-based method) and the actual voltage distribution collected from on-chip data through testing (the Ideal case). Based on LDPC decoding results under various interference conditions, we summarize four findings that help build a better understanding of the characteristics of LDPC decoding in 3D NAND flash memory. Following our characterization, we identify differences in LDPC decoding performance between the Modeling-based method and the Ideal case. The threshold voltage distribution derived through modeling deviates to some degree from the actual threshold voltage distribution, which degrades the accuracy of the decoder's initial probability information and leads to a performance gap between using the Modeling-based distribution and the actual distribution. Observing the abnormal decoding behaviors of the Modeling-based method, we introduce an Offsetted Read Voltage (ΔRV) method that optimizes LDPC decoding performance by offsetting the read voltage in each layer of a flash block. The evaluation results show that our ΔRV method enhances the LDPC decoding performance of the Modeling-based method, reducing the total number of sensing levels needed for decoding by 0.67% to 18.92% on average across different interference conditions, under P/E cycles from 3000 to 7000.
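The per-layer offset selection can be illustrated with a toy two-state model. This is an assumption for exposition only: Gaussian threshold-voltage states and an integer-step sweep, whereas the real method works on the modeled multi-level distributions of each 3D NAND layer.

```python
# Choose a per-layer read-voltage offset that minimizes expected raw bit
# errors between two Gaussian threshold-voltage states.

from statistics import NormalDist

def expected_bit_errors(v_read, lo, hi):
    # Errors: lo-state cells read above v_read plus hi-state cells below it.
    return (1 - lo.cdf(v_read)) + hi.cdf(v_read)

def best_offset(v_nominal, lo, hi, sweep=range(-40, 41)):
    # Sweep candidate offsets around the nominal read voltage and keep
    # the one minimizing the expected raw bit errors.
    return min(sweep, key=lambda d: expected_bit_errors(v_nominal + d, lo, hi))

# A layer whose distributions drifted upward: a positive offset is chosen.
layer_lo = NormalDist(mu=110, sigma=12)   # erased-state Vth, arbitrary units
layer_hi = NormalDist(mu=210, sigma=15)   # programmed-state Vth
print(best_offset(v_nominal=150, lo=layer_lo, hi=layer_hi))
```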
{"title":"Characterizing and Optimizing LDPC Performance on 3D NAND Flash Memories","authors":"Qiao Li, Yu Chen, Guanyu Wu, Yajuan Du, Min Ye, Xinbiao Gan, jie zhang, Zhirong Shen, Jiwu Shu, Chun Xue","doi":"10.1145/3663478","DOIUrl":"https://doi.org/10.1145/3663478","url":null,"abstract":"<p>With the development of NAND flash memories’ bit density and stacking technologies, while storage capacity keeps increasing, the issue of reliability becomes increasingly prominent. Low-density parity check (LDPC) code, as a robust error-correcting code, is extensively employed in flash memory. However, when the RBER is prohibitively high, LDPC decoding would introduce long latency. To study how LDPC performs on the latest 3D NAND flash memory, we conduct a comprehensive analysis of LDPC decoding performance using both the theoretically derived threshold voltage distribution model obtained through modeling (Modeling-based method) and the actual voltage distribution collected from on-chip data through testing (Ideal case). Based on LDPC decoding results under various interference conditions, we summarize four findings that can help us gain a better understanding of the characteristics of LDPC decoding in 3D NAND flash memory. Following our characterization, we identify the differences in LDPC decoding performance between the Modeling-based method and the Ideal case. Due to the accuracy of initial probability information, the threshold voltage distribution derived through modeling deviates by certain degrees from the actual threshold voltage distribution. This leads to a performance gap between using the threshold voltage distribution derived from the Modeling-based method and the actual distribution. By observing the abnormal behaviors in the decoding with the Modeling-based method, we introduce an Offsetted Read Voltage (<i>Δ</i>RV) method, for optimizing LDPC decoding performance by offsetting the reading voltage in each layer of a flash block. The evaluation results show that our <i>Δ</i>RV method enhances the decoding performance of LDPC on the Modeling-based method by reducing the total number of sensing levels needed for LDPC decoding by 0.67% to 18.92% for different interference conditions on average, under the P/E cycles from 3000 to 7000.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"4 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140828697","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the explosive growth of graph data, distributed graph processing has become popular, and many graph hardware accelerators use distributed frameworks. Graph partitioning is the foundation of distributed graph processing. However, dynamic changes to the graph shift an existing partitioning away from its optimized point and degrade system performance. Therefore, more efficient dynamic graph partitioning methods are needed.
In this work, we propose GraphSER, a dynamic graph partitioning method for many-core systems. To improve cross-node spatial locality and reduce the overhead of repartitioning, we propose a stream-based edge repartition in which each computing node sequentially traverses its local edge list in parallel and migrates edges based on distance and replica degree (see the sketch below). GraphSER needs no costly searching and prioritizes nodes by distance, avoiding poor cross-node spatial locality.
Our evaluation shows that, compared to state-of-the-art edge repartitioning software methods, GraphSER achieves an average speedup of 1.52x, with a maximum of 2x. Compared to the previous many-core hardware repartitioning method, GraphSER improves performance by 40% on average, with a maximum of 117%.
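As referenced above, here is a sketch of the per-edge migration decision; the 2D-mesh distance function, the cost weights, and the replica bookkeeping are illustrative assumptions of ours, not GraphSER's exact formulation.

```python
# Stream step: for edge (u, v), pick a target node that is close to the
# current owner and would create few new vertex replicas.

def manhattan(a, b):
    # Hop distance between two cores on a 2D mesh; a, b are (x, y) tuples.
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def migrate_target(edge, owner, replicas, coords, alpha=1.0, beta=2.0):
    """replicas: vertex -> set of nodes holding a replica;
    coords: node -> (x, y) mesh position. Returns None to keep the edge."""
    u, v = edge

    def cost(n):
        # New replicas this move would create raise the replica degree.
        new_replicas = (n not in replicas[u]) + (n not in replicas[v])
        return alpha * manhattan(coords[owner], coords[n]) + beta * new_replicas

    best = min(coords, key=cost)
    return None if best == owner else best
```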
{"title":"GraphSER: Distance-Aware Stream-Based Edge Repartition for Many-Core Systems","authors":"Junkaixuan Li, Yi Kang","doi":"10.1145/3661998","DOIUrl":"https://doi.org/10.1145/3661998","url":null,"abstract":"<p>With the explosive growth of graph data, distributed graph processing becomes popular and many graph hardware accelerators use distributed frameworks. Graph partitioning is foundation in distributed graph processing. However, dynamic changes in graph make existing partitioning shifted from its optimized points and cause system performance degraded. Therefore, more efficient dynamic graph partition methods are needed. </p><p>In this work, we propose GraphSER, a dynamic graph partition method for many-core systems. In order to improve the cross-node spatial locality and reduce the overhead of repartition, we propose a stream-based edge repartition, in which each computing node sequentially traverses its local edge list in parallel, then migrating edges based on distance and replica degree. GraphSER does not need costly searching and prioritizes nodes so it can avoid poor cross-node spatial locality. </p><p>Our evaluation shows that compared to state-of-the-art edge repartition software methods, GraphSER has an average speedup 1.52x, with the maximum up to 2x. Compared to the previous many-core hardware repartition method, GraphSER performance has an average of 40% improvement, with the maximum to 117%.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"9 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140799625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}