Vidush Singhal, Laith Sakka, Kirshanthan Sundararajah, Ryan R. Newton, Milind Kulkarni
Many applications are designed to perform traversals on tree-like data structures. Fusing and parallelizing these traversals enhance the performance of applications. Fusing multiple traversals improves the locality of the application. The runtime of an application can be significantly reduced by extracting parallelism and utilizing multi-threading. Prior frameworks have tried to fuse and parallelize tree traversals using coarse-grained approaches, leading to missed fine-grained opportunities for improving performance. Other frameworks have successfully supported fine-grained fusion on heterogeneous tree types but fall short regarding parallelization. We introduce a new framework Orchard built on top of Grafter. Orchard’s novelty lies in allowing the programmer to transform tree traversal applications by automatically applying fine-grained fusion and extracting heterogeneous parallelism.Orchard allows the programmer to write general tree traversal applications in a simple and elegant embedded Domain-Specific Language (eDSL). We show that the combination of fine-grained fusion and heterogeneous parallelism performs better than each alone when the conditions are met.
{"title":"Orchard: Heterogeneous Parallelism and Fine-grained Fusion for Complex Tree Traversals","authors":"Vidush Singhal, Laith Sakka, Kirshanthan Sundararajah, Ryan R. Newton, Milind Kulkarni","doi":"10.1145/3652605","DOIUrl":"https://doi.org/10.1145/3652605","url":null,"abstract":"<p>Many applications are designed to perform traversals on <i>tree-like</i> data structures. Fusing and parallelizing these traversals enhance the performance of applications. Fusing multiple traversals improves the locality of the application. The runtime of an application can be significantly reduced by extracting parallelism and utilizing multi-threading. Prior frameworks have tried to fuse and parallelize tree traversals using coarse-grained approaches, leading to missed fine-grained opportunities for improving performance. Other frameworks have successfully supported fine-grained fusion on heterogeneous tree types but fall short regarding parallelization. We introduce a new framework <span>Orchard</span> built on top of <span>Grafter</span>. <span>Orchard</span>’s novelty lies in allowing the programmer to transform tree traversal applications by automatically applying <i>fine-grained</i> fusion and extracting <i>heterogeneous</i> parallelism.<span>Orchard</span> allows the programmer to write general tree traversal applications in a simple and elegant embedded Domain-Specific Language (eDSL). We show that the combination of fine-grained fusion and heterogeneous parallelism performs better than each alone when the conditions are met.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"7 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140151634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Many real-world networks are characterized by being temporal and dynamic, wherein the temporal information signifies the changes in connections, such as the addition or removal of links between nodes. Employing random walks on these temporal networks is a crucial technique for understanding the structural evolution of such graphs over time. However, existing state-of-the-art sampling methods are designed for traditional static graphs, and as such, they struggle to efficiently handle the dynamic aspects of temporal networks. This deficiency can be attributed to several challenges, including increased sampling complexity, extensive index space, limited programmability, and a lack of scalability.
In this paper, we introduce TEA+, a robust, fast, and scalable engine for conducting random walks on temporal graphs. Central to TEA+ is an innovative hybrid sampling method that amalgamates two Monte Carlo sampling techniques. This fusion significantly diminishes space complexity while maintaining a fast sampling speed. Additionally, TEA+ integrates a range of optimizations that significantly enhance sampling efficiency. This is further supported by an effective graph updating strategy, skilled in managing dynamic graph modifications and adeptly handling the insertion and deletion of both edges and vertices. For ease of implementation, we propose a temporal-centric programming model, designed to simplify the development of various random walk algorithms on temporal graphs. To ensure optimal performance across storage constraints, TEA+ features a degree-aware hybrid storage architecture, capable of adeptly scaling in different memory environments. Experimental results showcase the prowess of TEA+, as it attains up to three orders of magnitude speedups compared to current random walk engines on extensive temporal graphs.
{"title":"TEA+: A Novel Temporal Graph Random Walk Engine With Hybrid Storage Architecture","authors":"Chengying Huan, Yongchao Liu, Heng Zhang, Shuaiwen Song, Santosh Pandey, Shiyang Chen, Xiangfei Fang, Yue Jin, Baptiste Lepers, Yanjun Wu, Hang Liu","doi":"10.1145/3652604","DOIUrl":"https://doi.org/10.1145/3652604","url":null,"abstract":"<p>Many real-world networks are characterized by being temporal and dynamic, wherein the temporal information signifies the changes in connections, such as the addition or removal of links between nodes. Employing random walks on these temporal networks is a crucial technique for understanding the structural evolution of such graphs over time. However, existing state-of-the-art sampling methods are designed for traditional static graphs, and as such, they struggle to efficiently handle the dynamic aspects of temporal networks. This deficiency can be attributed to several challenges, including increased sampling complexity, extensive index space, limited programmability, and a lack of scalability. </p><p>In this paper, we introduce <i>TEA+</i>, a robust, fast, and scalable engine for conducting random walks on temporal graphs. Central to <i>TEA+</i> is an innovative hybrid sampling method that amalgamates two Monte Carlo sampling techniques. This fusion significantly diminishes space complexity while maintaining a fast sampling speed. Additionally, <i>TEA+</i> integrates a range of optimizations that significantly enhance sampling efficiency. This is further supported by an effective graph updating strategy, skilled in managing dynamic graph modifications and adeptly handling the insertion and deletion of both edges and vertices. For ease of implementation, we propose a temporal-centric programming model, designed to simplify the development of various random walk algorithms on temporal graphs. To ensure optimal performance across storage constraints, <i>TEA+</i> features a degree-aware hybrid storage architecture, capable of adeptly scaling in different memory environments. Experimental results showcase the prowess of <i>TEA+</i>, as it attains up to three orders of magnitude speedups compared to current random walk engines on extensive temporal graphs.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"48 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140151637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Siddhartha Raman Sundara Raman, Lizy John, Jaydeep P. Kulkarni
Graph neural networks (GNN) are of great interest in real-life applications such as citation networks, drug discovery owing to GNN’s ability to apply machine learning techniques on graphs. GNNs utilize a two-step approach to classify the nodes in a graph into pre-defined categories. The first step uses a combination kernel to perform data-intensive convolution operations with regular memory access patterns. The second step uses an aggregation kernel that operates on sparse data having irregular access patterns. These mixed data patterns render CPU/GPU based compute energy-inefficient. Von-Neumann-based accelerators like AWB-GCN [7] suffer from increased data movement, as the data-intensive combination requires large data movement to/from memory to perform computations. ReFLIP [8] performs Resistive Random Access memory-based in-memory (PIM) compute to overcome data movement costs. However, ReFLIP suffers from increased area requirement due to dedicated accelerator arrangement, reduced performance due to limited parallelism and energy due to fundamental issues in ReRAM-based compute. This paper presents a scalable (non-exponential storage requirement), DAC/ADC-less PIM-based combination, with (i) early compute termination, (ii) pre-compute by reconfiguring SOC components. Graph and sparsity-aware near-memory aggregation using the proposed compute-as-soon-as-ready (CAR), broadcast approach improves performance and energy further. NEM-GNN achieves ∼ 80-230x, ∼ 80-300x, ∼ 850-1134x, and ∼ 7-8x improvement over ReFLIP, in terms of performance, throughput, energy efficiency and compute density.
{"title":"NEM-GNN - DAC/ADC-less, scalable, reconfigurable, graph and sparsity-aware near-memory accelerator for graph neural networks","authors":"Siddhartha Raman Sundara Raman, Lizy John, Jaydeep P. Kulkarni","doi":"10.1145/3652607","DOIUrl":"https://doi.org/10.1145/3652607","url":null,"abstract":"<p>Graph neural networks (GNN) are of great interest in real-life applications such as citation networks, drug discovery owing to GNN’s ability to apply machine learning techniques on graphs. GNNs utilize a two-step approach to classify the nodes in a graph into pre-defined categories. The first step uses a combination kernel to perform data-intensive convolution operations with regular memory access patterns. The second step uses an aggregation kernel that operates on sparse data having irregular access patterns. These mixed data patterns render CPU/GPU based compute energy-inefficient. Von-Neumann-based accelerators like AWB-GCN [7] suffer from increased data movement, as the data-intensive combination requires large data movement to/from memory to perform computations. ReFLIP [8] performs Resistive Random Access memory-based in-memory (PIM) compute to overcome data movement costs. However, ReFLIP suffers from increased area requirement due to dedicated accelerator arrangement, reduced performance due to limited parallelism and energy due to fundamental issues in ReRAM-based compute. This paper presents a scalable (non-exponential storage requirement), DAC/ADC-less PIM-based combination, with (i) early compute termination, (ii) pre-compute by reconfiguring SOC components. Graph and sparsity-aware near-memory aggregation using the proposed compute-as-soon-as-ready (CAR), broadcast approach improves performance and energy further. NEM-GNN achieves ∼ 80-230x, ∼ 80-300x, ∼ 850-1134x, and ∼ 7-8x improvement over ReFLIP, in terms of performance, throughput, energy efficiency and compute density.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"23 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140125670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yan Chen, Qiwen Ke, Huiba Li, Yongwei Wu, Yiming Zhang
Object storage has been widely used in the cloud. Traditionally, the size of object metadata is much smaller than that of object data, and thus existing object storage systems (like Ceph and Oasis) can place object data and metadata respectively on hard disk drives (HDDs) and solid-state drives (SSDs) to achieve high I/O performance at a low monetary cost. Currently, however, a wide range of cloud applications organize their data as large numbers of small objects of which the data size is close to (or even smaller than) the metadata size, thus greatly increasing the cost if placing all metadata on expensive SSDs.
This paper presents xMeta, an SSD-HDD-hybrid optimization for metadata maintenance of cloud-scale object storage. We observed that a substantial portion of the metadata of small objects is rarely accessed and thus can be stored on HDDs with little performance penalty. Therefore, xMeta first classifies the hot and cold metadata based on the frequency of metadata accesses of upper-layer applications, and then adaptively stores the hot metadata on SSDs and the cold metadata on HDDs. We also propose a merging mechanism for hot metadata to further improve the efficiency of SSD storage, and optimize range key query and insertion for hot metadata by designing composite keys. We have integrated the xMeta metadata service with Ceph to realize a high-performance, low-cost object store (called xCeph). The extensive evaluation shows that xCeph outperforms the original Ceph by an order of magnitude in the space requirement of SSD storage, while improving the throughput by up to 2.7 ×.
{"title":"xMeta: SSD-HDD-Hybrid Optimization for Metadata Maintenance of Cloud-Scale Object Storage","authors":"Yan Chen, Qiwen Ke, Huiba Li, Yongwei Wu, Yiming Zhang","doi":"10.1145/3652606","DOIUrl":"https://doi.org/10.1145/3652606","url":null,"abstract":"<p>Object storage has been widely used in the cloud. Traditionally, the size of object metadata is much smaller than that of object data, and thus existing object storage systems (like Ceph and Oasis) can place object data and metadata respectively on hard disk drives (HDDs) and solid-state drives (SSDs) to achieve high I/O performance at a low monetary cost. Currently, however, a wide range of cloud applications organize their data as large numbers of small objects of which the data size is close to (or even smaller than) the metadata size, thus greatly increasing the cost if placing all metadata on expensive SSDs. </p><p>This paper presents x<span>Meta</span>, an SSD-HDD-hybrid optimization for metadata maintenance of cloud-scale object storage. We observed that a substantial portion of the metadata of small objects is rarely accessed and thus can be stored on HDDs with little performance penalty. Therefore, x<span>Meta</span> first classifies the <i>hot</i> and <i>cold</i> metadata based on the frequency of metadata accesses of upper-layer applications, and then adaptively stores the hot metadata on SSDs and the cold metadata on HDDs. We also propose a merging mechanism for hot metadata to further improve the efficiency of SSD storage, and optimize range key query and insertion for hot metadata by designing composite keys. We have integrated the x<span>Meta</span> metadata service with Ceph to realize a high-performance, low-cost object store (called xCeph). The extensive evaluation shows that xCeph outperforms the original Ceph by an order of magnitude in the space requirement of SSD storage, while improving the throughput by up to 2.7 ×.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"74 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140125669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Michael Canesche, Vanderson M. Rosario, Edson Borin, Fernando Magno Quintão Pereira
Kernel scheduling is the problem of finding the most efficient implementation for a computational kernel. Identifying this implementation involves experimenting with the parameters of compiler optimizations, such as the size of tiling windows and unrolling factors. This paper shows that it is possible to organize these parameters as points in a coordinate space. The function that maps these points to the running time of kernels, in general, will not determine a convex surface. However, this paper provides empirical evidence that the origin of this surface—an unoptimized kernel—and its global optimum—the fastest kernel—reside on a convex region. We call this hypothesis the “droplet expectation”. Consequently, a search method based on the coordinate descent algorithm tends to find the optimal kernel configuration quickly if the hypothesis holds. This approach—called Droplet Search—has been available in Apache TVM since April of 2023. Experimental results with six large deep learning models on various computing devices (ARM, Intel, AMD, and NVIDIA) indicate that Droplet Search is not only as effective as other AutoTVM search techniques but also two to ten times faster. Moreover, models generated by Droplet Search are competitive with those produced by TVM’s AutoScheduler (Ansor), despite the latter using four to five times more code transformations than AutoTVM.
{"title":"The Droplet Search Algorithm for Kernel Scheduling","authors":"Michael Canesche, Vanderson M. Rosario, Edson Borin, Fernando Magno Quintão Pereira","doi":"10.1145/3650109","DOIUrl":"https://doi.org/10.1145/3650109","url":null,"abstract":"<p>Kernel scheduling is the problem of finding the most efficient implementation for a computational kernel. Identifying this implementation involves experimenting with the parameters of compiler optimizations, such as the size of tiling windows and unrolling factors. This paper shows that it is possible to organize these parameters as points in a coordinate space. The function that maps these points to the running time of kernels, in general, will not determine a convex surface. However, this paper provides empirical evidence that the origin of this surface—an unoptimized kernel—and its global optimum—the fastest kernel—reside on a convex region. We call this hypothesis the “droplet expectation”. Consequently, a search method based on the coordinate descent algorithm tends to find the optimal kernel configuration quickly if the hypothesis holds. This approach—called Droplet Search—has been available in Apache TVM since April of 2023. Experimental results with six large deep learning models on various computing devices (ARM, Intel, AMD, and NVIDIA) indicate that Droplet Search is not only as effective as other AutoTVM search techniques but also two to ten times faster. Moreover, models generated by Droplet Search are competitive with those produced by TVM’s AutoScheduler (Ansor), despite the latter using four to five times more code transformations than AutoTVM.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"177 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-02-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140018782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Asmita Pal, Keerthana Desai, Rahul Chatterjee, Joshua San Miguel
Trace-based simulation is a widely used methodology for system design exploration. It relies on realistic traces that represent a range of behaviors necessary to be evaluated, containing a lot of information about the application, its inputs and the underlying system on which it was generated. Consequently, generating traces from real-world executions risk leakage of sensitive information. To prevent this, traces can be obfuscated before release. However, this can undermine their ideal utility, i.e., how realistically a program behavior was captured. To address this, we propose Camouflage, a novel obfuscation framework, designed with awareness of the necessary architectural properties required to preserve trace utility, while ensuring secrecy of the inputs used to generate the trace. Focusing on memory access traces, our extensive evaluation on various benchmarks shows that camouflaged traces preserve the performance measurements of the original execution, with an average τ correlation of 0.66. We model input secrecy as an input indistinguishability problem and show that the average security loss is 7.8%, which is better than traces generated from the state-of-the-art.
{"title":"Camouflage: Utility-Aware Obfuscation for Accurate Simulation of Sensitive Program Traces","authors":"Asmita Pal, Keerthana Desai, Rahul Chatterjee, Joshua San Miguel","doi":"10.1145/3650110","DOIUrl":"https://doi.org/10.1145/3650110","url":null,"abstract":"<p>Trace-based simulation is a widely used methodology for system design exploration. It relies on realistic traces that represent a range of behaviors necessary to be evaluated, containing a lot of information about the application, its inputs and the underlying system on which it was generated. Consequently, generating traces from real-world executions risk leakage of sensitive information. To prevent this, traces can be obfuscated before release. However, this can undermine their ideal utility, i.e., how realistically a program behavior was captured. To address this, we propose Camouflage, a novel obfuscation framework, designed with awareness of the necessary architectural properties required to preserve <i>trace utility</i>, while ensuring secrecy of the inputs used to generate the trace. Focusing on memory access traces, our extensive evaluation on various benchmarks shows that camouflaged traces preserve the performance measurements of the original execution, with an average <i>τ</i> correlation of 0.66. We model input secrecy as an input indistinguishability problem and show that the average security loss is 7.8%, which is better than traces generated from the state-of-the-art.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"25 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-02-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140018720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DRAM memory is a performance bottleneck for many applications, due to its high access latency. Previous work has mainly focused on data locality, introducing small-but-fast regions to cache frequently accessed data, thereby reducing the average latency. However, these locality-based designs have three challenges in modern multi-core systems: 1) Inter-application interference leads to random memory access traffic. 2) Fairness issues prevent the memory controller from over-prioritizing data locality. 3) Write-intensive applications have much lower locality and evict substantial dirty entries. With frequent data movement between the fast in-DRAM cache and slow regular arrays, the overhead induced by moving data may even offset the performance and energy benefits of in-DRAM caching.
In this paper, we decouple the data movement process into two distinct phases. The first phase is Load-Reduced Destructive Activation (LRDA), which destructively promotes data into the in-DRAM cache. The second phase is Delayed Cycle-Stealing Restoration (DCSR), which restores the original data when DRAM bank is idle. LRDA decouples the most time-consuming restoration phase from activation, and DCSR hides the restoration latency through prevalent bank-level parallelism. We propose FASA-DRAM incorporating destructive activation and delayed restoration techniques to enable both in-DRAM caching and proactive latency-hiding mechanisms. Our evaluation shows that FASA-DRAM improves the average performance by 19.9% and reduces average DRAM energy consumption by 18.1% over DDR4 DRAM for four-core workloads, with less than 3.4% extra area overhead. Furthermore, FASA-DRAM outperforms state-of-the-art designs in both performance and energy efficiency.
DRAM 存储器因其较高的访问延迟而成为许多应用的性能瓶颈。以前的工作主要集中在数据局部性上,即引入小而快的区域来缓存频繁访问的数据,从而降低平均延迟。然而,这些基于定位的设计在现代多核系统中面临三个挑战:1) 应用程序间的干扰导致随机内存访问流量。2) 公平性问题导致内存控制器无法过度优先考虑数据局部性。3) 写入密集型应用的位置性更低,会驱逐大量脏条目。在快速内存缓存和慢速常规阵列之间频繁移动数据时,移动数据造成的开销甚至可能抵消内存缓存的性能和能耗优势。在本文中,我们将数据移动过程解耦为两个不同的阶段。第一个阶段是负载降低的破坏性激活(LRDA),它以破坏性方式将数据推进到内存缓存中。第二个阶段是延迟周期窃取恢复(DCSR),在 DRAM 存储体空闲时恢复原始数据。LRDA 将最耗时的还原阶段与激活解耦,而 DCSR 则通过普遍存在的库级并行性隐藏了还原延迟。我们提出的 FASA-DRAM 融合了破坏性激活和延迟还原技术,实现了内存缓存和主动延迟隐藏机制。我们的评估结果表明,与 DDR4 DRAM 相比,FASA-DRAM 在四核工作负载中的平均性能提高了 19.9%,平均 DRAM 能耗降低了 18.1%,而额外的面积开销不到 3.4%。此外,FASA-DRAM 在性能和能效方面都优于最先进的设计。
{"title":"FASA-DRAM: Reducing DRAM Latency with Destructive Activation and Delayed Restoration","authors":"Haitao Du, Yuhan Qin, Song Chen, Yi Kang","doi":"10.1145/3649135","DOIUrl":"https://doi.org/10.1145/3649135","url":null,"abstract":"<p>DRAM memory is a performance bottleneck for many applications, due to its high access latency. Previous work has mainly focused on data locality, introducing small-but-fast regions to cache frequently accessed data, thereby reducing the average latency. However, these locality-based designs have three challenges in modern multi-core systems: 1) Inter-application interference leads to random memory access traffic. 2) Fairness issues prevent the memory controller from over-prioritizing data locality. 3) Write-intensive applications have much lower locality and evict substantial dirty entries. With frequent data movement between the fast in-DRAM cache and slow regular arrays, the overhead induced by moving data may even offset the performance and energy benefits of in-DRAM caching. </p><p>In this paper, we decouple the data movement process into two distinct phases. The first phase is Load-Reduced Destructive Activation (LRDA), which destructively promotes data into the in-DRAM cache. The second phase is Delayed Cycle-Stealing Restoration (DCSR), which restores the original data when DRAM bank is idle. LRDA decouples the most time-consuming restoration phase from activation, and DCSR hides the restoration latency through prevalent bank-level parallelism. We propose FASA-DRAM incorporating destructive activation and delayed restoration techniques to enable both in-DRAM caching and proactive latency-hiding mechanisms. Our evaluation shows that FASA-DRAM improves the average performance by 19.9% and reduces average DRAM energy consumption by 18.1% over DDR4 DRAM for four-core workloads, with less than 3.4% extra area overhead. Furthermore, FASA-DRAM outperforms state-of-the-art designs in both performance and energy efficiency.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"60 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-02-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139948746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Panagiotis Miliadis, Dimitris Theodoropoulos, Dionisios N. Pnevmatikatos, Nectarios Koziris
FPGAs are increasingly popular in cloud environments for their ability to offer on-demand acceleration and improved compute efficiency. Providers would like to increase utilization, by multiplexing customers on a single device, similar to how processing cores and memory are shared. Nonetheless, multi-tenancy still faces major architectural limitations including: a) inefficient sharing of memory interfaces across hardware tasks exacerbated by technological limitations and peculiarities, b) insufficient solutions for performance and data isolation and high quality of service, c) absent or simplistic allocation strategies to effectively distribute external FPGA memory across hardware tasks. This paper presents a full-stack solution for enabling multi-tenancy on FPGAs. Specifically, our work proposes an intra-fpga virtualization layer to share FPGA interfaces and its resources across tenants. To achieve efficient inter-connectivity between virtual FPGAs (vFGPAs) and external interfaces, we employ a compact network-on-chip architecture to optimize resource utilization. Dedicated memory management units implement the concept of virtual memory in FPGAs, providing mechanisms to isolate the address space and enable memory protection. We also introduce a memory segmentation scheme to effectively allocate FPGA address space and enhance isolation through hardware-software support, while preserving the efficacy of memory transactions. We assess our solution on an Alveo U250 Data Center FPGA Card, employing ten real-world benchmarks from the Rodinia and Rosetta suites. Our framework preserves the performance of hardware tasks from a non-virtualized environment, while enhancing the device aggregate throughput through resource sharing; up to 3.96x in isolated and up to 2.31x in highly congested settings, where an external interface is shared across four vFPGAs. Finally, our work ensures high-quality of service, with hardware tasks achieving up to 0.95x of their native performance, even when resource sharing introduces interference from other accelerators.
{"title":"Architectural support for sharing, isolating and virtualizing FPGA resources.","authors":"Panagiotis Miliadis, Dimitris Theodoropoulos, Dionisios N. Pnevmatikatos, Nectarios Koziris","doi":"10.1145/3648475","DOIUrl":"https://doi.org/10.1145/3648475","url":null,"abstract":"<p>FPGAs are increasingly popular in cloud environments for their ability to offer on-demand acceleration and improved compute efficiency. Providers would like to increase utilization, by multiplexing customers on a single device, similar to how processing cores and memory are shared. Nonetheless, multi-tenancy still faces major architectural limitations including: a) inefficient sharing of memory interfaces across hardware tasks exacerbated by technological limitations and peculiarities, b) insufficient solutions for performance and data isolation and high quality of service, c) absent or simplistic allocation strategies to effectively distribute external FPGA memory across hardware tasks. This paper presents a full-stack solution for enabling multi-tenancy on FPGAs. Specifically, our work proposes an intra-fpga virtualization layer to share FPGA interfaces and its resources across tenants. To achieve efficient inter-connectivity between virtual FPGAs (vFGPAs) and external interfaces, we employ a compact network-on-chip architecture to optimize resource utilization. Dedicated memory management units implement the concept of virtual memory in FPGAs, providing mechanisms to isolate the address space and enable memory protection. We also introduce a memory segmentation scheme to effectively allocate FPGA address space and enhance isolation through hardware-software support, while preserving the efficacy of memory transactions. We assess our solution on an Alveo U250 Data Center FPGA Card, employing ten real-world benchmarks from the Rodinia and Rosetta suites. Our framework preserves the performance of hardware tasks from a non-virtualized environment, while enhancing the device aggregate throughput through resource sharing; up to 3.96x in isolated and up to 2.31x in highly congested settings, where an external interface is shared across four vFPGAs. Finally, our work ensures high-quality of service, with hardware tasks achieving up to 0.95x of their native performance, even when resource sharing introduces interference from other accelerators.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"26 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139768152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ke Liu, Kan Wu, Hua Wang, Ke Zhou, Peng Wang, Ji Zhang, Cong Li
“Learned” admission policies have shown promise in improving Content Delivery Network (CDN) cache performance and lowering operational costs. Unfortunately, existing learned policies are optimized with a few fixed cache sizes while in reality, cache sizes often vary over time in an unpredictable manner. As a result, existing solutions cannot provide consistent benefits in production settings.
We present SLAP, a learned CDN cache admission approach based on segmented object reuse time prediction. SLAP predicts an object’s reuse time range using the Long-Short-Term-Memory model and admits objects that will be reused (before eviction) given the current cache size. SLAP decouples model training from cache size, allowing it to adapt to arbitrary sizes. The key to our solution is a novel segmented labeling scheme that makes SLAP without requiring precise prediction on object reuse time. To further make SLAP a practical and efficient solution, we propose aggressive reusing of computation and training on sampled traces to optimize model training, and a specialized predictor architecture that overlaps prediction computation with miss object fetching to optimize model inference. Our experiments using production CDN traces show that SLAP achieves significantly lower write traffic (38%-59%), longer SSDs lifetime (104%-178%), a consistently higher hit rate (3.2%-11.7%), and requires no effort to adapt to changing cache sizes, outperforming existing policies.
{"title":"SLAP: Segmented Reuse-Time-Label Based Admission Policy for Content Delivery Network Caching","authors":"Ke Liu, Kan Wu, Hua Wang, Ke Zhou, Peng Wang, Ji Zhang, Cong Li","doi":"10.1145/3646550","DOIUrl":"https://doi.org/10.1145/3646550","url":null,"abstract":"<p>“Learned” admission policies have shown promise in improving Content Delivery Network (CDN) cache performance and lowering operational costs. Unfortunately, existing learned policies are optimized with a few fixed cache sizes while in reality, cache sizes often vary over time in an unpredictable manner. As a result, existing solutions cannot provide consistent benefits in production settings. </p><p>We present <i>SLAP</i>, a learned CDN cache admission approach based on segmented object reuse time prediction. <i>SLAP</i> predicts an object’s reuse time range using the Long-Short-Term-Memory model and admits objects that will be reused (before eviction) given the current cache size. <i>SLAP</i> decouples model training from cache size, allowing it to adapt to arbitrary sizes. The key to our solution is a novel segmented labeling scheme that makes <i>SLAP</i> without requiring precise prediction on object reuse time. To further make <i>SLAP</i> a practical and efficient solution, we propose aggressive reusing of computation and training on sampled traces to optimize model training, and a specialized predictor architecture that overlaps prediction computation with miss object fetching to optimize model inference. Our experiments using production CDN traces show that SLAP achieves significantly lower write traffic (38%-59%), longer SSDs lifetime (104%-178%), a consistently higher hit rate (3.2%-11.7%), and requires no effort to adapt to changing cache sizes, outperforming existing policies.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"80 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-02-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139768148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kunpeng Xie, Ye Lu, Xinyu He, Dezhi Yi, Huijuan Dong, Yao Chen
Convolutional Neural Networks (CNNs) can benefit from the computational reductions provided by the Winograd minimal filtering algorithm and weight pruning. However, harnessing the potential of both methods simultaneously introduces complexity in designing pruning algorithms and accelerators. Prior studies aimed to establish regular sparsity patterns in the Winograd domain, but they were primarily suited for small tiles, with domain transformation dictating the sparsity ratio. The irregularities in data access and domain transformation pose challenges in accelerator design, especially for larger Winograd tiles. This paper introduces ”Winols,” an innovative algorithm-hardware co-design strategy that emphasizes the strengths of the large-tiling Winograd algorithm. Through a spatial-to-Winograd relevance degree evaluation, we extensively explore domain transformation and propose a cross-domain pruning technique that retains sparsity across both spatial and Winograd domains. To compress pruned weight matrices, we invent a relative column encoding scheme. We further design an FPGA-based accelerator for CNN models with large Winograd tiles and sparse matrix-vector operations. Evaluations indicate our pruning method achieves up to 80% weight tile sparsity in the Winograd domain without compromising accuracy. Our Winols accelerator outperforms dense accelerator by a factor of 31.7 × in inference latency. When compared with prevailing sparse Winograd accelerators, Winols reduces latency by an average of 10.9 ×, and improves DSP and energy efficiencies by over 5.6 × and 5.7 ×, respectively. When compared with the CPU and GPU platform, Winols accelerator with tile size 8 × 8 achieves 24.6 × and 2.84 × energy efficiency improvements, respectively.
{"title":"Winols: A Large-Tiling Sparse Winograd CNN Accelerator on FPGAs","authors":"Kunpeng Xie, Ye Lu, Xinyu He, Dezhi Yi, Huijuan Dong, Yao Chen","doi":"10.1145/3643682","DOIUrl":"https://doi.org/10.1145/3643682","url":null,"abstract":"<p>Convolutional Neural Networks (CNNs) can benefit from the computational reductions provided by the Winograd minimal filtering algorithm and weight pruning. However, harnessing the potential of both methods simultaneously introduces complexity in designing pruning algorithms and accelerators. Prior studies aimed to establish regular sparsity patterns in the Winograd domain, but they were primarily suited for small tiles, with domain transformation dictating the sparsity ratio. The irregularities in data access and domain transformation pose challenges in accelerator design, especially for larger Winograd tiles. This paper introduces ”Winols,” an innovative algorithm-hardware co-design strategy that emphasizes the strengths of the large-tiling Winograd algorithm. Through a spatial-to-Winograd relevance degree evaluation, we extensively explore domain transformation and propose a cross-domain pruning technique that retains sparsity across both spatial and Winograd domains. To compress pruned weight matrices, we invent a relative column encoding scheme. We further design an FPGA-based accelerator for CNN models with large Winograd tiles and sparse matrix-vector operations. Evaluations indicate our pruning method achieves up to 80% weight tile sparsity in the Winograd domain without compromising accuracy. Our Winols accelerator outperforms dense accelerator by a factor of 31.7 × in inference latency. When compared with prevailing sparse Winograd accelerators, Winols reduces latency by an average of 10.9 ×, and improves DSP and energy efficiencies by over 5.6 × and 5.7 ×, respectively. When compared with the CPU and GPU platform, Winols accelerator with tile size 8 × 8 achieves 24.6 × and 2.84 × energy efficiency improvements, respectively.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"23 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139658538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}