
ACM Transactions on Architecture and Code Optimization: Latest Publications

Orchard: Heterogeneous Parallelism and Fine-grained Fusion for Complex Tree Traversals
IF 1.6 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-03-15 | DOI: 10.1145/3652605
Vidush Singhal, Laith Sakka, Kirshanthan Sundararajah, Ryan R. Newton, Milind Kulkarni

Many applications are designed to perform traversals on tree-like data structures. Fusing and parallelizing these traversals enhances application performance: fusing multiple traversals improves locality, while extracting parallelism and utilizing multi-threading can significantly reduce runtime. Prior frameworks have tried to fuse and parallelize tree traversals using coarse-grained approaches, missing fine-grained opportunities for improving performance. Other frameworks have successfully supported fine-grained fusion on heterogeneous tree types but fall short regarding parallelization. We introduce Orchard, a new framework built on top of Grafter. Orchard’s novelty lies in allowing the programmer to transform tree traversal applications by automatically applying fine-grained fusion and extracting heterogeneous parallelism. Orchard allows the programmer to write general tree traversal applications in a simple and elegant embedded Domain-Specific Language (eDSL). We show that the combination of fine-grained fusion and heterogeneous parallelism performs better than each alone when the conditions are met.
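
To make the idea of traversal fusion concrete, here is a minimal Python sketch of fusing two independent passes over a tree into a single walk. The class and function names are hypothetical illustrations, not Orchard's eDSL, which performs this kind of transformation automatically.

```python
# Hypothetical sketch: fusing two tree traversals into one pass.
# Class and function names are illustrative, not Orchard's actual eDSL.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    value: float
    children: List["Node"] = field(default_factory=list)
    scaled: float = 0.0

def scale_pass(node: Node, factor: float) -> None:
    """First traversal: scale every value."""
    node.scaled = node.value * factor
    for child in node.children:
        scale_pass(child, factor)

def sum_pass(node: Node) -> float:
    """Second traversal: sum the scaled values."""
    return node.scaled + sum(sum_pass(c) for c in node.children)

def fused_pass(node: Node, factor: float) -> float:
    """Fused traversal: one walk does the work of both passes,
    improving locality by touching each node only once."""
    node.scaled = node.value * factor
    return node.scaled + sum(fused_pass(c, factor) for c in node.children)

if __name__ == "__main__":
    root = Node(1.0, [Node(2.0), Node(3.0, [Node(4.0)])])
    scale_pass(root, 2.0)
    assert sum_pass(root) == fused_pass(root, 2.0)  # both yield 20.0
```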

Citations: 0
TEA+: A Novel Temporal Graph Random Walk Engine With Hybrid Storage Architecture
IF 1.6 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-03-14 | DOI: 10.1145/3652604
Chengying Huan, Yongchao Liu, Heng Zhang, Shuaiwen Song, Santosh Pandey, Shiyang Chen, Xiangfei Fang, Yue Jin, Baptiste Lepers, Yanjun Wu, Hang Liu

Many real-world networks are characterized by being temporal and dynamic, wherein the temporal information signifies the changes in connections, such as the addition or removal of links between nodes. Employing random walks on these temporal networks is a crucial technique for understanding the structural evolution of such graphs over time. However, existing state-of-the-art sampling methods are designed for traditional static graphs, and as such, they struggle to efficiently handle the dynamic aspects of temporal networks. This deficiency can be attributed to several challenges, including increased sampling complexity, extensive index space, limited programmability, and a lack of scalability.

In this paper, we introduce TEA+, a robust, fast, and scalable engine for conducting random walks on temporal graphs. Central to TEA+ is an innovative hybrid sampling method that amalgamates two Monte Carlo sampling techniques. This fusion significantly reduces space complexity while maintaining a fast sampling speed. Additionally, TEA+ integrates a range of optimizations that significantly enhance sampling efficiency, supported by an effective graph updating strategy that manages dynamic graph modifications and handles the insertion and deletion of both edges and vertices. For ease of implementation, we propose a temporal-centric programming model designed to simplify the development of various random walk algorithms on temporal graphs. To ensure optimal performance under storage constraints, TEA+ features a degree-aware hybrid storage architecture capable of scaling across different memory environments. Experimental results showcase the strength of TEA+: it attains speedups of up to three orders of magnitude over current random walk engines on large temporal graphs.
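
For readers unfamiliar with the primitive being accelerated, the sketch below shows a basic temporal random walk in Python, where every hop must follow an edge whose timestamp is no earlier than that of the previous hop. It illustrates the workload only; TEA+'s hybrid Monte Carlo sampler and storage architecture are far more involved.

```python
import random
from collections import defaultdict

# Hypothetical sketch of a temporal random walk: each hop must use an
# edge timestamped no earlier than the edge taken on the previous hop.
# This is the primitive TEA+ accelerates, not its hybrid sampler.

def temporal_walk(edges, start, length, seed=0):
    """edges: list of (src, dst, timestamp) tuples."""
    rng = random.Random(seed)
    adj = defaultdict(list)                 # src -> [(dst, timestamp), ...]
    for src, dst, ts in edges:
        adj[src].append((dst, ts))

    walk, node, last_ts = [start], start, float("-inf")
    for _ in range(length):
        candidates = [(d, t) for d, t in adj[node] if t >= last_ts]
        if not candidates:                  # no temporally valid edge: stop
            break
        node, last_ts = rng.choice(candidates)
        walk.append(node)
    return walk

if __name__ == "__main__":
    g = [(0, 1, 1), (1, 2, 2), (1, 3, 5), (2, 0, 3), (3, 1, 4)]
    print(temporal_walk(g, start=0, length=4))
```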

Citations: 0
NEM-GNN - DAC/ADC-less, scalable, reconfigurable, graph and sparsity-aware near-memory accelerator for graph neural networks
IF 1.6 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-03-14 | DOI: 10.1145/3652607
Siddhartha Raman Sundara Raman, Lizy John, Jaydeep P. Kulkarni

Graph neural networks (GNNs) are of great interest in real-life applications such as citation networks and drug discovery, owing to their ability to apply machine learning techniques to graphs. GNNs use a two-step approach to classify the nodes in a graph into pre-defined categories. The first step uses a combination kernel to perform data-intensive convolution operations with regular memory access patterns. The second step uses an aggregation kernel that operates on sparse data with irregular access patterns. These mixed data patterns make CPU/GPU-based compute energy-inefficient. Von Neumann-based accelerators like AWB-GCN [7] suffer from increased data movement, as the data-intensive combination step requires moving large amounts of data to and from memory to perform computations. ReFLIP [8] performs resistive random-access memory (ReRAM)-based processing-in-memory (PIM) compute to overcome data movement costs. However, ReFLIP suffers from an increased area requirement due to its dedicated accelerator arrangement, reduced performance due to limited parallelism, and increased energy due to fundamental issues in ReRAM-based compute. This paper presents a scalable (non-exponential storage requirement), DAC/ADC-less PIM-based combination step, with (i) early compute termination and (ii) pre-computation by reconfiguring SoC components. Graph- and sparsity-aware near-memory aggregation using the proposed compute-as-soon-as-ready (CAR) broadcast approach improves performance and energy further. NEM-GNN achieves ∼80-230x, ∼80-300x, ∼850-1134x, and ∼7-8x improvements over ReFLIP in terms of performance, throughput, energy efficiency, and compute density, respectively.
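
As background on the two kernels the abstract distinguishes, the following NumPy sketch shows a dense combination step and a sparse, irregular aggregation step for a toy graph. The names and shapes are illustrative only and are unrelated to NEM-GNN's near-memory implementation.

```python
import numpy as np

# Minimal sketch of the two GNN kernels discussed above.
# Combination is a dense, regular matrix multiply; aggregation gathers
# neighbor features through a sparse adjacency structure with irregular,
# data-dependent access patterns.

def combination(features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Dense kernel: transform every node's feature vector."""
    return features @ weights

def aggregation(adjacency: list, transformed: np.ndarray) -> np.ndarray:
    """Sparse kernel: sum each node's transformed neighbor features."""
    out = np.zeros_like(transformed)
    for node, neighbors in enumerate(adjacency):
        for nbr in neighbors:               # irregular memory accesses
            out[node] += transformed[nbr]
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((4, 8))         # 4 nodes, 8 input features
    W = rng.standard_normal((8, 3))         # 3 output features
    adj = [[1], [0, 2], [1, 3], [2]]        # a small path graph
    H = aggregation(adj, combination(X, W))
    print(H.shape)                          # (4, 3)
```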

Citations: 0
xMeta: SSD-HDD-Hybrid Optimization for Metadata Maintenance of Cloud-Scale Object Storage
IF 1.6 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-03-13 | DOI: 10.1145/3652606
Yan Chen, Qiwen Ke, Huiba Li, Yongwei Wu, Yiming Zhang

Object storage has been widely used in the cloud. Traditionally, the size of object metadata is much smaller than that of object data, so existing object storage systems (like Ceph and Oasis) can place object data on hard disk drives (HDDs) and object metadata on solid-state drives (SSDs) to achieve high I/O performance at a low monetary cost. Currently, however, a wide range of cloud applications organize their data as large numbers of small objects whose data size is close to (or even smaller than) the metadata size, greatly increasing the cost of placing all metadata on expensive SSDs.

This paper presents xMeta, an SSD-HDD-hybrid optimization for metadata maintenance of cloud-scale object storage. We observed that a substantial portion of the metadata of small objects is rarely accessed and can therefore be stored on HDDs with little performance penalty. xMeta first classifies hot and cold metadata based on the frequency of metadata accesses from upper-layer applications, and then adaptively stores the hot metadata on SSDs and the cold metadata on HDDs. We also propose a merging mechanism for hot metadata to further improve the efficiency of SSD storage, and optimize range key queries and insertions for hot metadata by designing composite keys. We have integrated the xMeta metadata service with Ceph to realize a high-performance, low-cost object store (called xCeph). Extensive evaluation shows that xCeph reduces the SSD space requirement by an order of magnitude compared to the original Ceph, while improving throughput by up to 2.7×.
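
A toy Python sketch of the hot/cold split described above follows; the fixed promotion threshold is an illustrative stand-in for xMeta's frequency-driven, adaptive classification.

```python
# Illustrative sketch of frequency-based hot/cold metadata placement.
# The fixed threshold below is a stand-in for xMeta's adaptive policy.
from collections import Counter

class MetadataPlacer:
    def __init__(self, hot_threshold: int = 3):
        self.access_counts = Counter()
        self.hot_threshold = hot_threshold
        self.ssd_tier = {}   # hot metadata
        self.hdd_tier = {}   # cold metadata

    def put(self, key: str, meta: dict) -> None:
        tier = self.ssd_tier if self._is_hot(key) else self.hdd_tier
        tier[key] = meta

    def record_access(self, key: str) -> None:
        self.access_counts[key] += 1
        self._maybe_promote(key)

    def _is_hot(self, key: str) -> bool:
        return self.access_counts[key] >= self.hot_threshold

    def _maybe_promote(self, key: str) -> None:
        if self._is_hot(key) and key in self.hdd_tier:
            self.ssd_tier[key] = self.hdd_tier.pop(key)

if __name__ == "__main__":
    placer = MetadataPlacer()
    placer.put("obj-1", {"size": 512})      # starts cold, lands on the HDD tier
    for _ in range(3):
        placer.record_access("obj-1")       # becomes hot, promoted to the SSD tier
    print("obj-1" in placer.ssd_tier)       # True
```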

Citations: 0
The Droplet Search Algorithm for Kernel Scheduling
IF 1.6 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-02-29 | DOI: 10.1145/3650109
Michael Canesche, Vanderson M. Rosario, Edson Borin, Fernando Magno Quintão Pereira

Kernel scheduling is the problem of finding the most efficient implementation for a computational kernel. Identifying this implementation involves experimenting with the parameters of compiler optimizations, such as the size of tiling windows and unrolling factors. This paper shows that it is possible to organize these parameters as points in a coordinate space. The function that maps these points to the running time of kernels, in general, will not determine a convex surface. However, this paper provides empirical evidence that the origin of this surface—an unoptimized kernel—and its global optimum—the fastest kernel—reside on a convex region. We call this hypothesis the “droplet expectation”. Consequently, a search method based on the coordinate descent algorithm tends to find the optimal kernel configuration quickly if the hypothesis holds. This approach—called Droplet Search—has been available in Apache TVM since April of 2023. Experimental results with six large deep learning models on various computing devices (ARM, Intel, AMD, and NVIDIA) indicate that Droplet Search is not only as effective as other AutoTVM search techniques but also two to ten times faster. Moreover, models generated by Droplet Search are competitive with those produced by TVM’s AutoScheduler (Ansor), despite the latter using four to five times more code transformations than AutoTVM.
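
To make the search procedure concrete, here is a hedged Python sketch of coordinate descent over a discrete parameter space; the toy cost function stands in for compiling and timing a real kernel configuration, which is what the actual search measures inside TVM.

```python
# Illustrative coordinate-descent search over a discrete schedule space.
# `measure` stands in for compiling and timing a kernel configuration.

def coordinate_descent(space: dict, measure, start=None):
    """space: parameter name -> ordered list of candidate values."""
    point = start or {name: values[0] for name, values in space.items()}
    best_cost = measure(point)
    improved = True
    while improved:                          # stop at a local optimum
        improved = False
        for name, values in space.items():
            idx = values.index(point[name])
            for step in (-1, +1):            # probe both neighbors on this axis
                j = idx + step
                if 0 <= j < len(values):
                    candidate = dict(point, **{name: values[j]})
                    cost = measure(candidate)
                    if cost < best_cost:
                        point, best_cost, improved = candidate, cost, True
    return point, best_cost

if __name__ == "__main__":
    space = {"tile": [1, 2, 4, 8, 16], "unroll": [1, 2, 4, 8]}
    # Toy convex cost whose minimum sits at tile=8, unroll=4.
    measure = lambda p: (p["tile"] - 8) ** 2 + (p["unroll"] - 4) ** 2
    print(coordinate_descent(space, measure))
```

Under the "droplet expectation", the origin and the optimum lie in one convex region, so this axis-by-axis descent reaches the best configuration with far fewer measurements than an exhaustive sweep.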

Citations: 0
Camouflage: Utility-Aware Obfuscation for Accurate Simulation of Sensitive Program Traces
IF 1.6 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-02-29 | DOI: 10.1145/3650110
Asmita Pal, Keerthana Desai, Rahul Chatterjee, Joshua San Miguel

Trace-based simulation is a widely used methodology for system design exploration. It relies on realistic traces that represent a range of behaviors to be evaluated, and such traces contain a lot of information about the application, its inputs, and the underlying system on which they were generated. Consequently, generating traces from real-world executions risks leaking sensitive information. To prevent this, traces can be obfuscated before release. However, obfuscation can undermine their utility, i.e., how realistically the program behavior was captured. To address this, we propose Camouflage, a novel obfuscation framework designed with awareness of the architectural properties required to preserve trace utility, while ensuring secrecy of the inputs used to generate the trace. Focusing on memory access traces, our extensive evaluation on various benchmarks shows that camouflaged traces preserve the performance measurements of the original execution, with an average τ correlation of 0.66. We model input secrecy as an input indistinguishability problem and show that the average security loss is 7.8%, better than that of traces generated by the state-of-the-art.
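
The utility figure quoted above is a Kendall τ rank correlation between performance measurements taken on original traces and on their obfuscated counterparts. The sketch below computes a plain τ for two made-up measurement vectors; the numbers are invented for illustration and are not results from the paper.

```python
from itertools import combinations

# Sketch of the utility check described above: Kendall's tau rank
# correlation between measurements on original traces and on obfuscated
# ones. The sample values are hypothetical; Camouflage reports an
# average tau of 0.66 across its benchmarks.

def kendall_tau(xs, ys):
    """Plain tau-a: (concordant - discordant) / total pairs."""
    assert len(xs) == len(ys) and len(xs) > 1
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total = len(xs) * (len(xs) - 1) / 2
    return (concordant - discordant) / total

if __name__ == "__main__":
    original_ipc   = [1.20, 0.85, 1.05, 0.60, 0.95]   # hypothetical measurements
    obfuscated_ipc = [1.15, 0.80, 1.10, 0.62, 0.78]
    print(round(kendall_tau(original_ipc, obfuscated_ipc), 2))  # 0.8
```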

Citations: 0
FASA-DRAM: Reducing DRAM Latency with Destructive Activation and Delayed Restoration
IF 1.6 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-02-23 | DOI: 10.1145/3649135
Haitao Du, Yuhan Qin, Song Chen, Yi Kang

DRAM memory is a performance bottleneck for many applications due to its high access latency. Previous work has mainly focused on data locality, introducing small-but-fast regions to cache frequently accessed data and thereby reduce the average latency. However, these locality-based designs face three challenges in modern multi-core systems: 1) inter-application interference leads to random memory access traffic; 2) fairness concerns prevent the memory controller from over-prioritizing data locality; 3) write-intensive applications have much lower locality and evict a substantial number of dirty entries. With frequent data movement between the fast in-DRAM cache and slow regular arrays, the overhead induced by moving data may even offset the performance and energy benefits of in-DRAM caching.

In this paper, we decouple the data movement process into two distinct phases. The first phase is Load-Reduced Destructive Activation (LRDA), which destructively promotes data into the in-DRAM cache. The second phase is Delayed Cycle-Stealing Restoration (DCSR), which restores the original data when DRAM bank is idle. LRDA decouples the most time-consuming restoration phase from activation, and DCSR hides the restoration latency through prevalent bank-level parallelism. We propose FASA-DRAM incorporating destructive activation and delayed restoration techniques to enable both in-DRAM caching and proactive latency-hiding mechanisms. Our evaluation shows that FASA-DRAM improves the average performance by 19.9% and reduces average DRAM energy consumption by 18.1% over DDR4 DRAM for four-core workloads, with less than 3.4% extra area overhead. Furthermore, FASA-DRAM outperforms state-of-the-art designs in both performance and energy efficiency.
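
The following toy latency model illustrates the intuition of decoupling restoration from activation: a read can complete after a shorter, destructive activation, and the deferred restore is hidden whenever the bank stays idle long enough. All timing constants are invented for illustration and are not the paper's DRAM parameters.

```python
# Toy back-of-the-envelope model of deferring row restoration.
# Every constant below is invented purely for illustration.

ACT_PARTIAL = 10   # cycles for a destructive activation (no full restore)
ACT_FULL    = 25   # cycles for a conventional activate-and-restore
READ        = 15   # cycles to stream the data out
RESTORE     = 15   # cycles to write the row contents back

def conventional_read() -> int:
    """Restoration sits on the critical path of the access."""
    return ACT_FULL + READ

def deferred_restore_read(bank_idle_after: int) -> int:
    """Critical-path latency when restoration is replayed later.
    If the bank stays idle long enough, the restore is fully hidden."""
    critical = ACT_PARTIAL + READ
    exposed_restore = max(0, RESTORE - bank_idle_after)
    return critical + exposed_restore

if __name__ == "__main__":
    print(conventional_read())                    # 40 cycles
    print(deferred_restore_read(bank_idle_after=20))  # 25 cycles: restore hidden
    print(deferred_restore_read(bank_idle_after=5))   # 35 cycles: partially exposed
```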

Citations: 0
Architectural support for sharing, isolating and virtualizing FPGA resources.
IF 1.6 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-02-16 | DOI: 10.1145/3648475
Panagiotis Miliadis, Dimitris Theodoropoulos, Dionisios N. Pnevmatikatos, Nectarios Koziris

FPGAs are increasingly popular in cloud environments for their ability to offer on-demand acceleration and improved compute efficiency. Providers would like to increase utilization by multiplexing customers on a single device, similar to how processing cores and memory are shared. Nonetheless, multi-tenancy still faces major architectural limitations, including: a) inefficient sharing of memory interfaces across hardware tasks, exacerbated by technological limitations and peculiarities; b) insufficient solutions for performance and data isolation and high quality of service; c) absent or simplistic allocation strategies to effectively distribute external FPGA memory across hardware tasks. This paper presents a full-stack solution for enabling multi-tenancy on FPGAs. Specifically, our work proposes an intra-FPGA virtualization layer to share FPGA interfaces and resources across tenants. To achieve efficient inter-connectivity between virtual FPGAs (vFPGAs) and external interfaces, we employ a compact network-on-chip architecture to optimize resource utilization. Dedicated memory management units implement the concept of virtual memory in FPGAs, providing mechanisms to isolate the address space and enable memory protection. We also introduce a memory segmentation scheme to effectively allocate FPGA address space and enhance isolation through hardware-software support, while preserving the efficacy of memory transactions. We assess our solution on an Alveo U250 Data Center FPGA Card, employing ten real-world benchmarks from the Rodinia and Rosetta suites. Our framework preserves the performance of hardware tasks from a non-virtualized environment, while enhancing the device's aggregate throughput through resource sharing: up to 3.96x in isolated and up to 2.31x in highly congested settings, where an external interface is shared across four vFPGAs. Finally, our work ensures a high quality of service, with hardware tasks achieving up to 0.95x of their native performance, even when resource sharing introduces interference from other accelerators.
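
As a rough illustration of segment-based isolation between tenants, the Python sketch below translates vFPGA-local offsets to physical addresses and faults on out-of-bounds accesses. Segment sizes, names, and fault behaviour are assumptions made for the example, not the paper's MMU design.

```python
# Illustrative sketch of segment-based address translation and protection
# for virtual FPGAs (vFPGAs). Sizes and fault behaviour are assumptions.
from dataclasses import dataclass

@dataclass
class Segment:
    base: int      # physical base address of the vFPGA's region
    limit: int     # size of the region in bytes

class SegmentationFault(Exception):
    pass

class VfpgaMmu:
    def __init__(self):
        self.segments = {}                      # vfpga_id -> Segment

    def map_vfpga(self, vfpga_id: int, base: int, limit: int) -> None:
        self.segments[vfpga_id] = Segment(base, limit)

    def translate(self, vfpga_id: int, offset: int) -> int:
        seg = self.segments[vfpga_id]
        if not 0 <= offset < seg.limit:         # enforce isolation between tenants
            raise SegmentationFault(f"vFPGA {vfpga_id}: offset {offset:#x} out of bounds")
        return seg.base + offset

if __name__ == "__main__":
    mmu = VfpgaMmu()
    mmu.map_vfpga(0, base=0x0000_0000, limit=256 * 2**20)   # 256 MiB each
    mmu.map_vfpga(1, base=0x1000_0000, limit=256 * 2**20)
    print(hex(mmu.translate(1, 0x42)))          # 0x10000042
    try:
        mmu.translate(0, 300 * 2**20)           # beyond vFPGA 0's segment
    except SegmentationFault as fault:
        print("blocked:", fault)
```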

Citations: 0
SLAP: Segmented Reuse-Time-Label Based Admission Policy for Content Delivery Network Caching
IF 1.6 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-02-09 | DOI: 10.1145/3646550
Ke Liu, Kan Wu, Hua Wang, Ke Zhou, Peng Wang, Ji Zhang, Cong Li

“Learned” admission policies have shown promise in improving Content Delivery Network (CDN) cache performance and lowering operational costs. Unfortunately, existing learned policies are optimized for a few fixed cache sizes, while in reality cache sizes often vary over time in an unpredictable manner. As a result, existing solutions cannot provide consistent benefits in production settings.

We present SLAP, a learned CDN cache admission approach based on segmented object reuse time prediction. SLAP predicts an object’s reuse time range using a Long Short-Term Memory (LSTM) model and admits objects that will be reused (before eviction) given the current cache size. SLAP decouples model training from cache size, allowing it to adapt to arbitrary sizes. The key to our solution is a novel segmented labeling scheme that makes SLAP effective without requiring precise prediction of object reuse times. To further make SLAP a practical and efficient solution, we propose aggressively reusing computation and training on sampled traces to optimize model training, and a specialized predictor architecture that overlaps prediction computation with miss-object fetching to optimize model inference. Our experiments using production CDN traces show that SLAP achieves significantly lower write traffic (38%-59%), longer SSD lifetimes (104%-178%), and a consistently higher hit rate (3.2%-11.7%), and requires no effort to adapt to changing cache sizes, outperforming existing policies.
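
The sketch below illustrates the segmented-label idea: reuse times are binned into coarse segments for training, and admission compares the predicted segment with an eviction horizon derived from the current cache size. The segment boundaries and horizon heuristic are illustrative; SLAP's actual predictor is an LSTM trained on production traces.

```python
import bisect

# Sketch of segmented reuse-time labelling: the predictor emits a coarse
# range (segment) rather than an exact reuse time, and admission compares
# that range with the current cache's eviction horizon. Boundaries and the
# horizon heuristic are illustrative assumptions.

SEGMENT_BOUNDS = [10, 100, 1_000, 10_000, 100_000]   # reuse-time upper bounds (requests)

def reuse_time_to_label(reuse_time: int) -> int:
    """Training label: index of the segment the reuse time falls into."""
    return bisect.bisect_left(SEGMENT_BOUNDS, reuse_time)

def admit(predicted_label: int, eviction_horizon: int) -> bool:
    """Admit an object only if its predicted reuse happens before eviction.
    The horizon grows with cache size, so one model serves many sizes."""
    if predicted_label >= len(SEGMENT_BOUNDS):
        return False                                  # predicted to never be reused in time
    return SEGMENT_BOUNDS[predicted_label] <= eviction_horizon

if __name__ == "__main__":
    print(reuse_time_to_label(42))                          # 1 -> segment (10, 100]
    print(admit(predicted_label=1, eviction_horizon=5_000)) # True
    print(admit(predicted_label=4, eviction_horizon=5_000)) # False
```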
Citations: 0
Winols: A Large-Tiling Sparse Winograd CNN Accelerator on FPGAs
IF 1.6 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-01-31 | DOI: 10.1145/3643682
Kunpeng Xie, Ye Lu, Xinyu He, Dezhi Yi, Huijuan Dong, Yao Chen

Convolutional Neural Networks (CNNs) can benefit from the computational reductions provided by the Winograd minimal filtering algorithm and by weight pruning. However, harnessing the potential of both methods simultaneously introduces complexity in designing pruning algorithms and accelerators. Prior studies aimed to establish regular sparsity patterns in the Winograd domain, but they were primarily suited for small tiles, with the domain transformation dictating the sparsity ratio. The irregularities in data access and domain transformation pose challenges in accelerator design, especially for larger Winograd tiles. This paper introduces “Winols,” an innovative algorithm-hardware co-design strategy that emphasizes the strengths of the large-tiling Winograd algorithm. Through a spatial-to-Winograd relevance degree evaluation, we extensively explore domain transformation and propose a cross-domain pruning technique that retains sparsity across both the spatial and Winograd domains. To compress pruned weight matrices, we devise a relative column encoding scheme. We further design an FPGA-based accelerator for CNN models with large Winograd tiles and sparse matrix-vector operations. Evaluations indicate our pruning method achieves up to 80% weight tile sparsity in the Winograd domain without compromising accuracy. Our Winols accelerator outperforms a dense accelerator by a factor of 31.7× in inference latency. Compared with prevailing sparse Winograd accelerators, Winols reduces latency by an average of 10.9× and improves DSP and energy efficiency by over 5.6× and 5.7×, respectively. Compared with CPU and GPU platforms, the Winols accelerator with tile size 8 × 8 achieves 24.6× and 2.84× energy efficiency improvements, respectively.
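
For readers unfamiliar with Winograd-domain sparsity, the NumPy sketch below transforms a 3×3 filter into the Winograd domain for the small F(2×2, 3×3) tile and prunes it there by magnitude. Winols targets much larger tiles and a cross-domain pruning criterion, so this is only a simplified illustration.

```python
import numpy as np

# Minimal sketch: transform a 3x3 filter into the Winograd domain for
# F(2x2, 3x3) and prune small entries there. Magnitude thresholding is a
# simplification of the cross-domain pruning the abstract describes.

G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])

def to_winograd(filter_3x3: np.ndarray) -> np.ndarray:
    """U = G g G^T: the 4x4 Winograd-domain image of a 3x3 filter."""
    return G @ filter_3x3 @ G.T

def prune_winograd(u: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude entries until `sparsity` is reached."""
    threshold = np.quantile(np.abs(u), sparsity)
    return np.where(np.abs(u) >= threshold, u, 0.0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    g = rng.standard_normal((3, 3))
    u_sparse = prune_winograd(to_winograd(g), sparsity=0.8)
    print(f"{np.count_nonzero(u_sparse)} / {u_sparse.size} nonzeros")
```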

Citations: 0