Seokwon Kang, Jongbin Kim, Gyeongyong Lee, Jeongmyung Lee, Jiwon Seo, Hyungsoo Jung, Yong Ho Song, Yongjun Park
As solid-state drives (SSDs) with sufficient computing power have recently become the dominant devices in modern computer systems, in-storage processing (ISP), which processes data within the storage without transferring it to the host memory, is being utilized in various emerging applications. The main challenge of ISP is to deliver storage data to the offloaded workload. This is difficult because of the information gap between the host and storage, the data consistency problem between the host and offloaded workloads, and SSD-specific hardware limitations. Moreover, because the offloaded workloads use internal SSD resources, host I/O performance might be degraded due to resource conflicts. Although several ISP frameworks have been proposed, existing ISP approaches that do not deeply consider the internal SSD behavior are often insufficient to support efficient ISP workload offloading with high programmability. In this paper, we propose an ISP agent, a lightweight ISP workload offloading framework for SSD devices. The ISP agent provides I/O and memory interfaces that allow users to run existing function codes on SSDs without major code modifications, and separates the resources for the offloaded workloads from the existing SSD firmware to minimize interference with host I/O processing. The ISP agent also provides further optimization opportunities for the offloaded workload by considering SSD architectures. We have implemented the ISP agent on the OpenSSD Cosmos+ board and evaluated its performance using synthetic benchmarks and a real-world ISP-assisted database checkpointing application. The experimental results demonstrate that the ISP agent enhances host application performance while increasing ISP programmability, and that the optimization opportunities provided by the ISP agent can significantly improve ISP-side performance without compromising host I/O processing.
{"title":"ISP Agent: A Generalized In-Storage-Processing Workload Offloading Framework by Providing Multiple Optimization Opportunities","authors":"Seokwon Kang, Jongbin Kim, Gyeongyong Lee, Jeongmyung Lee, Jiwon Seo, Hyungsoo Jung, Yong Ho Song, Yongjun Park","doi":"10.1145/3632951","DOIUrl":"https://doi.org/10.1145/3632951","url":null,"abstract":"As solid-state drives (SSDs) with sufficient computing power have recently become the dominant devices in modern computer systems, in-storage processing (ISP), which processes data within the storage without transferring it to the host memory, is being utilized in various emerging applications. The main challenge of ISP is to deliver storage data to the offloaded workload. This is difficult because of the information gap between the host and storage, the data consistency problem between the host and offloaded workloads, and SSD-specific hardware limitations. Moreover, because the offloaded workloads use internal SSD resources, host I/O performance might be degraded due to resource conflicts. Although several ISP frameworks have been proposed, existing ISP approaches that do not deeply consider the internal SSD behavior are often insufficient to support efficient ISP workload offloading with high programmability. In this paper, we propose an ISP agent, a lightweight ISP workload offloading framework for SSD devices. The ISP agent provides I/O and memory interfaces that allow users to run existing function codes on SSDs without major code modifications, and separates the resources for the offloaded workloads from the existing SSD firmware to minimize interference with host I/O processing. The ISP agent also provides further optimization opportunities for the offloaded workload by considering SSD architectures. We have implemented the ISP agent on the OpenSSD Cosmos+ board and evaluated its performance using synthetic benchmarks and a real-world ISP-assisted database checkpointing application. The experimental results demonstrate that the ISP agent enhances host application performance while increasing ISP programmability, and that the optimization opportunities provided by the ISP agent can significantly improve ISP-side performance without compromising host I/O processing.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"45 35","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134902822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yunping Zhao, Sheng Ma, Hengzhu Liu, Libo Huang, Yi Dai
Deep Neural Networks (DNNs) have achieved great progress in academia and industry, but they have become computationally and memory intensive as network depth increases. Previous designs seek breakthroughs at the software and hardware levels to mitigate these challenges. At the software level, neural network compression techniques have effectively reduced network scale and energy consumption; however, conventional compression algorithms are complex and energy intensive. At the hardware level, improvements in the semiconductor process have effectively reduced power and energy consumption, but it is difficult for the traditional von Neumann architecture to further reduce power consumption due to the memory wall and the end of Moore's law. To overcome these challenges, spintronic-device-based DNN machines have emerged for their non-volatility, ultra-low power, and high energy efficiency. However, no spin-based design has achieved innovation at both the software and hardware levels; in particular, there is no systematic study of spin-based DNN architectures for deploying compressed networks. In our study, we present an ultra-efficient Spin-based Architecture for Compressed DNNs (SAC) to substantially reduce power and energy consumption. Specifically, we propose a One-Step Compression (OSC) algorithm to reduce the computational complexity with minimal accuracy loss. We also propose a spin-based architecture to realize better performance for the compressed network. Furthermore, we introduce a novel computation flow that enables the reuse of activations and weights. Experimental results show that our study can reduce the computational complexity of the compression algorithm from \(\mathcal{O}(Tk^3)\) to \(\mathcal{O}(k^2 \log k)\) and achieve a 14×–40× compression ratio. Furthermore, our design attains a 2× enhancement in power efficiency and a 5× improvement in computational efficiency compared to Eyeriss. Our models are available at an anonymous link: https://bit.ly/39cdtTa.
{"title":"SAC: An Ultra-Efficient Spin-based Architecture for Compressed DNNs","authors":"Yunping Zhao, Sheng Ma, Hengzhu Liu, Libo Huang, Yi Dai","doi":"10.1145/3632957","DOIUrl":"https://doi.org/10.1145/3632957","url":null,"abstract":"Deep Neural Networks (DNNs) have achieved great progress in academia and industry. But they have become computational and memory intensive with the increase of network depth. Previous designs seek breakthroughs in software and hardware levels to mitigate these challenges. At the software level, neural network compression techniques have effectively reduced network scale and energy consumption. However, the conventional compression algorithm is complex and energy intensive. At the hardware level, the improvements in the semiconductor process have effectively reduced power and energy consumption. However, it is difficult for the traditional Von-Neumann architecture to further reduce the power consumption, due to the memory wall and the end of Moore’s law. To overcome these challenges, the spintronic device based DNN machines have emerged for their non-volatility, ultra low power, and high energy efficiency. However, there is no spin-based design has achieved innovation at both the software and hardware level. Specifically, there is no systematic study of spin-based DNN architecture to deploy compressed networks. In our study, we present an ultra-efficient Spin-based Architecture for Compressed DNNs (SAC), to substantially reduce power consumption and energy consumption. Specifically, we propose a One-Step Compression algorithm (OSC) to reduce the computational complexity with minimum accuracy loss. We also propose a spin-based architecture to realize better performance for the compressed network. Furthermore, we introduce a novel computation flow that enables the reuse of activations and weights. Experimental results show that our study can reduce the computational complexity of compression algorithm from (mathcal {O}(Tk^3) ) to (mathcal {O}(k^2 log k) ) , and achieve 14 × ∼ 40 × compression ratio. Furthermore, our design can attain a 2 × enhancement in power efficiency and a 5 × improvement in computational efficiency compared to the Eyeriss. Our models are available at an anonymous link https://bit.ly/39cdtTa.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"11 10","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134991123","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Joongun Park, Seunghyo Kang, Sanghyeon Lee, Taehoon Kim, Jongse Park, Youngjin Kwon, Jaehyuk Huh
In cloud-based serverless computing, an application consists of multiple functions provided by mutually distrusting parties. For secure serverless computing, the hardware-based trusted execution environment (TEE) can provide strong isolation among functions. However, not only protecting each function from the host OS and other functions, but also protecting the host system from the functions, is critical for the security of the cloud servers. This emerging trusted serverless computing model poses new challenges: each TEE must be isolated from the host system bi-directionally, and the system calls from it must be validated. In addition, the resource utilization of each TEE must be accountable in a mutually trusted way. However, the current TEE model cannot efficiently represent such trusted serverless applications. To overcome the lack of such hardware support, this paper proposes an extended TEE model called Cloister, designed for trusted serverless computing. Cloister proposes four new key techniques. First, it extends the hardware-based memory isolation in SGX to confine a deployed function only within its TEE (enclave). Second, it proposes a trusted monitor enclave that filters and validates system calls from enclaves. Third, it provides a trusted resource accounting mechanism for enclaves that is agreeable to both service developers and cloud providers. Finally, Cloister accelerates enclave loading by redesigning its memory verification for fast function deployment. Using an emulated Intel SGX platform with the proposed extensions, this paper shows that trusted serverless applications can be effectively supported with small changes in the SGX hardware.
{"title":"Hardware Hardened Sandbox Enclaves for Trusted Serverless Computing","authors":"Joongun Park, Seunghyo Kang, Sanghyeon Lee, Taehoon Kim, Jongse Park, Youngjin Kwon, Jaehyuk Huh","doi":"10.1145/3632954","DOIUrl":"https://doi.org/10.1145/3632954","url":null,"abstract":"In cloud-based serverless computing, an application consists of multiple functions provided by mutually distrusting parties. For secure serverless computing, the hardware-based trusted execution environment (TEE) can provide strong isolation among functions. However, not only protecting each function from the host OS and other functions, but also protecting the host system from the functions, is critical for the security of the cloud servers. Such an emerging trusted serverless computing poses new challenges: each TEE must be isolated from the host system bi-directionally, and the system calls from it must be validated. In addition, the resource utilization of each TEE must be accountable in a mutually trusted way. However, the current TEE model cannot efficiently represent such trusted serverless applications. To overcome the lack of such hardware support, this paper proposes an extended TEE model called Cloister , designed for trusted serverless computing. Cloister proposes four new key techniques. First, it extends the hardware-based memory isolation in SGX to confine a deployed function only within its TEE (enclave). Second, it proposes a trusted monitor enclave that filters and validates system calls from enclaves. Third, it provides a trusted resource accounting mechanism for enclaves which is agreeable to both service developers and cloud providers. Finally, Cloister accelerates enclave loading by redesigning its memory verification for fast function deployment. Using an emulated Intel SGX platform with the proposed extensions, this paper shows that trusted serverless applications can be effectively supported with small changes in the SGX hardware.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"6 7","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134991851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Leveraging the SIMD capability of modern CPU architectures is mandatory to take full advantage of their increased performance. To exploit this capability, binary executables must be vectorized, either manually by developers or automatically by a tool. For this reason, the compilation research community has developed several strategies for transforming scalar code into a vectorized implementation. However, most existing automatic vectorization techniques in modern compilers are designed for regular codes, leaving irregular applications with non-contiguous data access patterns at a disadvantage. In this paper, we present a new tool, Autovesk, that automatically generates vectorized code from scalar code, specifically targeting irregular data access patterns. We describe how our method transforms a graph of scalar instructions into a vectorized one, using different heuristics to reduce the number or cost of instructions. Finally, we demonstrate the effectiveness of our approach on various computational kernels using Intel AVX-512 and ARM SVE. We compare the speedups of Autovesk-vectorized code over the automatic vectorization optimizations of GCC, Clang/LLVM, and Intel. We achieve competitive results on linear kernels and up to 11× speedups on irregular kernels.
{"title":"Autovesk: Automatic vectorized code generation from unstructured static kernels using graph transformations","authors":"Hayfa Tayeb, Ludovic Paillat, Bérenger Bramas","doi":"10.1145/3631709","DOIUrl":"https://doi.org/10.1145/3631709","url":null,"abstract":"Leveraging the SIMD capability of modern CPU architectures is mandatory to take full advantage of their increased performance. To exploit this capability, binary executables must be vectorized, either manually by developers or automatically by a tool. For this reason, the compilation research community has developed several strategies for transforming scalar code into a vectorized implementation. However, most existing automatic vectorization techniques in modern compilers are designed for regular codes, leaving irregular applications with non-contiguous data access patterns at a disadvantage. In this paper, we present a new tool, Autovesk, that automatically generates vectorized code from scalar code, specifically targeting irregular data access patterns. We describe how our method transforms a graph of scalar instructions into a vectorized one, using different heuristics to reduce the number or cost of instructions. Finally, we demonstrate the effectiveness of our approach on various computational kernels using Intel AVX-512 and ARM SVE. We compare the speedups of Autovesk vectorized code over GCC, Clang LLVM and Intel automatic vectorization optimizations. We achieve competitive results on linear kernels and up to 11x speedups on irregular kernels.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":" 42","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135191396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qubit mapping for NISQ superconducting quantum computers is essential to fidelity and resource utilization. Existing qubit mapping schemes face challenges such as crosstalk, SWAP overheads, and diverse device topologies, leading to qubit resource underutilization and low fidelity in computing results. This paper introduces QuCloud+, a new qubit mapping scheme that tackles these challenges with several new designs. (1) QuCloud+ supports single/multi-programming quantum computing on quantum chips with 2D/3D topologies. (2) QuCloud+ partitions physical qubits for concurrent quantum programs with a crosstalk-aware community detection technique and further allocates qubits according to qubit degree, improving fidelity and resource utilization. (3) QuCloud+ includes an X-SWAP mechanism that avoids SWAPs with high crosstalk errors and enables inter-program SWAPs to reduce SWAP overheads. (4) QuCloud+ schedules concurrent quantum programs to be mapped and executed based on estimated fidelity. Experimental results show that, compared with an existing representative multi-programming study [12], QuCloud+ achieves up to 9.03% higher fidelity and saves required SWAPs during mapping, reducing the number of inserted CNOT gates by 40.92%. Compared with a recent study [30] that enables post-mapping gate optimizations to further reduce gates, QuCloud+ reduces the post-mapping circuit depth by 21.91% while using a similar number of gates.
{"title":"QuCloud+: A Holistic Qubit Mapping Scheme for Single/Multi-programming on 2D/3D NISQ Quantum Computers","authors":"Lei Liu, Xinglei Dou","doi":"10.1145/3631525","DOIUrl":"https://doi.org/10.1145/3631525","url":null,"abstract":"Qubit mapping for NISQ superconducting quantum computers is essential to fidelity and resource utilization. The existing qubit mapping schemes meet challenges, e.g., crosstalk, SWAP overheads, diverse device topologies, etc., leading to qubit resource underutilization and low fidelity in computing results. This paper introduces QuCloud+, a new qubit mapping scheme that tackles these challenges. QuCloud+ has several new designs. (1) QuCloud+ supports single/multi-programming quantum computing on quantum chips with 2D/3D topology. (2) QuCloud+ partitions physical qubits for concurrent quantum programs with the crosstalk-aware community detection technique and further allocates qubits according to qubit degree, improving fidelity and resource utilization. (3) QuCloud+ includes an X-SWAP mechanism that avoids SWAPs with high crosstalk errors and enables inter-program SWAPs to reduce the SWAP overheads. (4) QuCloud+ schedules concurrent quantum programs to be mapped and executed based on estimated fidelity for the best practice. Experimental results show that, compared with the existing typical multi-programming study [12], QuCloud+ achieves up to 9.03% higher fidelity and saves on the required SWAPs during mapping, reducing the number of CNOT gates inserted by 40.92%. Compared with a recent study [30] that enables post-mapping gate optimizations to further reduce gates, QuCloud+ reduces the post-mapping circuit depth by 21.91% while using a similar number of gates.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"280 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135475092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vector architectures are widely employed in processors for neural networks, signal processing, and high-performance computing; however, their performance is limited by inefficient column-major memory access. The column-major access limitation originates from the unsuitable mapping of multidimensional data structures to two-dimensional vector memory spaces. In addition, the traditional data layout mapping method creates an irreconcilable conflict between row- and column-major accesses. Ideally, both row- and column-major accesses would take advantage of the bank parallelism of vector memory. To this end, we propose the Interleaved Data Layout (IDL) method for vector memory, which distributes vector elements into different banks regardless of whether they belong to a row- or column-major access pattern, so that any vector memory access can benefit from bank parallelism. Additionally, we propose an Extension Vector Memory (EVM) architecture to realize IDL in vector memory. EVM can support two data layout methods and vector memory access modes simultaneously. The key idea is to continuously distribute the data that needs to be accessed from the main memory to different banks during the loading period. Thus, EVM can provide greater spatial locality through careful programming and support from the extended ISA. The experimental results show that the proposed architecture achieves a 1.43× improvement over state-of-the-art vector processors, with an area cost of only 1.73%. Furthermore, energy consumption is reduced by 50.1%.
{"title":"Extension VM: Interleaved Data Layout in Vector Memory","authors":"Dunbo Zhang, Qingjie Lang, Ruoxi Wang, Li Shen","doi":"10.1145/3631528","DOIUrl":"https://doi.org/10.1145/3631528","url":null,"abstract":"While vector architecture is widely employed in processors for neural networks, signal processing, and high-performance computing; however, its performance is limited by inefficient column-major memory access. The column-major access limitation originates from the unsuitable mapping of multidimensional data structures to two-dimensional vector memory spaces. In addition, the traditional data layout mapping method creates an irreconcilable conflict between row- and column-major accesses. Ideally, both row- and column-major accesses can take advantage of the bank parallelism of vector memory. To this end, we propose the Interleaved Data Layout (IDL) method in vector memory, which can distribute vector elements into different banks regardless of whether they are in the row- or column major category, so that any vector memory access can benefit from bank parallelism. Additionally, we propose an Extension Vector Memory (EVM) architecture to achieve IDL in vector memory. EVM can support two data layout methods and vector memory access modes simultaneously. The key idea is to continuously distribute the data that needs to be accessed from the main memory to different banks during the loading period. Thus, EVM can provide a larger spatial locality level through careful programming and the extension ISA support. The experimental results showed a 1.43-fold improvement of state-of-the-art vector processors by the proposed architecture, with an area cost of only 1.73%. Furthermore, the energy consumption was reduced by 50.1%.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"79 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135480149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hybrid flash-based storage constructed with high-density, low-cost flash memory has become increasingly popular in consumer devices over the last decade due to its low cost. However, its poor reliability is one of the major concerns. To protect critical data and guarantee the user experience, some methods have been proposed to improve the reliability of consumer devices with non-hybrid flash storage. However, with the widespread use of hybrid storage, these methods result in severe problems, including significant performance and endurance degradation, because they do not consider the different characteristics of the flash memory in hybrid storage, e.g., performance, endurance, and access granularity. To address these problems, a critical data backup (CDB) design is proposed to ensure critical data reliability at a low cost. The basic idea is to first accumulate two copies of critical data in the fast memory to make full use of its performance and endurance. One copy is then migrated to the slow memory in stripes to avoid the write amplification caused by the different access granularities. By respecting the different characteristics of the flash memory in hybrid storage, CDB achieves encouraging performance and endurance improvements compared with the state-of-the-art. Furthermore, to avoid the performance and lifetime degradation caused by backup data occupying too much space in fast memory, CDB Pro is designed, integrating two advanced schemes. The first uses the pseudo-single-level-cell (pSLC) technique to turn part of the slow memory into high-performance space; with this additional high-performance space, data tend to be fully updated before being evicted to slow memory, generating more invalid data and reducing eviction costs. The second categorizes data into three types according to their life cycles; by placing data of the same type in the same block, eviction efficiency is improved. Both schemes improve device performance and lifetime on top of CDB. Experiments are conducted to prove the efficiency of CDB and CDB Pro. Experimental results show that, compared with the state-of-the-art, CDB ensures critical data reliability with lower device performance and lifetime loss, while CDB Pro diminishes the loss further.
{"title":"Critical Data Backup with Hybrid Flash-Based Consumer Devices","authors":"Longfei Luo, Dingcui Yu, Yina Lv, Liang Shi","doi":"10.1145/3631529","DOIUrl":"https://doi.org/10.1145/3631529","url":null,"abstract":"Hybrid flash-based storage constructed with high-density and low-cost flash memory has become increasingly popular in consumer devices in the last decade due to its low cost. However, its poor reliability is one of the major concerns. To protect critical data for guaranteeing user experience, some methods are proposed to improve the reliability of consumer devices with non-hybrid flash storage. However, with the widespread use of hybrid storage, these methods will result in severe problems, including significant performance and endurance degradation. This is caused by that the different characteristics of flash memory in hybrid storage are not considered, e.g., performance, endurance, and access granularity. To address the above problems, a critical data backup (CDB) design is proposed to ensure critical data reliability at a low cost. The basic idea is to accumulate two copies of critical data in the fast memory first to make full use of its performance and endurance. Then one copy will be migrated to the slow memory in the stripe to avoid the write amplification caused by different access granularity between them. By respecting the different characteristics of flash memory in hybrid storage, CDB can achieve encouraging performance and endurance improvement compared with the state-of-the-art. Furthermore, to avoid performance and lifetime degradation caused by the backup data occupying too much space of fast memory, CDB Pro is designed. Two advanced schemes are integrated. One is making use of the pseudo-single-level-cell (pSLC) technique to make a part of slow memory become high-performance. By supplying some high-performance space, data will be fully updated before being evicted to slow memory. More invalid data are generated which reduces eviction costs. Another is to categorize data into three types according to their different life cycles. By putting the same type of data in a block, the eviction efficiency is improved. Therefore, both of them can improve device performance and lifetime based on CDB. Experiments are conducted to prove the efficiency of CDB and CDB Pro. Experimental results show that compared with the state-of-the-arts, CDB can ensure critical data reliability with lower device performance and lifetime loss while CDB Pro can diminish the loss further.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"25 9","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135634735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
To meet the high-performance requirements of safety-critical real-time tasks, many-core platforms with high parallelism are widely used, where a network-on-chip (NoC) is generally employed for inter-core communication due to its scalability and high efficiency. Unfortunately, NoCs suffer from large uncertainties arising both from the highly parallel architecture and from the distributed scheduling strategy (e.g., wormhole flow control), which makes response-time upper bounds hard to estimate (i.e., they are either unsafe or pessimistic). To solve this problem for DAG-based real-time parallel tasks, we propose DAG-Order, an order-based dynamic DAG scheduling approach that strictly guarantees NoC real-time services. First, rather than building a new analysis to fit the widely used best-effort wormhole NoC, DAG-Order is built upon an advanced low-latency NoC with single-cycle long-range traversal (SLT), which avoids unpredictable parallel transmission on the shared source-destination links of wormhole NoCs. Second, DAG-Order is a non-preemptive dynamic scheduling strategy that jointly considers communication and computation workloads and fits the SLT NoC. With such an order-based dynamic scheduling strategy, provably safe bounds are ensured by enforcing certain order constraints among DAG edges/vertices that eliminate execution-timing anomalies at runtime. Third, the order constraints are further relaxed for higher average-case runtime performance without compromising bound safety. Finally, an effective heuristic algorithm that seeks a proper schedule order is developed to tighten the bounds. Experiments on synthetic and realistic benchmarks demonstrate that DAG-Order performs better than state-of-the-art related scheduling methods.
{"title":"DAG-Order: An Order-Based Dynamic DAG Scheduling for Real-Time Networks-on-Chip","authors":"Peng Chen, Hui Chen, Weichen Liu, Linbo Long, Wanli Chang, Nan Guan","doi":"10.1145/3631527","DOIUrl":"https://doi.org/10.1145/3631527","url":null,"abstract":"With the high-performance requirement of safety-critical real-time tasks, the platforms of many-core processors with high parallelism are widely utilized, where network-on-chip (NoC) is generally employed for inter-core communication due to its scalability and high efficiency. Unfortunately, large uncertainties are suffered on NoCs from both the overly parallel architecture and the distributed scheduling strategy (e.g., wormhole flow control), which complicates the response time upper bounds estimation (i.e., either unsafe or pessimistic). For DAG-based real-time parallel tasks, to solve this problem, we propose DAG-Order, an order-based dynamic DAG scheduling approach, which strictly guarantees NoC real-time services. Firstly, rather than build the new analysis to fit the widely-used best-effort wormhole NoC, DAG-Order is built upon a kind of advanced low-latency NoC with s ingle-cycle l ong-range t raversal (SLT) to avoid the unpredictable parallel transmission on the shared source-destination link of wormhole NoCs. Secondly, DAG-Order is a non-preemptive dynamic scheduling strategy, which jointly considers communication as well as computation workloads, and fits SLT NoC. With such an order-based dynamic scheduling strategy, the provably bound safety is ensured by enforcing certain order constraints among DAG edges/vertices that eliminate the execution-timing anomaly at runtime. Thirdly, the order constraints are further relaxed for higher average-case runtime performance without compromising bound safety. Finally, an effective heuristic algorithm seeking a proper schedule order is developed to tighten the bounds. Experiments on synthetic and realistic benchmarks demonstrate that DAG-Order performs better than the state-of-the-art related scheduling methods.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"42 11","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135818860","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhang Jiang, Ying Chen, Xiaoli Gong, Jin Zhang, Wenwen Wang, Pen-Chung Yew
Code-reuse attacks have the capability to craft malicious instructions from small code fragments, commonly referred to as “gadgets.” These gadgets are generated by JIT (Just-In-Time) engines as integral components of native instructions, with the flexibility to be embedded in various fields, including the Displacement field. In this paper, we introduce a novel approach for potential gadget insertion, achieved through the manipulation of ModR/M and SIB bytes via JavaScript code. This manipulation influences a JIT engine’s register allocation and code generation algorithms. These newly generated gadgets do not rely on constants and thus evade existing constant blinding schemes. Furthermore, they can be combined with 1-byte constants, a combination that proves to be challenging to defend against using conventional constant blinding techniques. To showcase the feasibility of our approach, we provide proof-of-concept (POC) code for three distinct types of gadgets. Our research underscores the potential for attackers to exploit ModR/M and SIB bytes within JIT-generated native instructions. In response, we propose a practical defense mechanism to mitigate such attacks. We introduce JiuJITsu, a security-enhanced register allocation scheme designed to prevent harmful register assignments during the JIT code generation phase, thereby thwarting the generation of these malicious gadgets. We conduct a comprehensive analysis of JiuJITsu’s effectiveness in defending against code-reuse attacks. Our findings demonstrate that it incurs a runtime overhead of under 1% when evaluated using JetStream2 benchmarks and real-world websites.
{"title":"JiuJITsu: Removing Gadgets with Safe Register Allocation for JIT Code Generation","authors":"Zhang Jiang, Ying Chen, Xiaoli Gong, Jin Zhang, Wenwen Wang, Pen-Chung Yew","doi":"10.1145/3631526","DOIUrl":"https://doi.org/10.1145/3631526","url":null,"abstract":"Code-reuse attacks have the capability to craft malicious instructions from small code fragments, commonly referred to as ”gadgets.” These gadgets are generated by JIT (Just-In-Time) engines as integral components of native instructions, with the flexibility to be embedded in various fields, including Displacement . In this paper, we introduce a novel approach for potential gadget insertion, achieved through the manipulation of ModR/M and SIB bytes via JavaScript code. This manipulation influences a JIT engine’s register allocation and code generation algorithms. These newly generated gadgets do not rely on constants and thus evade existing constant blinding schemes. Furthermore, they can be combined with 1-byte constants, a combination that proves to be challenging to defend against using conventional constant blinding techniques. To showcase the feasibility of our approach, we provide proof-of-concept (POC) code for three distinct types of gadgets. Our research underscores the potential for attackers to exploit ModR/M and SIB bytes within JIT-generated native instructions. In response, we propose a practical defense mechanism to mitigate such attacks. We introduce JiuJITsu , a security-enhanced register allocation scheme designed to prevent harmful register assignments during the JIT code generation phase, thereby thwarting the generation of these malicious gadgets. We conduct a comprehensive analysis of JiuJITsu ’s effectiveness in defending against code-reuse attacks. Our findings demonstrate that it incurs a runtime overhead of under 1% when evaluated using JetStream2 benchmarks and real-world websites.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"43 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135819012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep neural network inference has become a vital workload for many systems, from edge-based computing to data centers. To reduce the performance and power requirements of DNNs running on these systems, pruning is commonly used as a way to maintain most of the accuracy while significantly reducing the workload. Unfortunately, accelerators designed for unstructured pruning typically employ expensive methods to either determine non-zero activation-weight pairings or reorder computation. These methods require additional storage and memory accesses compared to the more regular data access patterns seen in structurally pruned models. However, even existing works that focus on the more regular access patterns of structured pruning suffer from inefficient designs that either ignore or expensively handle activation sparsity, leading to low performance. To address these inefficiencies, we leverage structured pruning and propose the multiply-and-fire (MnF) technique, which tackles these problems in three ways: (a) a novel event-driven dataflow that naturally exploits activation sparsity without complex, high-overhead logic; (b) an optimized, activation-centric dataflow that maximizes the reuse of activation data in computation and ensures the data are fetched only once from off-chip global and on-chip local memory; and (c) an energy-efficient, high-performance sparsity-aware DNN accelerator built on the proposed event-driven dataflow. Our results show that the MnF accelerator achieves a significant improvement across a number of modern benchmarks and presents a new direction for enabling highly efficient AI inference for both CNN and MLP workloads. Overall, this work achieves a geometric-mean 11.2× higher energy efficiency and a 1.41× speedup compared to a state-of-the-art sparsity-aware accelerator.
{"title":"Multiply-and-Fire (MnF): An Event-driven Sparse Neural Network Accelerator","authors":"Miao Yu, Tingting Xiang, Venkata Pavan Kumar Miriyala, Trevor E. Carlson","doi":"10.1145/3630255","DOIUrl":"https://doi.org/10.1145/3630255","url":null,"abstract":"Deep neural network inference has become a vital workload for many systems, from edge-based computing to data centers. To reduce the performance and power requirements for DNNs running on these systems, pruning is commonly used as a way to maintain most of the accuracy of the system while significantly reducing the workload requirements. Unfortunately, accelerators designed for unstructured pruning typically employ expensive methods to either determine non-zero activation-weight pairings or reorder computation. These methods require additional storage and memory accesses compared to the more regular data access patterns seen in structurally pruned models. However, even existing works that focus on the more regular access patterns seen in structured pruning continue to suffer from inefficient designs, which either ignore or expensively handle activation sparsity leading to low performance. To address these inefficiencies, we leverage structured pruning and propose the multiply-and-fire (MnF) technique, which aims to solve these problems in three ways: (a) the use of a novel event-driven dataflow that naturally exploits activation sparsity without complex, high-overhead logic; (b) an optimized dataflow takes an activation-centric approach, which aims to maximize the reuse of activation data in computation and ensures the data are only fetched once from off-chip global and on-chip local memory; (c) Based on the proposed event-driven dataflow, we develop an energy-efficient, high-performance sparsity-aware DNN accelerator. Our results show that our MnF accelerator achieves a significant improvement across a number of modern benchmarks and presents a new direction to enable highly efficient AI inference for both CNN and MLP workloads. Overall, this work achieves a geometric mean of 11.2 × higher energy efficiency and 1.41 × speedup compared to a state-of-the-art sparsity-aware accelerator.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"7 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136318164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}