Seokwon Kang, Jongbin Kim, Gyeongyong Lee, Jeongmyung Lee, Jiwon Seo, Hyungsoo Jung, Yong Ho Song, Yongjun Park
As solid-state drives (SSDs) with sufficient computing power have recently become the dominant devices in modern computer systems, in-storage processing (ISP), which processes data within the storage without transferring it to the host memory, is being utilized in various emerging applications. The main challenge of ISP is to deliver storage data to the offloaded workload. This is difficult because of the information gap between the host and storage, the data consistency problem between the host and offloaded workloads, and SSD-specific hardware limitations. Moreover, because the offloaded workloads use internal SSD resources, host I/O performance might be degraded due to resource conflicts. Although several ISP frameworks have been proposed, existing ISP approaches that do not deeply consider the internal SSD behavior are often insufficient to support efficient ISP workload offloading with high programmability. In this paper, we propose an ISP agent, a lightweight ISP workload offloading framework for SSD devices. The ISP agent provides I/O and memory interfaces that allow users to run existing function codes on SSDs without major code modifications, and separates the resources for the offloaded workloads from the existing SSD firmware to minimize interference with host I/O processing. The ISP agent also provides further optimization opportunities for the offloaded workload by considering SSD architectures. We have implemented the ISP agent on the OpenSSD Cosmos+ board and evaluated its performance using synthetic benchmarks and a real-world ISP-assisted database checkpointing application. The experimental results demonstrate that the ISP agent enhances host application performance while increasing ISP programmability, and that the optimization opportunities provided by the ISP agent can significantly improve ISP-side performance without compromising host I/O processing.
{"title":"ISP Agent: A Generalized In-Storage-Processing Workload Offloading Framework by Providing Multiple Optimization Opportunities","authors":"Seokwon Kang, Jongbin Kim, Gyeongyong Lee, Jeongmyung Lee, Jiwon Seo, Hyungsoo Jung, Yong Ho Song, Yongjun Park","doi":"10.1145/3632951","DOIUrl":"https://doi.org/10.1145/3632951","url":null,"abstract":"As solid-state drives (SSDs) with sufficient computing power have recently become the dominant devices in modern computer systems, in-storage processing (ISP), which processes data within the storage without transferring it to the host memory, is being utilized in various emerging applications. The main challenge of ISP is to deliver storage data to the offloaded workload. This is difficult because of the information gap between the host and storage, the data consistency problem between the host and offloaded workloads, and SSD-specific hardware limitations. Moreover, because the offloaded workloads use internal SSD resources, host I/O performance might be degraded due to resource conflicts. Although several ISP frameworks have been proposed, existing ISP approaches that do not deeply consider the internal SSD behavior are often insufficient to support efficient ISP workload offloading with high programmability. In this paper, we propose an ISP agent, a lightweight ISP workload offloading framework for SSD devices. The ISP agent provides I/O and memory interfaces that allow users to run existing function codes on SSDs without major code modifications, and separates the resources for the offloaded workloads from the existing SSD firmware to minimize interference with host I/O processing. The ISP agent also provides further optimization opportunities for the offloaded workload by considering SSD architectures. We have implemented the ISP agent on the OpenSSD Cosmos+ board and evaluated its performance using synthetic benchmarks and a real-world ISP-assisted database checkpointing application. The experimental results demonstrate that the ISP agent enhances host application performance while increasing ISP programmability, and that the optimization opportunities provided by the ISP agent can significantly improve ISP-side performance without compromising host I/O processing.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"45 35","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134902822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yunping Zhao, Sheng Ma, Hengzhu Liu, Libo Huang, Yi Dai
Deep Neural Networks (DNNs) have achieved great progress in academia and industry, but they have become computationally and memory intensive as network depth increases. Previous designs seek breakthroughs at the software and hardware levels to mitigate these challenges. At the software level, neural network compression techniques have effectively reduced network scale and energy consumption; however, conventional compression algorithms are complex and energy intensive. At the hardware level, improvements in the semiconductor process have effectively reduced power and energy consumption, but it is difficult for the traditional von Neumann architecture to further reduce power consumption due to the memory wall and the end of Moore's law. To overcome these challenges, spintronic-device-based DNN machines have emerged for their non-volatility, ultra-low power, and high energy efficiency. However, no spin-based design has achieved innovation at both the software and hardware levels; in particular, there is no systematic study of spin-based DNN architectures for deploying compressed networks. In our study, we present an ultra-efficient Spin-based Architecture for Compressed DNNs (SAC) to substantially reduce power and energy consumption. Specifically, we propose a One-Step Compression (OSC) algorithm to reduce the computational complexity with minimal accuracy loss. We also propose a spin-based architecture to realize better performance for the compressed network. Furthermore, we introduce a novel computation flow that enables the reuse of activations and weights. Experimental results show that our study can reduce the computational complexity of the compression algorithm from \(\mathcal{O}(Tk^3)\) to \(\mathcal{O}(k^2 \log k)\) and achieve a 14×–40× compression ratio. Furthermore, our design attains a 2× enhancement in power efficiency and a 5× improvement in computational efficiency compared to Eyeriss. Our models are available at an anonymous link: https://bit.ly/39cdtTa.
{"title":"SAC: An Ultra-Efficient Spin-based Architecture for Compressed DNNs","authors":"Yunping Zhao, Sheng Ma, Hengzhu Liu, Libo Huang, Yi Dai","doi":"10.1145/3632957","DOIUrl":"https://doi.org/10.1145/3632957","url":null,"abstract":"Deep Neural Networks (DNNs) have achieved great progress in academia and industry. But they have become computational and memory intensive with the increase of network depth. Previous designs seek breakthroughs in software and hardware levels to mitigate these challenges. At the software level, neural network compression techniques have effectively reduced network scale and energy consumption. However, the conventional compression algorithm is complex and energy intensive. At the hardware level, the improvements in the semiconductor process have effectively reduced power and energy consumption. However, it is difficult for the traditional Von-Neumann architecture to further reduce the power consumption, due to the memory wall and the end of Moore’s law. To overcome these challenges, the spintronic device based DNN machines have emerged for their non-volatility, ultra low power, and high energy efficiency. However, there is no spin-based design has achieved innovation at both the software and hardware level. Specifically, there is no systematic study of spin-based DNN architecture to deploy compressed networks. In our study, we present an ultra-efficient Spin-based Architecture for Compressed DNNs (SAC), to substantially reduce power consumption and energy consumption. Specifically, we propose a One-Step Compression algorithm (OSC) to reduce the computational complexity with minimum accuracy loss. We also propose a spin-based architecture to realize better performance for the compressed network. Furthermore, we introduce a novel computation flow that enables the reuse of activations and weights. Experimental results show that our study can reduce the computational complexity of compression algorithm from (mathcal {O}(Tk^3) ) to (mathcal {O}(k^2 log k) ) , and achieve 14 × ∼ 40 × compression ratio. Furthermore, our design can attain a 2 × enhancement in power efficiency and a 5 × improvement in computational efficiency compared to the Eyeriss. Our models are available at an anonymous link https://bit.ly/39cdtTa.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"11 10","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134991123","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Joongun Park, Seunghyo Kang, Sanghyeon Lee, Taehoon Kim, Jongse Park, Youngjin Kwon, Jaehyuk Huh
In cloud-based serverless computing, an application consists of multiple functions provided by mutually distrusting parties. For secure serverless computing, the hardware-based trusted execution environment (TEE) can provide strong isolation among functions. However, not only protecting each function from the host OS and other functions, but also protecting the host system from the functions, is critical for the security of the cloud servers. This emerging trusted serverless computing model poses new challenges: each TEE must be isolated from the host system bi-directionally, and the system calls from it must be validated. In addition, the resource utilization of each TEE must be accountable in a mutually trusted way. However, the current TEE model cannot efficiently represent such trusted serverless applications. To overcome the lack of such hardware support, this paper proposes an extended TEE model called Cloister, designed for trusted serverless computing. Cloister proposes four new key techniques. First, it extends the hardware-based memory isolation in SGX to confine a deployed function only within its TEE (enclave). Second, it proposes a trusted monitor enclave that filters and validates system calls from enclaves. Third, it provides a trusted resource accounting mechanism for enclaves that is agreeable to both service developers and cloud providers. Finally, Cloister accelerates enclave loading by redesigning its memory verification for fast function deployment. Using an emulated Intel SGX platform with the proposed extensions, this paper shows that trusted serverless applications can be effectively supported with small changes in the SGX hardware.
{"title":"Hardware Hardened Sandbox Enclaves for Trusted Serverless Computing","authors":"Joongun Park, Seunghyo Kang, Sanghyeon Lee, Taehoon Kim, Jongse Park, Youngjin Kwon, Jaehyuk Huh","doi":"10.1145/3632954","DOIUrl":"https://doi.org/10.1145/3632954","url":null,"abstract":"In cloud-based serverless computing, an application consists of multiple functions provided by mutually distrusting parties. For secure serverless computing, the hardware-based trusted execution environment (TEE) can provide strong isolation among functions. However, not only protecting each function from the host OS and other functions, but also protecting the host system from the functions, is critical for the security of the cloud servers. Such an emerging trusted serverless computing poses new challenges: each TEE must be isolated from the host system bi-directionally, and the system calls from it must be validated. In addition, the resource utilization of each TEE must be accountable in a mutually trusted way. However, the current TEE model cannot efficiently represent such trusted serverless applications. To overcome the lack of such hardware support, this paper proposes an extended TEE model called Cloister , designed for trusted serverless computing. Cloister proposes four new key techniques. First, it extends the hardware-based memory isolation in SGX to confine a deployed function only within its TEE (enclave). Second, it proposes a trusted monitor enclave that filters and validates system calls from enclaves. Third, it provides a trusted resource accounting mechanism for enclaves which is agreeable to both service developers and cloud providers. Finally, Cloister accelerates enclave loading by redesigning its memory verification for fast function deployment. Using an emulated Intel SGX platform with the proposed extensions, this paper shows that trusted serverless applications can be effectively supported with small changes in the SGX hardware.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"6 7","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134991851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Leveraging the SIMD capability of modern CPU architectures is mandatory to take full advantage of their increased performance. To exploit this capability, binary executables must be vectorized, either manually by developers or automatically by a tool. For this reason, the compilation research community has developed several strategies for transforming scalar code into a vectorized implementation. However, most existing automatic vectorization techniques in modern compilers are designed for regular codes, leaving irregular applications with non-contiguous data access patterns at a disadvantage. In this paper, we present a new tool, Autovesk, that automatically generates vectorized code from scalar code, specifically targeting irregular data access patterns. We describe how our method transforms a graph of scalar instructions into a vectorized one, using different heuristics to reduce the number or cost of instructions. Finally, we demonstrate the effectiveness of our approach on various computational kernels using Intel AVX-512 and ARM SVE. We compare the speedups of Autovesk-vectorized code over the automatic vectorization optimizations of GCC, Clang/LLVM, and Intel. We achieve competitive results on linear kernels and up to 11× speedups on irregular kernels.
{"title":"Autovesk: Automatic vectorized code generation from unstructured static kernels using graph transformations","authors":"Hayfa Tayeb, Ludovic Paillat, Bérenger Bramas","doi":"10.1145/3631709","DOIUrl":"https://doi.org/10.1145/3631709","url":null,"abstract":"Leveraging the SIMD capability of modern CPU architectures is mandatory to take full advantage of their increased performance. To exploit this capability, binary executables must be vectorized, either manually by developers or automatically by a tool. For this reason, the compilation research community has developed several strategies for transforming scalar code into a vectorized implementation. However, most existing automatic vectorization techniques in modern compilers are designed for regular codes, leaving irregular applications with non-contiguous data access patterns at a disadvantage. In this paper, we present a new tool, Autovesk, that automatically generates vectorized code from scalar code, specifically targeting irregular data access patterns. We describe how our method transforms a graph of scalar instructions into a vectorized one, using different heuristics to reduce the number or cost of instructions. Finally, we demonstrate the effectiveness of our approach on various computational kernels using Intel AVX-512 and ARM SVE. We compare the speedups of Autovesk vectorized code over GCC, Clang LLVM and Intel automatic vectorization optimizations. We achieve competitive results on linear kernels and up to 11x speedups on irregular kernels.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":" 42","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135191396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qubit mapping for NISQ superconducting quantum computers is essential to fidelity and resource utilization. Existing qubit mapping schemes face challenges such as crosstalk, SWAP overheads, and diverse device topologies, leading to qubit resource underutilization and low fidelity in computing results. This paper introduces QuCloud+, a new qubit mapping scheme that tackles these challenges with several new designs. (1) QuCloud+ supports single/multi-programming quantum computing on quantum chips with 2D/3D topologies. (2) QuCloud+ partitions physical qubits for concurrent quantum programs with a crosstalk-aware community detection technique and further allocates qubits according to qubit degree, improving fidelity and resource utilization. (3) QuCloud+ includes an X-SWAP mechanism that avoids SWAPs with high crosstalk errors and enables inter-program SWAPs to reduce SWAP overheads. (4) QuCloud+ schedules concurrent quantum programs to be mapped and executed based on estimated fidelity. Experimental results show that, compared with an existing representative multi-programming study [12], QuCloud+ achieves up to 9.03% higher fidelity and saves required SWAPs during mapping, reducing the number of inserted CNOT gates by 40.92%. Compared with a recent study [30] that enables post-mapping gate optimizations to further reduce gates, QuCloud+ reduces the post-mapping circuit depth by 21.91% while using a similar number of gates.
{"title":"QuCloud+: A Holistic Qubit Mapping Scheme for Single/Multi-programming on 2D/3D NISQ Quantum Computers","authors":"Lei Liu, Xinglei Dou","doi":"10.1145/3631525","DOIUrl":"https://doi.org/10.1145/3631525","url":null,"abstract":"Qubit mapping for NISQ superconducting quantum computers is essential to fidelity and resource utilization. The existing qubit mapping schemes meet challenges, e.g., crosstalk, SWAP overheads, diverse device topologies, etc., leading to qubit resource underutilization and low fidelity in computing results. This paper introduces QuCloud+, a new qubit mapping scheme that tackles these challenges. QuCloud+ has several new designs. (1) QuCloud+ supports single/multi-programming quantum computing on quantum chips with 2D/3D topology. (2) QuCloud+ partitions physical qubits for concurrent quantum programs with the crosstalk-aware community detection technique and further allocates qubits according to qubit degree, improving fidelity and resource utilization. (3) QuCloud+ includes an X-SWAP mechanism that avoids SWAPs with high crosstalk errors and enables inter-program SWAPs to reduce the SWAP overheads. (4) QuCloud+ schedules concurrent quantum programs to be mapped and executed based on estimated fidelity for the best practice. Experimental results show that, compared with the existing typical multi-programming study [12], QuCloud+ achieves up to 9.03% higher fidelity and saves on the required SWAPs during mapping, reducing the number of CNOT gates inserted by 40.92%. Compared with a recent study [30] that enables post-mapping gate optimizations to further reduce gates, QuCloud+ reduces the post-mapping circuit depth by 21.91% while using a similar number of gates.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"280 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135475092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vector architectures are widely employed in processors for neural networks, signal processing, and high-performance computing; however, their performance is limited by inefficient column-major memory access. The column-major access limitation originates from the unsuitable mapping of multidimensional data structures to two-dimensional vector memory spaces. In addition, the traditional data layout mapping method creates an irreconcilable conflict between row- and column-major accesses. Ideally, both row- and column-major accesses would take advantage of the bank parallelism of vector memory. To this end, we propose the Interleaved Data Layout (IDL) method for vector memory, which distributes vector elements into different banks regardless of whether they belong to a row- or column-major access pattern, so that any vector memory access can benefit from bank parallelism. Additionally, we propose an Extension Vector Memory (EVM) architecture to realize IDL in vector memory. EVM can support two data layout methods and vector memory access modes simultaneously. The key idea is to continuously distribute the data that needs to be accessed from the main memory to different banks during the loading period. Thus, EVM can provide greater spatial locality through careful programming and support from the extended ISA. The experimental results show that the proposed architecture achieves a 1.43× improvement over state-of-the-art vector processors, with an area cost of only 1.73%. Furthermore, energy consumption is reduced by 50.1%.
{"title":"Extension VM: Interleaved Data Layout in Vector Memory","authors":"Dunbo Zhang, Qingjie Lang, Ruoxi Wang, Li Shen","doi":"10.1145/3631528","DOIUrl":"https://doi.org/10.1145/3631528","url":null,"abstract":"While vector architecture is widely employed in processors for neural networks, signal processing, and high-performance computing; however, its performance is limited by inefficient column-major memory access. The column-major access limitation originates from the unsuitable mapping of multidimensional data structures to two-dimensional vector memory spaces. In addition, the traditional data layout mapping method creates an irreconcilable conflict between row- and column-major accesses. Ideally, both row- and column-major accesses can take advantage of the bank parallelism of vector memory. To this end, we propose the Interleaved Data Layout (IDL) method in vector memory, which can distribute vector elements into different banks regardless of whether they are in the row- or column major category, so that any vector memory access can benefit from bank parallelism. Additionally, we propose an Extension Vector Memory (EVM) architecture to achieve IDL in vector memory. EVM can support two data layout methods and vector memory access modes simultaneously. The key idea is to continuously distribute the data that needs to be accessed from the main memory to different banks during the loading period. Thus, EVM can provide a larger spatial locality level through careful programming and the extension ISA support. The experimental results showed a 1.43-fold improvement of state-of-the-art vector processors by the proposed architecture, with an area cost of only 1.73%. Furthermore, the energy consumption was reduced by 50.1%.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"79 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135480149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hybrid flash-based storage constructed with high-density, low-cost flash memory has become increasingly popular in consumer devices over the last decade due to its low cost. However, its poor reliability is one of the major concerns. To protect critical data and guarantee the user experience, some methods have been proposed to improve the reliability of consumer devices with non-hybrid flash storage. However, with the widespread use of hybrid storage, these methods result in severe problems, including significant performance and endurance degradation, because they do not consider the different characteristics of the flash memory in hybrid storage, e.g., performance, endurance, and access granularity. To address these problems, a critical data backup (CDB) design is proposed to ensure critical data reliability at a low cost. The basic idea is to first accumulate two copies of critical data in the fast memory to make full use of its performance and endurance. One copy is then migrated to the slow memory in stripes to avoid the write amplification caused by the different access granularities. By respecting the different characteristics of the flash memory in hybrid storage, CDB achieves encouraging performance and endurance improvements compared with the state-of-the-art. Furthermore, to avoid the performance and lifetime degradation caused by backup data occupying too much space in fast memory, CDB Pro is designed, integrating two advanced schemes. The first uses the pseudo-single-level-cell (pSLC) technique to turn part of the slow memory into high-performance space; with this additional high-performance space, data tend to be fully updated before being evicted to slow memory, generating more invalid data and reducing eviction costs. The second categorizes data into three types according to their life cycles; by placing data of the same type in the same block, eviction efficiency is improved. Both schemes improve device performance and lifetime on top of CDB. Experiments are conducted to prove the efficiency of CDB and CDB Pro. Experimental results show that, compared with the state-of-the-art, CDB ensures critical data reliability with lower device performance and lifetime loss, while CDB Pro diminishes the loss further.
{"title":"Critical Data Backup with Hybrid Flash-Based Consumer Devices","authors":"Longfei Luo, Dingcui Yu, Yina Lv, Liang Shi","doi":"10.1145/3631529","DOIUrl":"https://doi.org/10.1145/3631529","url":null,"abstract":"Hybrid flash-based storage constructed with high-density and low-cost flash memory has become increasingly popular in consumer devices in the last decade due to its low cost. However, its poor reliability is one of the major concerns. To protect critical data for guaranteeing user experience, some methods are proposed to improve the reliability of consumer devices with non-hybrid flash storage. However, with the widespread use of hybrid storage, these methods will result in severe problems, including significant performance and endurance degradation. This is caused by that the different characteristics of flash memory in hybrid storage are not considered, e.g., performance, endurance, and access granularity. To address the above problems, a critical data backup (CDB) design is proposed to ensure critical data reliability at a low cost. The basic idea is to accumulate two copies of critical data in the fast memory first to make full use of its performance and endurance. Then one copy will be migrated to the slow memory in the stripe to avoid the write amplification caused by different access granularity between them. By respecting the different characteristics of flash memory in hybrid storage, CDB can achieve encouraging performance and endurance improvement compared with the state-of-the-art. Furthermore, to avoid performance and lifetime degradation caused by the backup data occupying too much space of fast memory, CDB Pro is designed. Two advanced schemes are integrated. One is making use of the pseudo-single-level-cell (pSLC) technique to make a part of slow memory become high-performance. By supplying some high-performance space, data will be fully updated before being evicted to slow memory. More invalid data are generated which reduces eviction costs. Another is to categorize data into three types according to their different life cycles. By putting the same type of data in a block, the eviction efficiency is improved. Therefore, both of them can improve device performance and lifetime based on CDB. Experiments are conducted to prove the efficiency of CDB and CDB Pro. Experimental results show that compared with the state-of-the-arts, CDB can ensure critical data reliability with lower device performance and lifetime loss while CDB Pro can diminish the loss further.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"25 9","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135634735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
To meet the high-performance requirements of safety-critical real-time tasks, many-core platforms with high parallelism are widely used, where a network-on-chip (NoC) is generally employed for inter-core communication due to its scalability and high efficiency. Unfortunately, NoCs suffer from large uncertainties arising both from the highly parallel architecture and from the distributed scheduling strategy (e.g., wormhole flow control), which makes response-time upper bounds hard to estimate (i.e., they are either unsafe or pessimistic). To solve this problem for DAG-based real-time parallel tasks, we propose DAG-Order, an order-based dynamic DAG scheduling approach that strictly guarantees NoC real-time services. First, rather than building a new analysis to fit the widely used best-effort wormhole NoC, DAG-Order is built upon an advanced low-latency NoC with single-cycle long-range traversal (SLT), which avoids unpredictable parallel transmission on the shared source-destination links of wormhole NoCs. Second, DAG-Order is a non-preemptive dynamic scheduling strategy that jointly considers communication and computation workloads and fits the SLT NoC. With such an order-based dynamic scheduling strategy, provably safe bounds are ensured by enforcing certain order constraints among DAG edges/vertices that eliminate execution-timing anomalies at runtime. Third, the order constraints are further relaxed for higher average-case runtime performance without compromising bound safety. Finally, an effective heuristic algorithm that seeks a proper schedule order is developed to tighten the bounds. Experiments on synthetic and realistic benchmarks demonstrate that DAG-Order performs better than state-of-the-art related scheduling methods.
{"title":"DAG-Order: An Order-Based Dynamic DAG Scheduling for Real-Time Networks-on-Chip","authors":"Peng Chen, Hui Chen, Weichen Liu, Linbo Long, Wanli Chang, Nan Guan","doi":"10.1145/3631527","DOIUrl":"https://doi.org/10.1145/3631527","url":null,"abstract":"With the high-performance requirement of safety-critical real-time tasks, the platforms of many-core processors with high parallelism are widely utilized, where network-on-chip (NoC) is generally employed for inter-core communication due to its scalability and high efficiency. Unfortunately, large uncertainties are suffered on NoCs from both the overly parallel architecture and the distributed scheduling strategy (e.g., wormhole flow control), which complicates the response time upper bounds estimation (i.e., either unsafe or pessimistic). For DAG-based real-time parallel tasks, to solve this problem, we propose DAG-Order, an order-based dynamic DAG scheduling approach, which strictly guarantees NoC real-time services. Firstly, rather than build the new analysis to fit the widely-used best-effort wormhole NoC, DAG-Order is built upon a kind of advanced low-latency NoC with s ingle-cycle l ong-range t raversal (SLT) to avoid the unpredictable parallel transmission on the shared source-destination link of wormhole NoCs. Secondly, DAG-Order is a non-preemptive dynamic scheduling strategy, which jointly considers communication as well as computation workloads, and fits SLT NoC. With such an order-based dynamic scheduling strategy, the provably bound safety is ensured by enforcing certain order constraints among DAG edges/vertices that eliminate the execution-timing anomaly at runtime. Thirdly, the order constraints are further relaxed for higher average-case runtime performance without compromising bound safety. Finally, an effective heuristic algorithm seeking a proper schedule order is developed to tighten the bounds. Experiments on synthetic and realistic benchmarks demonstrate that DAG-Order performs better than the state-of-the-art related scheduling methods.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"42 11","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135818860","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhang Jiang, Ying Chen, Xiaoli Gong, Jin Zhang, Wenwen Wang, Pen-Chung Yew
Code-reuse attacks have the capability to craft malicious instructions from small code fragments, commonly referred to as “gadgets.” These gadgets are generated by JIT (Just-In-Time) engines as integral components of native instructions, with the flexibility to be embedded in various fields, including the Displacement field. In this paper, we introduce a novel approach for potential gadget insertion, achieved through the manipulation of ModR/M and SIB bytes via JavaScript code. This manipulation influences a JIT engine’s register allocation and code generation algorithms. These newly generated gadgets do not rely on constants and thus evade existing constant blinding schemes. Furthermore, they can be combined with 1-byte constants, a combination that proves to be challenging to defend against using conventional constant blinding techniques. To showcase the feasibility of our approach, we provide proof-of-concept (POC) code for three distinct types of gadgets. Our research underscores the potential for attackers to exploit ModR/M and SIB bytes within JIT-generated native instructions. In response, we propose a practical defense mechanism to mitigate such attacks. We introduce JiuJITsu, a security-enhanced register allocation scheme designed to prevent harmful register assignments during the JIT code generation phase, thereby thwarting the generation of these malicious gadgets. We conduct a comprehensive analysis of JiuJITsu’s effectiveness in defending against code-reuse attacks. Our findings demonstrate that it incurs a runtime overhead of under 1% when evaluated using JetStream2 benchmarks and real-world websites.
{"title":"JiuJITsu: Removing Gadgets with Safe Register Allocation for JIT Code Generation","authors":"Zhang Jiang, Ying Chen, Xiaoli Gong, Jin Zhang, Wenwen Wang, Pen-Chung Yew","doi":"10.1145/3631526","DOIUrl":"https://doi.org/10.1145/3631526","url":null,"abstract":"Code-reuse attacks have the capability to craft malicious instructions from small code fragments, commonly referred to as ”gadgets.” These gadgets are generated by JIT (Just-In-Time) engines as integral components of native instructions, with the flexibility to be embedded in various fields, including Displacement . In this paper, we introduce a novel approach for potential gadget insertion, achieved through the manipulation of ModR/M and SIB bytes via JavaScript code. This manipulation influences a JIT engine’s register allocation and code generation algorithms. These newly generated gadgets do not rely on constants and thus evade existing constant blinding schemes. Furthermore, they can be combined with 1-byte constants, a combination that proves to be challenging to defend against using conventional constant blinding techniques. To showcase the feasibility of our approach, we provide proof-of-concept (POC) code for three distinct types of gadgets. Our research underscores the potential for attackers to exploit ModR/M and SIB bytes within JIT-generated native instructions. In response, we propose a practical defense mechanism to mitigate such attacks. We introduce JiuJITsu , a security-enhanced register allocation scheme designed to prevent harmful register assignments during the JIT code generation phase, thereby thwarting the generation of these malicious gadgets. We conduct a comprehensive analysis of JiuJITsu ’s effectiveness in defending against code-reuse attacks. Our findings demonstrate that it incurs a runtime overhead of under 1% when evaluated using JetStream2 benchmarks and real-world websites.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"43 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135819012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep neural network inference has become a vital workload for many systems, from edge-based computing to data centers. To reduce the performance and power requirements of DNNs running on these systems, pruning is commonly used as a way to maintain most of the accuracy while significantly reducing the workload. Unfortunately, accelerators designed for unstructured pruning typically employ expensive methods to either determine non-zero activation-weight pairings or reorder computation. These methods require additional storage and memory accesses compared to the more regular data access patterns seen in structurally pruned models. However, even existing works that focus on the more regular access patterns of structured pruning suffer from inefficient designs that either ignore or expensively handle activation sparsity, leading to low performance. To address these inefficiencies, we leverage structured pruning and propose the multiply-and-fire (MnF) technique, which tackles these problems in three ways: (a) a novel event-driven dataflow that naturally exploits activation sparsity without complex, high-overhead logic; (b) an optimized, activation-centric dataflow that maximizes the reuse of activation data in computation and ensures the data are fetched only once from off-chip global and on-chip local memory; and (c) an energy-efficient, high-performance sparsity-aware DNN accelerator built on the proposed event-driven dataflow. Our results show that the MnF accelerator achieves a significant improvement across a number of modern benchmarks and presents a new direction for enabling highly efficient AI inference for both CNN and MLP workloads. Overall, this work achieves a geometric-mean 11.2× higher energy efficiency and a 1.41× speedup compared to a state-of-the-art sparsity-aware accelerator.
{"title":"Multiply-and-Fire (MnF): An Event-driven Sparse Neural Network Accelerator","authors":"Miao Yu, Tingting Xiang, Venkata Pavan Kumar Miriyala, Trevor E. Carlson","doi":"10.1145/3630255","DOIUrl":"https://doi.org/10.1145/3630255","url":null,"abstract":"Deep neural network inference has become a vital workload for many systems, from edge-based computing to data centers. To reduce the performance and power requirements for DNNs running on these systems, pruning is commonly used as a way to maintain most of the accuracy of the system while significantly reducing the workload requirements. Unfortunately, accelerators designed for unstructured pruning typically employ expensive methods to either determine non-zero activation-weight pairings or reorder computation. These methods require additional storage and memory accesses compared to the more regular data access patterns seen in structurally pruned models. However, even existing works that focus on the more regular access patterns seen in structured pruning continue to suffer from inefficient designs, which either ignore or expensively handle activation sparsity leading to low performance. To address these inefficiencies, we leverage structured pruning and propose the multiply-and-fire (MnF) technique, which aims to solve these problems in three ways: (a) the use of a novel event-driven dataflow that naturally exploits activation sparsity without complex, high-overhead logic; (b) an optimized dataflow takes an activation-centric approach, which aims to maximize the reuse of activation data in computation and ensures the data are only fetched once from off-chip global and on-chip local memory; (c) Based on the proposed event-driven dataflow, we develop an energy-efficient, high-performance sparsity-aware DNN accelerator. Our results show that our MnF accelerator achieves a significant improvement across a number of modern benchmarks and presents a new direction to enable highly efficient AI inference for both CNN and MLP workloads. Overall, this work achieves a geometric mean of 11.2 × higher energy efficiency and 1.41 × speedup compared to a state-of-the-art sparsity-aware accelerator.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"7 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136318164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}