Journal of Systems Architecture最新文献

英文中文

GTA: Generating high-performance tensorized program with dual-task scheduling

IF 3.7 2区计算机科学 Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

Journal of Systems Architecture

Pub Date : 2025-02-07 DOI: 10.1016/j.sysarc.2025.103359

Anxing Xie , Yonghua Hu , Yaohua Wang , Zhe Li , Yuxiang Gao , Zenghua Cheng

Generating high-performance tensorized programs for deep learning accelerators (DLAs) is crucial for ensuring the efficient execution of deep neural networks. But, producing such programs for different operators across various DLAs is notoriously challenging. Existing methods utilize hardware abstraction to represent acceleration intrinsics, enabling end-to-end automated exploration of the intrinsics mapping space. However, their limited search space and inefficient exploration strategies often result in suboptimal tensorized programs and significant search time overhead.

In this paper, we propose GTA, a framework designed to generate high-performance tensorized programs for DLAs. Unlike existing deep learning compilers, we first coordinate intrinsic-based mapping abstraction with rule-based program generation strategy, followed by the application of resource-constrained rules to eliminate ineffective tensor program candidates from the search space. Second, we employ a dual-task scheduling strategy to allocate tuning resources across multiple subgraphs of deep neural networks and their mapping candidates. As a result, GTA can find high-performance tensor programs that are outside the search space of existing state-of-the-art methods. Our experiments show that GTA achieves an average speedup of more than 1.88

\times

over AMOS and 2.29

\times

over Ansor on NVIDIA GPU with Tensor Core, as well as 1.49

\times

over Ansor and 2.76

\times

over PyTorch on CPU with AVX512.

{"title":"GTA: Generating high-performance tensorized program with dual-task scheduling","authors":"Anxing Xie , Yonghua Hu , Yaohua Wang , Zhe Li , Yuxiang Gao , Zenghua Cheng","doi":"10.1016/j.sysarc.2025.103359","DOIUrl":"10.1016/j.sysarc.2025.103359","url":null,"abstract":"<div><div>Generating high-performance tensorized programs for deep learning accelerators (DLAs) is crucial for ensuring the efficient execution of deep neural networks. But, producing such programs for different operators across various DLAs is notoriously challenging. Existing methods utilize hardware abstraction to represent acceleration intrinsics, enabling end-to-end automated exploration of the intrinsics mapping space. However, their limited search space and inefficient exploration strategies often result in suboptimal tensorized programs and significant search time overhead.</div><div>In this paper, we propose GTA, a framework designed to generate high-performance tensorized programs for DLAs. Unlike existing deep learning compilers, we first coordinate intrinsic-based mapping abstraction with rule-based program generation strategy, followed by the application of resource-constrained rules to eliminate ineffective tensor program candidates from the search space. Second, we employ a dual-task scheduling strategy to allocate tuning resources across multiple subgraphs of deep neural networks and their mapping candidates. As a result, GTA can find high-performance tensor programs that are outside the search space of existing state-of-the-art methods. Our experiments show that GTA achieves an average speedup of more than 1.88<span><math><mo>×</mo></math></span> over AMOS and 2.29<span><math><mo>×</mo></math></span> over Ansor on NVIDIA GPU with Tensor Core, as well as 1.49<span><math><mo>×</mo></math></span> over Ansor and 2.76<span><math><mo>×</mo></math></span> over PyTorch on CPU with AVX512.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"160 ","pages":"Article 103359"},"PeriodicalIF":3.7,"publicationDate":"2025-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143376982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Electric vehicle charging network security: A survey

IF 3.7 2区计算机科学 Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

Journal of Systems Architecture

Pub Date : 2025-02-01 DOI: 10.1016/j.sysarc.2025.103337

Xin Hu , XiaoNing Jiang , Jie Zhang , ShouGuang Wang , MengChu Zhou , Bei Zhang , ZhiGang Gan , BinXiao Yu

With the rising number of electric vehicles, their charging network has expanded into a vast and intricate system involving multiple stakeholders and numerous charging service entities. It relies on various technologies, such as communication protocols, network connections, and data transmission, leading to the generation of substantial amounts of sensitive data. Ensuring its high security requires the implementation of a comprehensive array of security measures, including stringent identity authentication and access control techniques, cryptographic solutions to safeguard data, and diverse strategies for detecting and countering network attacks. In this article, we present a comprehensive overview of the mainstream architectures and communication protocols in electric vehicle charging networks. We also provide a brief introduction to other protocols used in charging networks and explore the charging infrastructure and its associated security requirements. We examine a charging network’s security challenges at its perception, network, and application layers, and summarize the important solutions proposed in recent literature. Additionally, we delve into privacy security and payment security in the charging business. We investigate the utilization of blockchain technology to safeguard charging networks. Finally, we highlight public safety issues of a charging network and suggest future research directions to steer further studies of charging network security.

{"title":"Electric vehicle charging network security: A survey","authors":"Xin Hu , XiaoNing Jiang , Jie Zhang , ShouGuang Wang , MengChu Zhou , Bei Zhang , ZhiGang Gan , BinXiao Yu","doi":"10.1016/j.sysarc.2025.103337","DOIUrl":"10.1016/j.sysarc.2025.103337","url":null,"abstract":"<div><div>With the rising number of electric vehicles, their charging network has expanded into a vast and intricate system involving multiple stakeholders and numerous charging service entities. It relies on various technologies, such as communication protocols, network connections, and data transmission, leading to the generation of substantial amounts of sensitive data. Ensuring its high security requires the implementation of a comprehensive array of security measures, including stringent identity authentication and access control techniques, cryptographic solutions to safeguard data, and diverse strategies for detecting and countering network attacks. In this article, we present a comprehensive overview of the mainstream architectures and communication protocols in electric vehicle charging networks. We also provide a brief introduction to other protocols used in charging networks and explore the charging infrastructure and its associated security requirements. We examine a charging network’s security challenges at its perception, network, and application layers, and summarize the important solutions proposed in recent literature. Additionally, we delve into privacy security and payment security in the charging business. We investigate the utilization of blockchain technology to safeguard charging networks. Finally, we highlight public safety issues of a charging network and suggest future research directions to steer further studies of charging network security.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"159 ","pages":"Article 103337"},"PeriodicalIF":3.7,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143133624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Distrusting cores by separating computation from isolation

IF 3.7 2区计算机科学 Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

Journal of Systems Architecture

Pub Date : 2025-02-01 DOI: 10.1016/j.sysarc.2024.103328

Nils Asmussen, Till Miemietz, Sebastian Haas, Michael Roitzsch

Security mechanisms such as address spaces rely on the assumption that processor cores can be fully trusted. But the steady influx of side-channel vulnerabilities in processors is challenging this assumption. To minimize the impact of security vulnerabilities in processors, we need a system architecture that can tolerate potentially exploitable cores.

In this paper, we propose the untrusted core isolation model to protect critical computation on trusted cores from untrusted and potentially buggy cores. We survey how current architectural building blocks such as MMUs fall short of this goal and derive requirements for untrusted core isolation. To demonstrate its feasibility, we discuss both changes to commodity platforms and show how research works such as

fulfill the requirements. We evaluate the security benefits via a qualitative comparison of current architectures in both industry and academia and study its costs by a quantitative comparison of the most promising approaches on off-the-shelf and FPGA-based platforms.

引用次数: 0

LicenseNet: Proactively safeguarding intellectual property of AI models through model license

IF 3.7 2区计算机科学 Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

Journal of Systems Architecture

Pub Date : 2025-02-01 DOI: 10.1016/j.sysarc.2025.103330

Peihao Li , Jie Huang , Shuaishuai Zhang

With the widespread adoption of AI models in edge computing systems, these high-value models face significant risks of theft, misuse, and tampering due to the lower security and reliability of edge devices compared to the cloud. The leakage of models can result in substantial financial losses and security threats, making the protection of intellectual property (IP) crucial. Existing watermark-based IP verification techniques fail to proactively prevent infringement, while other active IP protection solutions often suffer from high overhead, low performance, and inadequate security. This paper proposes LicenseNet, an AI model IP protection framework based on licenses, which enables authorized access to models by embedding license features within them. We design a gradient optimization-based method for synchronizing license training with model parameters and introduce a random perturbation-based data standardization technique. This allows the trained model to generate correct inferences for license data while producing confusing results for original data, thus enhancing the security of the model on edge devices. Additionally, to enhance the model’s resistance against fine-tuning attacks, a supervised discrimination mechanism is incorporated. Experimental results demonstrate that LicenseNet achieves higher security, reduced performance loss, and an improvement in resistance to fine-tuning attacks by at least 29.03% compared to existing methods in edge computing environments.

{"title":"LicenseNet: Proactively safeguarding intellectual property of AI models through model license","authors":"Peihao Li , Jie Huang , Shuaishuai Zhang","doi":"10.1016/j.sysarc.2025.103330","DOIUrl":"10.1016/j.sysarc.2025.103330","url":null,"abstract":"<div><div>With the widespread adoption of AI models in edge computing systems, these high-value models face significant risks of theft, misuse, and tampering due to the lower security and reliability of edge devices compared to the cloud. The leakage of models can result in substantial financial losses and security threats, making the protection of intellectual property (IP) crucial. Existing watermark-based IP verification techniques fail to proactively prevent infringement, while other active IP protection solutions often suffer from high overhead, low performance, and inadequate security. This paper proposes LicenseNet, an AI model IP protection framework based on licenses, which enables authorized access to models by embedding license features within them. We design a gradient optimization-based method for synchronizing license training with model parameters and introduce a random perturbation-based data standardization technique. This allows the trained model to generate correct inferences for license data while producing confusing results for original data, thus enhancing the security of the model on edge devices. Additionally, to enhance the model’s resistance against fine-tuning attacks, a supervised discrimination mechanism is incorporated. Experimental results demonstrate that LicenseNet achieves higher security, reduced performance loss, and an improvement in resistance to fine-tuning attacks by at least 29.03% compared to existing methods in edge computing environments.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"159 ","pages":"Article 103330"},"PeriodicalIF":3.7,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143133685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

DBDAA: Dual blockchain and decentralized identifiers assisted anonymous authentication for building IoT

IF 3.7 2区计算机科学 Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

Journal of Systems Architecture

Pub Date : 2025-02-01 DOI: 10.1016/j.sysarc.2025.103334

Xiaohua Wu , Zheng Luo , Jun Cheng , Puyan Wang

The rapid advancements in the Internet of Things (IoT) and blockchain have made building IoT a critical element in modern intelligent building management. However, with the increasing adaptability of building IoT systems, concerns about data security and privacy have risen. Traditional solutions often rely on a single user identity, which poses security risks. Additionally, these solutions overlook the security of user permission information in access control systems. As a result, users with elevated permissions become prime targets for attacks. Existing key management mechanisms also have shortcomings, especially in key recovery processes. To address these issues, this paper proposes an identity authentication mechanism leveraging dual-blockchain and decentralized identifier (DID) technology. In the proposed mechanism, we develop a lightweight authentication scheme for devices. We also create a multi-DID model for users and further propose an anonymous authentication scheme, to safeguard data privacy, identity, and permission information, striking a balance between anonymity and oversight. Moreover, we design a key backup and recovery scheme to safeguard against key loss or damage, increasing system reliability. Experimental results demonstrate that our scheme enhances security while reducing computational and communication overhead.

{"title":"DBDAA: Dual blockchain and decentralized identifiers assisted anonymous authentication for building IoT","authors":"Xiaohua Wu , Zheng Luo , Jun Cheng , Puyan Wang","doi":"10.1016/j.sysarc.2025.103334","DOIUrl":"10.1016/j.sysarc.2025.103334","url":null,"abstract":"<div><div>The rapid advancements in the Internet of Things (IoT) and blockchain have made building IoT a critical element in modern intelligent building management. However, with the increasing adaptability of building IoT systems, concerns about data security and privacy have risen. Traditional solutions often rely on a single user identity, which poses security risks. Additionally, these solutions overlook the security of user permission information in access control systems. As a result, users with elevated permissions become prime targets for attacks. Existing key management mechanisms also have shortcomings, especially in key recovery processes. To address these issues, this paper proposes an identity authentication mechanism leveraging dual-blockchain and decentralized identifier (DID) technology. In the proposed mechanism, we develop a lightweight authentication scheme for devices. We also create a multi-DID model for users and further propose an anonymous authentication scheme, to safeguard data privacy, identity, and permission information, striking a balance between anonymity and oversight. Moreover, we design a key backup and recovery scheme to safeguard against key loss or damage, increasing system reliability. Experimental results demonstrate that our scheme enhances security while reducing computational and communication overhead.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"159 ","pages":"Article 103334"},"PeriodicalIF":3.7,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143133681","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Accelerating tensor multiplication by exploring hybrid product with hardware and software co-design

IF 3.7 2区计算机科学 Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

Journal of Systems Architecture

Pub Date : 2025-02-01 DOI: 10.1016/j.sysarc.2025.103333

Zhiyuan Zhang, Zhihua Fan, Wenming Li, Yuhang Qiu, Zhen Wang, Xiaochun Ye, Dongrui Fan, Xuejun An

Tensor multiplication holds a pivotal position in numerous applications. The existing accelerators predominantly rely on inner or outer products for their computational strategies, yet these methodologies encounter obstacles such as excessive storage overhead, underutilization of parallelism, and merging costs. To tackle the challenges, we propose an acceleration technique that integrates a hybrid product approach with a tailored hardware. Our design can accommodate tensor multiplications of various scales, boasting exceptional scalability. First, we employ a hybrid product approach for tensor multiplications, strategically leveraging various methods – including inner, outer, and Hadamard products – to optimize different stages of submatrices computations. Second, we devise a dedicated architecture that seamlessly aligns with hybrid product, leveraging dataflow paradigm to map tensor multiplication efficiently onto the hardware. Third, we design a sliding-window partial reuse FIFO (SWFIFO), alongside a data reorder and scheduling unit to accelerate data retrieval. For general matrix multiplication (GEMM), our design demonstrates an average speedup of

17.62 \times

and 9.47% energy consumption over Nvidia’s V100 GPU. Furthermore, it surpasses Google’s TPU (size of 256 × 256) by an average of

3.76 \times

, TPUv2 (size of 128 × 128) by

3.19 \times

and Eyeriss by

3.8 \times

. When evaluated on eight neural network models, our design yields a performance boost of

2.89 \times

over TPU and

2.19 \times

over Eyeriss.

{"title":"Accelerating tensor multiplication by exploring hybrid product with hardware and software co-design","authors":"Zhiyuan Zhang, Zhihua Fan, Wenming Li, Yuhang Qiu, Zhen Wang, Xiaochun Ye, Dongrui Fan, Xuejun An","doi":"10.1016/j.sysarc.2025.103333","DOIUrl":"10.1016/j.sysarc.2025.103333","url":null,"abstract":"<div><div>Tensor multiplication holds a pivotal position in numerous applications. The existing accelerators predominantly rely on inner or outer products for their computational strategies, yet these methodologies encounter obstacles such as excessive storage overhead, underutilization of parallelism, and merging costs. To tackle the challenges, we propose an acceleration technique that integrates a hybrid product approach with a tailored hardware. Our design can accommodate tensor multiplications of various scales, boasting exceptional scalability. First, we employ a hybrid product approach for tensor multiplications, strategically leveraging various methods – including inner, outer, and Hadamard products – to optimize different stages of submatrices computations. Second, we devise a dedicated architecture that seamlessly aligns with hybrid product, leveraging dataflow paradigm to map tensor multiplication efficiently onto the hardware. Third, we design a sliding-window partial reuse FIFO (SWFIFO), alongside a data reorder and scheduling unit to accelerate data retrieval. For general matrix multiplication (GEMM), our design demonstrates an average speedup of <span><math><mrow><mn>17</mn><mo>.</mo><mn>62</mn><mo>×</mo></mrow></math></span> and 9.47% energy consumption over Nvidia’s V100 GPU. Furthermore, it surpasses Google’s TPU (size of 256 × 256) by an average of <span><math><mrow><mn>3</mn><mo>.</mo><mn>76</mn><mo>×</mo></mrow></math></span>, TPUv2 (size of 128 × 128) by <span><math><mrow><mn>3</mn><mo>.</mo><mn>19</mn><mo>×</mo></mrow></math></span> and Eyeriss by <span><math><mrow><mn>3</mn><mo>.</mo><mn>8</mn><mo>×</mo></mrow></math></span>. When evaluated on eight neural network models, our design yields a performance boost of <span><math><mrow><mn>2</mn><mo>.</mo><mn>89</mn><mo>×</mo></mrow></math></span> over TPU and <span><math><mrow><mn>2</mn><mo>.</mo><mn>19</mn><mo>×</mo></mrow></math></span> over Eyeriss.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"159 ","pages":"Article 103333"},"PeriodicalIF":3.7,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143133682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Adapter-guided knowledge transfer for heterogeneous federated learning

IF 3.7 2区计算机科学 Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

Journal of Systems Architecture

Pub Date : 2025-02-01 DOI: 10.1016/j.sysarc.2025.103338

Shichong Liu , Haozhe Jin , Zhiwei Tang , Rui Zhai , Ke Lu , Junyang Yu , Chenxi Bai

Federated learning (FL) aims to collaboratively train a global model or multiple local models on decentralized data. Most existing FL approaches focus on addressing statistical heterogeneity among clients, often overlooking the challenge of model heterogeneity. To address both statistical and model heterogeneity issues, we propose FedAKT, a novel model-heterogeneous personalized federated learning (MHPFL) approach. First, to facilitate cross-client knowledge transfer, our method adds a small homogeneous adapter for each client. Second, we introduce a feature-based mutual distillation (FMD) mechanism, which promotes bidirectional knowledge exchange in local models. Third, a header dual-use (HDU) mechanism is proposed, enabling each local model’s header to effectively learn feature information from different perspectives. Extensive experiments on the CIFAR10, CIFAR-100, and Tiny-ImageNet datasets demonstrate the superiority of FedAKT compared to advanced baselines.

引用次数: 0

Semi-clairvoyant scheduling in non-preemptive fixed-priority mixed-criticality systems

IF 3.7 2区计算机科学 Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

Journal of Systems Architecture

Pub Date : 2025-02-01 DOI: 10.1016/j.sysarc.2025.103332

Yi-Wen Zhang, Chen Ouyang

Non-preemptive real-time scheduling of mixed-criticality systems (MCSs) in which tasks have different levels of criticality has gained increasing attention. In this paper, we propose schedulability tests specifically designed for the imprecise mixed-criticality task model, which is the first effort for non-preemptive fixed-priority semi-clairvoyant scheduling (NPFP-SC). It schedules tasks by a non-preemptive fixed-priority scheme while the system transition becomes apparent upon the arrival of high-criticality tasks. We next propose the energy-aware scheduling algorithm based on NPFP-SC schedulability tests, called EA-NPFP-SC, to solve the energy problem of MCSs. Our experimental results indicate that the NPFP-SC algorithm outperforms existing methods in terms of schedulability ratio by 17.33%, while the EA-NPFP-SC algorithm achieves a 40.18% reduction in energy consumption compared to the NPFP-SC algorithm.

引用次数: 0

An energy-efficient near-data processing accelerator for DNNs to optimize memory accesses

IF 3.7 2区计算机科学 Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

Journal of Systems Architecture

Pub Date : 2025-02-01 DOI: 10.1016/j.sysarc.2024.103320

Bahareh Khabbazan, Mohammad Sabri, Marc Riera, Antonio González

The constant growth of DNNs makes them challenging to implement and run efficiently on traditional compute-centric architectures. Some accelerators have attempted to add more compute units and on-chip buffers to solve the memory wall problem without much success, and sometimes even worsening the issue since more compute units also require higher memory bandwidth. Prior works have proposed the design of memory-centric architectures based on the Near-Data Processing (NDP) paradigm. NDP seeks to break the memory wall by moving the computations closer to the memory hierarchy, reducing the data movements and their cost as much as possible. The 3D-stacked memory is especially appealing for DNN accelerators due to its high-density/low-energy storage and near-memory computation capabilities to perform the DNN operations massively in parallel. However, memory accesses remain as the main bottleneck for running modern DNNs efficiently.

To improve the efficiency of DNN inference we present QeiHaN, a hardware accelerator that implements a 3D-stacked memory-centric weight storage scheme to take advantage of a logarithmic quantization of activations. In particular, since activations of FC and CONV layers of modern DNNs are commonly represented as powers of two with negative exponents, QeiHaN performs an implicit in-memory bit-shifting of the DNN weights to reduce memory activity. Only the meaningful bits of the weights required for the bit-shift operation are accessed. Overall, QeiHaN reduces memory accesses by 25% compared to a standard memory organization. We evaluate QeiHaN on a popular set of DNNs. On average, QeiHaN provides

4.3 x

speedup and

3.5 x

energy savings over a Neurocube-like accelerator.

{"title":"An energy-efficient near-data processing accelerator for DNNs to optimize memory accesses","authors":"Bahareh Khabbazan, Mohammad Sabri, Marc Riera, Antonio González","doi":"10.1016/j.sysarc.2024.103320","DOIUrl":"10.1016/j.sysarc.2024.103320","url":null,"abstract":"<div><div>The constant growth of DNNs makes them challenging to implement and run efficiently on traditional compute-centric architectures. Some accelerators have attempted to add more compute units and on-chip buffers to solve the memory wall problem without much success, and sometimes even worsening the issue since more compute units also require higher memory bandwidth. Prior works have proposed the design of memory-centric architectures based on the Near-Data Processing (NDP) paradigm. NDP seeks to break the memory wall by moving the computations closer to the memory hierarchy, reducing the data movements and their cost as much as possible. The 3D-stacked memory is especially appealing for DNN accelerators due to its high-density/low-energy storage and near-memory computation capabilities to perform the DNN operations massively in parallel. However, memory accesses remain as the main bottleneck for running modern DNNs efficiently.</div><div>To improve the efficiency of DNN inference we present QeiHaN, a hardware accelerator that implements a 3D-stacked memory-centric weight storage scheme to take advantage of a logarithmic quantization of activations. In particular, since activations of FC and CONV layers of modern DNNs are commonly represented as powers of two with negative exponents, QeiHaN performs an implicit in-memory bit-shifting of the DNN weights to reduce memory activity. Only the meaningful bits of the weights required for the bit-shift operation are accessed. Overall, QeiHaN reduces memory accesses by 25% compared to a standard memory organization. We evaluate QeiHaN on a popular set of DNNs. On average, QeiHaN provides <span><math><mrow><mn>4</mn><mo>.</mo><mn>3</mn><mi>x</mi></mrow></math></span> speedup and <span><math><mrow><mn>3</mn><mo>.</mo><mn>5</mn><mi>x</mi></mrow></math></span> energy savings over a Neurocube-like accelerator.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"159 ","pages":"Article 103320"},"PeriodicalIF":3.7,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143133687","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Input/mapping precision controllable digital CIM with adaptive adder tree architecture for flexible DNN inference

IF 3.7 2区计算机科学 Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

Journal of Systems Architecture

Pub Date : 2025-02-01 DOI: 10.1016/j.sysarc.2024.103327

Juhong Park, Johnny Rhe, Chanwook Hwang, Jaehyeon So, Jong Hwan Ko

Digital compute-in-memory (CIM) systems, known for their precise computations, have emerged as a viable solution for real-time deep neural network (DNN) inference. However, traditional digital CIM systems often suffer from suboptimal array utilization due to static multi-bit input/mapping dataflows and inflexible adder tree structures, which do not adequately accommodate the diverse computational demands of DNNs. In this paper, we introduce a novel digital CIM architecture that dynamically redistributes bit precisions across the input and mapping domains according to computational load and data precision, thereby improving array utilization and energy efficiency. For supporting flexible bit configurations, the system incorporates an adaptive adder tree with the integrated bit-shift logic. To minimize potential overhead introduced by the bit-shiftable adder tree, we also propose a grouping algorithm that efficiently executes shift and add operations. Simulation results show that our proposed methods not only improve array utilization but also significantly accelerate computation speed, achieving up to a 10.46

\times

speedup compared to traditional methods.

{"title":"Input/mapping precision controllable digital CIM with adaptive adder tree architecture for flexible DNN inference","authors":"Juhong Park, Johnny Rhe, Chanwook Hwang, Jaehyeon So, Jong Hwan Ko","doi":"10.1016/j.sysarc.2024.103327","DOIUrl":"10.1016/j.sysarc.2024.103327","url":null,"abstract":"<div><div>Digital compute-in-memory (CIM) systems, known for their precise computations, have emerged as a viable solution for real-time deep neural network (DNN) inference. However, traditional digital CIM systems often suffer from suboptimal array utilization due to static multi-bit input/mapping dataflows and inflexible adder tree structures, which do not adequately accommodate the diverse computational demands of DNNs. In this paper, we introduce a novel digital CIM architecture that dynamically redistributes bit precisions across the input and mapping domains according to computational load and data precision, thereby improving array utilization and energy efficiency. For supporting flexible bit configurations, the system incorporates an adaptive adder tree with the integrated bit-shift logic. To minimize potential overhead introduced by the bit-shiftable adder tree, we also propose a grouping algorithm that efficiently executes shift and add operations. Simulation results show that our proposed methods not only improve array utilization but also significantly accelerate computation speed, achieving up to a 10.46<span><math><mo>×</mo></math></span> speedup compared to traditional methods.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"159 ","pages":"Article 103327"},"PeriodicalIF":3.7,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143133735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

下一页尾页

类型

全部化学•材料生命科学医学物理工程技术环境•农林材料科学地球科学法学管理学化学环境科学与生态学计算机科学教育学经济学农林科学人文科学生物学数学物理与天体物理心理学综合性期刊其他工业工程理学历史学农学文学信息工程

数据库

全部 ACS Publications Elsevier ieeexplore Springer The Royal Society of Chemistry Wiley

期刊

Journal of Systems Architecture

全部 Acc. Chem. Res. ACS Applied Bio Materials ACS Appl. Electron. Mater. ACS Appl. Energy Mater. ACS Appl. Mater. Interfaces ACS Appl. Nano Mater. ACS Appl. Polym. Mater. ACS BIOMATER-SCI ENG ACS Catal. ACS Cent. Sci. ACS Chem. Biol. ACS Chemical Health & Safety ACS Chem. Neurosci. ACS Comb. Sci. ACS Earth Space Chem. ACS Energy Lett. ACS Infect. Dis. ACS Macro Lett. ACS Mater. Lett. ACS Med. Chem. Lett. ACS Nano ACS Omega ACS Photonics ACS Sens. ACS Sustainable Chem. Eng. ACS Synth. Biol. Anal. Chem. BIOCHEMISTRY-US Bioconjugate Chem. BIOMACROMOLECULES Chem. Res. Toxicol. Chem. Rev. Chem. Mater. CRYST GROWTH DES ENERG FUEL Environ. Sci. Technol. Environ. Sci. Technol. Lett. Eur. J. Inorg. Chem. IND ENG CHEM RES Inorg. Chem. J. Agric. Food. Chem. J. Chem. Eng. Data J. Chem. Educ. J. Chem. Inf. Model. J. Chem. Theory Comput. J. Med. Chem. J. Nat. Prod. J PROTEOME RES J. Am. Chem. Soc. LANGMUIR MACROMOLECULES Mol. Pharmaceutics Nano Lett. Org. Lett. ORG PROCESS RES DEV ORGANOMETALLICS J. Org. Chem. J. Phys. Chem. J. Phys. Chem. A J. Phys. Chem. B J. Phys. Chem. C J. Phys. Chem. Lett. Analyst Anal. Methods Biomater. Sci. Catal. Sci. Technol. Chem. Commun. Chem. Soc. Rev. CHEM EDUC RES PRACT CRYSTENGCOMM Dalton Trans. Energy Environ. Sci. ENVIRON SCI-NANO ENVIRON SCI-PROC IMP ENVIRON SCI-WAT RES Faraday Discuss. Food Funct. Green Chem. Inorg. Chem. Front. Integr. Biol. J. Anal. At. Spectrom. J. Mater. Chem. A J. Mater. Chem. B J. Mater. Chem. C Lab Chip Mater. Chem. Front. Mater. Horiz. MEDCHEMCOMM Metallomics Mol. Biosyst. Mol. Syst. Des. Eng. Nanoscale Nanoscale Horiz. Nat. Prod. Rep. New J. Chem. Org. Biomol. Chem. Org. Chem. Front. PHOTOCH PHOTOBIO SCI PCCP Polym. Chem.

﹀