Group theory-based differential evolution algorithm for efficient DAG scheduling on heterogeneous clustered multi-core system (Journal of Systems Architecture, Vol. 172, Article 103695)
Pub Date: 2026-01-05 | DOI: 10.1016/j.sysarc.2026.103695
Yaodong Guo, Shuangshuang Chang, Dong Ji, Shiyue Qin, Te Xu
Efficient parallel application scheduling algorithms are crucial for optimizing performance on heterogeneous clustered multi-core systems. The primary objective of scheduling is to reduce the makespan of parallel applications, typically represented as Directed Acyclic Graphs (DAGs). This paper introduces a Group Theory-based Differential Evolution (GTDE) algorithm to address the NP-complete DAG scheduling problem, aiming to minimize both makespan and computation time. The GTDE algorithm leverages group theory to exploit the inherent symmetry in system architectures, enabling the classification of scheduling schemes and thus reducing redundant computations while maintaining population diversity. To further enhance performance, the algorithm employs an Opposition-Based Learning (OBL) strategy to improve the initial population and integrates a hybrid mutation strategy for more efficient exploration of the solution space. Experimental results demonstrate that the GTDE algorithm consistently outperforms state-of-the-art DAG scheduling algorithms on performance metrics such as makespan and computation time, with average improvements of 36% and 73%, respectively, achieving superior performance across various scenarios.
{"title":"Group theory-based differential evolution algorithm for efficient DAG scheduling on heterogeneous clustered multi-core system","authors":"Yaodong Guo, Shuangshuang Chang, Dong Ji, Shiyue Qin, Te Xu","doi":"10.1016/j.sysarc.2026.103695","DOIUrl":"10.1016/j.sysarc.2026.103695","url":null,"abstract":"<div><div>Efficient parallel application scheduling algorithms are crucial for optimizing performance on heterogeneous clustered multi-core systems. The primary objective of scheduling is to reduce the makespan of parallel applications, typically represented as Directed Acyclic Graphs (DAGs). This paper introduces a Group Theory-based Differential Evolution (GTDE<span><span><sup>1</sup></span></span>) algorithm to address the NP-complete DAG scheduling problem, to minimize makespan and computation time. The GTDE algorithm leverages group theory to explore the inherent symmetry in system architectures, enabling the classification of scheduling schemes and thus reducing redundant computations while maintaining population diversity. To further enhance performance, the algorithm employs an Opposition-Based Learning (OBL) strategy to improve the initial population and integrates a hybrid mutation strategy for more efficient exploration of the solution space. Experimental results demonstrate that the GTDE algorithm consistently outperforms state-of-the-art DAG scheduling algorithms in terms of performance metrics, such as makespan and computation time, with average improvements of 36% and 73%, respectively, achieving superior performance across various scenarios.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"172 ","pages":"Article 103695"},"PeriodicalIF":4.1,"publicationDate":"2026-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145927173","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A co-optimization framework toward energy-efficient cloud–edge inference with stochastic computing and precision-compensating NAS (Journal of Systems Architecture, Vol. 172, Article 103693)
Pub Date: 2026-01-05 | DOI: 10.1016/j.sysarc.2026.103693
Jihe Wang , Mengchao Zhang , Huijuan Duan , Kuizhi Mei , Danghui Wang , Meikang Qiu
The advancement of edge and cloud computing technologies presents a significant challenge in intelligent computing: efficiently allocating computational tasks between edge devices and the cloud to leverage their respective resource advantages and optimize overall system performance. A core difficulty lies in balancing accuracy, energy efficiency, and latency, particularly given the resource-constrained nature of edge devices compared to the powerful computational capabilities of the cloud. To address this challenge, we propose a co-optimization framework for energy-efficient cloud–edge inference based on stochastic computing. In this framework, the frontend employs stochastic computing (SC) alongside a search for optimal bit-width and layer count to achieve a lightweight design and reduce power consumption. The backend utilizes neural architecture search (NAS) to optimize accuracy. A joint optimization framework holistically balances power consumption, latency, and accuracy to enhance overall system performance. Experimental results indicate that the frontend power consumption is reduced by approximately 35% compared to conventional binary computing methods. The co-optimization framework maintains near-baseline accuracy with only 0.2% degradation, while achieving an energy efficiency ratio more than 1.55 times greater and a power-delay product (PDP) between 0.77 and 0.92 times that of the original binary computing.
{"title":"A co-optimization framework toward energy-efficient cloud–edge inference with stochastic computing and precision-compensating NAS","authors":"Jihe Wang , Mengchao Zhang , Huijuan Duan , Kuizhi Mei , Danghui Wang , Meikang Qiu","doi":"10.1016/j.sysarc.2026.103693","DOIUrl":"10.1016/j.sysarc.2026.103693","url":null,"abstract":"<div><div>The advancement of edge and cloud computing technologies presents a significant challenge in intelligent computing: efficiently allocating computational tasks between edge devices and the cloud to leverage their respective resource advantages and optimize overall system performance. A core difficulty lies in balancing accuracy, energy efficiency, and latency, particularly given the resource-constrained nature of edge devices compared to the powerful computational capabilities of the cloud. To address this challenge, we propose a co-optimization framework for energy-efficient cloud–edge inference based on stochastic computing. In this framework, the frontend employs stochastic computing (SC) alongside a search for optimal bit-width and layer count to achieve a lightweight design and reduce power consumption. The backend utilizes neural architecture search (NAS) to optimize accuracy. A joint optimization framework holistically balances power consumption, latency, and accuracy to enhance overall system performance. Experimental results indicate that the frontend power consumption is reduced by approximately 35% compared to conventional binary computing methods. The co-optimization framework maintains near-baseline accuracy with only 0.2% degradation, while achieving an energy efficiency ratio more than 1.55 times greater and a power-delay product (PDP) between 0.77 and 0.92 times that of the original binary computing.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"172 ","pages":"Article 103693"},"PeriodicalIF":4.1,"publicationDate":"2026-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145927098","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accelerating language giants: A survey of optimization strategies for LLM inference on hardware platforms (Journal of Systems Architecture, Vol. 172, Article 103690)
Pub Date: 2026-01-03 | DOI: 10.1016/j.sysarc.2026.103690
Young Chan Kim , Seok Kyu Yoon , Sung Soo Han, Chae Won Park, Jun Oh Park, Jun Ha Ko, Hyun Kim
With the emergence of transformer-based models that have demonstrated remarkable performance in natural language processing tasks, large language models (LLMs) built upon the transformer architecture and trained on massive datasets have achieved outstanding results in various tasks such as translation and summarization. Among these, decoder-only LLMs have garnered significant attention due to their superior few-shot and zero-shot capabilities compared to other architectures. Motivated by their exceptional performance, numerous efforts have been made to deploy decoder-only LLMs on diverse hardware platforms. However, the substantial computational and memory demands during both training and inference pose considerable challenges for resource-constrained hardware. Although efficient architectural designs have been proposed to address these issues, LLM inference continues to require excessive computational and memory resources. Consequently, extensive research has been conducted to compress model components and enhance inference efficiency across different hardware platforms. To further accelerate the inherently repetitive computations of LLMs, a variety of approaches have been introduced, integrating operator-level optimizations within Transformer blocks and system-level optimizations at the granularity of repeated Transformer block execution. This paper surveys recent research on decoder-only LLM inference acceleration, categorizing existing approaches based on optimization levels specific to each hardware platform. Building on this classification, we provide a comprehensive analysis of prior decoder-only LLM acceleration techniques from multiple perspectives.
{"title":"Accelerating language giants: A survey of optimization strategies for LLM inference on hardware platforms","authors":"Young Chan Kim , Seok Kyu Yoon , Sung Soo Han, Chae Won Park, Jun Oh Park, Jun Ha Ko, Hyun Kim","doi":"10.1016/j.sysarc.2026.103690","DOIUrl":"10.1016/j.sysarc.2026.103690","url":null,"abstract":"<div><div>With the emergence of transformer-based models that have demonstrated remarkable performance in natural language processing tasks, large language models (LLMs) built upon the transformer architecture and trained on massive datasets have achieved outstanding results in various tasks such as translation and summarization. Among these, decoder-only LLMs have garnered significant attention due to their superior few-shot and zero-shot capabilities compared to other architectures. Motivated by their exceptional performance, numerous efforts have been made to deploy decoder-only LLMs on diverse hardware platforms. However, the substantial computational and memory demands during both training and inference pose considerable challenges for resource-constrained hardware. Although efficient architectural designs have been proposed to address these issues, LLM inference continues to require excessive computational and memory resources. Consequently, extensive research has been conducted to compress model components and enhance inference efficiency across different hardware platforms. To further accelerate the inherently repetitive computations of LLMs, a variety of approaches have been introduced, integrating operator-level optimizations within Transformer blocks and system-level optimizations at the granularity of repeated Transformer block execution. This paper surveys recent research on decoder-only LLM inference acceleration, categorizing existing approaches based on optimization levels specific to each hardware platform. Building on this classification, we provide a comprehensive analysis of prior decoder-only LLM acceleration techniques from multiple perspectives.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"172 ","pages":"Article 103690"},"PeriodicalIF":4.1,"publicationDate":"2026-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145927172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MetaDTS: Distribution difference-based adaptive test input selection for Deep Neural Networks (Journal of Systems Architecture, Vol. 172, Article 103682)
Pub Date: 2026-01-03 | DOI: 10.1016/j.sysarc.2025.103682
Xiang Su , Zhibin Yang , Qu Liu , Hao Liu , Yong Zhou , Zhiqiu Huang
Deep Neural Networks (DNNs) are widely used in various safety-critical domains. Due to data distribution shifts, DNNs may exhibit unforeseen faults that lead to serious safety risks. Revealing potential faults within DNNs requires a large amount of labeled test data, but the labeling process is time-consuming. To solve this problem, test input selection methods improve efficiency by selecting a subset of test inputs that are more likely to reveal DNN model faults. However, existing methods often focus solely on single data distribution characteristics and overlook the complex differences and diversity among test inputs. In this paper, we propose MetaDTS, a distribution difference-based adaptive test input selection method that comprehensively assesses the feature and probability distribution differences of test inputs. MetaDTS employs a meta-model to estimate the probability that each test input is misclassified and selects inputs accordingly. To effectively capture the complex differences among test inputs, we introduce two novel uncertainty metrics: Feature Distribution Difference (FDD) and Probability Distribution Difference (PDD). By integrating these metrics, MetaDTS adaptively selects test inputs that can reveal diverse faults of DNN models. We conducted extensive experiments on five datasets and seven models, comparing MetaDTS with nine baseline methods. The results demonstrate that MetaDTS significantly outperforms the baseline methods in selecting test inputs with high fault-revealing capability, supporting model optimization, and enhancing test input diversity.
{"title":"MetaDTS: Distribution difference-based adaptive test input selection for Deep Neural Networks","authors":"Xiang Su , Zhibin Yang , Qu Liu , Hao Liu , Yong Zhou , Zhiqiu Huang","doi":"10.1016/j.sysarc.2025.103682","DOIUrl":"10.1016/j.sysarc.2025.103682","url":null,"abstract":"<div><div>Deep Neural Networks (DNNs) are widely used in various safety-critical domains. Due to data distribution shifts, DNNs may exhibit unforeseen faults that lead to serious safety risks. To reveal potential faults within DNNs, a vast number of labeled test data are required, but the labeling process is time-consuming. To solve this problem, test input selection methods improve efficiency by selecting a subset of test inputs more likely to reveal DNN model faults. However, existing methods often focus solely on single data distribution characteristics and overlook the complex differences and diversity among test inputs. In this paper, we propose MetaDTS, a distribution difference-based adaptive test input selection method that comprehensively assesses the feature and probability distribution differences of test inputs. MetaDTS employs a meta-model to evaluate the probability of misclassification of test inputs and select them. To effectively capture the complex differences among test inputs, we introduce two novel uncertainty metrics: Feature Distribution Difference (FDD) and Probability Distribution Difference (PDD). By integrating these metrics, MetaDTS adaptively selects test inputs that can reveal diverse faults of DNN models. We conducted extensive experiments on five datasets and seven models, comparing MetaDTS with nine baseline methods. The results demonstrate that MetaDTS significantly outperforms the baseline methods in selecting test inputs with high fault-revealing capability, model optimization, and enhancing test inputs diversity.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"172 ","pages":"Article 103682"},"PeriodicalIF":4.1,"publicationDate":"2026-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145927097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A survey on data management for Out-of-Core GNN systems (Journal of Systems Architecture, Vol. 173, Article 103684)
Pub Date: 2026-01-02 | DOI: 10.1016/j.sysarc.2026.103684
Junchi Ren , Chao Li , Xiaowei Chen , Yao Chen , Zehao Chen , Qian Wei , Zhaoyan Shen
Graph neural networks (GNNs) have emerged as a powerful model for their effectiveness in learning over graphs, with broad applications in domains such as biology, e-commerce, and materials science. With the rapid growth of real-world graphs, efficient data management in GNNs has become a formidable challenge. Out-of-core (OOC) GNN systems, a representative solution, leverage external storage (CPU memory and SSD) to enable large-scale graph training on a single machine. Based on the scale of the graph data, OOC GNN systems adopt different levels of storage extension and can be classified into two categories: semi OOC and fully OOC systems. However, the optimization details for both categories of OOC GNN systems remain only preliminarily understood. To address this gap, we provide a comprehensive survey of existing optimization techniques for semi OOC and fully OOC systems from the perspective of data management. We decompose the data management mechanisms into three layers: data storage, data organization, and data transfer, where data storage refers to the placement of graph data on disk, data organization pertains to the adaptive strategy that decides where to place graph data across different memory hierarchies during training, and data transfer concerns the I/O path of data movement between these storage layers. For each layer, we discuss the key challenges, review the corresponding optimization strategies proposed in existing OOC GNN systems, and analyze their advantages and limitations. Furthermore, we outline future research directions for data management of OOC GNN systems.
{"title":"A survey on data management for Out-of-Core GNN systems","authors":"Junchi Ren , Chao Li , Xiaowei Chen , Yao Chen , Zehao Chen , Qian Wei , Zhaoyan Shen","doi":"10.1016/j.sysarc.2026.103684","DOIUrl":"10.1016/j.sysarc.2026.103684","url":null,"abstract":"<div><div>Graph neural networks (GNNs) have emerged as a powerful model for their effectiveness in learning over graphs, with broad applications in domains such as biology, e-commerce, and materials science. With the rapid growth of real-world graphs, efficient data management in GNNs has become a formidable challenge. Out-of-core (OOC) GNN system, as a representative solution, leverages external storage (CPU memory and SSD) to enable large-scale graph training on a single machine. Based on the scale of graph data, OOC GNN systems adopt different levels of storage extension and can be classified into two categories, semi OOC and fully OOC systems. However, the optimization details for both categories of OOC GNN systems remain only preliminarily understood. To address this gap, we provide a comprehensive survey of existing optimization techniques for semi OOC and fully OOC systems from the perspective of data management. We decompose the data management mechanisms into three layers: data storage, data organization, and data transfer, where data storage refers to the placement of graph data on disk, data organization pertains to the adaptive strategy that decides where to place graph data across different memory hierarchies during training, and data transfer concerns the I/O path of data movement between these storage layers. For each layer, we discuss the key challenges, review the corresponding optimization strategies proposed in existing OOC GNN systems, and analyze their advantages and limitations. Furthermore, we outline future research directions for data management of OOC GNN systems.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"173 ","pages":"Article 103684"},"PeriodicalIF":4.1,"publicationDate":"2026-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145981640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An efficient privacy-preserving transformer inference scheme for cloud-based intelligent decision-making in AIoT (Journal of Systems Architecture, Vol. 172, Article 103687)
Pub Date: 2026-01-02 | DOI: 10.1016/j.sysarc.2026.103687
Mingshun Luo , Haolei He , Wenti Yang , Shuai Yuan , Zhitao Guan
The Artificial Intelligence of Things (AIoT) is transforming modern society by combining the data-collection capabilities of IoT devices with the inference power of cloud-based large language models (LLMs). However, transmitting sensitive data to the cloud for intelligent decision-making raises significant privacy concerns. Cryptographic techniques such as homomorphic encryption (HE) and secure multi-party computation (MPC) provide promising solutions for privacy-preserving inference. However, existing schemes primarily target small-scale models and are inefficient when applied to Transformer-based LLMs, which involve large-scale matrix multiplications, complex non-linear functions, and deep model architectures. To address these challenges, we propose an efficient privacy-preserving Transformer inference scheme for cloud-based AIoT scenarios. Our framework integrates HE and MPC to ensure data confidentiality while minimizing computational and communication overhead. We design a fast HE-based matrix multiplication protocol using an offline-online collaborative pipeline and single instruction multiple data (SIMD)-based packing rules. Furthermore, we develop an accurate and efficient MPC-based non-linear function evaluation protocol using optimized piecewise polynomial approximation and integer-fraction decomposition. Experimental results show that our approach achieves 8.3×–91.6× faster matrix multiplication, 1.4×–19× faster non-linear function evaluation, and a 3.5×–137.9× reduction in communication overhead in the LAN setting, while maintaining lossless accuracy, thus enabling secure and scalable intelligent decision-making in AIoT environments.
{"title":"An efficient privacy-preserving transformer inference scheme for cloud-based intelligent decision-making in AIoT","authors":"Mingshun Luo , Haolei He , Wenti Yang , Shuai Yuan , Zhitao Guan","doi":"10.1016/j.sysarc.2026.103687","DOIUrl":"10.1016/j.sysarc.2026.103687","url":null,"abstract":"<div><div>The Artificial Intelligence of Things (AIoT) is transforming modern society by combining the data-collection capabilities of IoT devices with the inference power of cloud-based large language models (LLMs). However, transmitting sensitive data to the cloud for intelligent decision-making raises significant privacy concerns. Cryptographic techniques such as homomorphic encryption (HE) and secure multi-party computation (MPC) provide promising solutions for privacy-preserving inference. However, existing schemes primarily target small-scale models and are inefficient when applied to Transformer-based LLMs, which involve large-scale matrix multiplications and complex non-linear functions, and deep model architectures. To address these challenges, we propose an efficient privacy-preserving Transformer inference scheme for cloud-based AIoT scenarios. Our framework integrates HE and MPC to ensure data confidentiality while minimizing computational and communication overhead. We design a fast HE-based matrix multiplication protocol using an offline-online collaborative pipeline and single instruction multiple data (SIMD)-based packing rules. Furthermore, we develop an accurate and efficient MPC-based non-linear function evaluation protocol using optimized piecewise polynomial approximation and integer-fraction decomposition. Experimental results show that our approach achieves 8.3<span><math><mo>×</mo></math></span>–91.6<span><math><mo>×</mo></math></span> faster in matrix multiplication, 1.4<span><math><mo>×</mo></math></span>–19<span><math><mo>×</mo></math></span> faster in non-linear function evaluation, and 3.5<span><math><mo>×</mo></math></span>–137.9<span><math><mo>×</mo></math></span> reduction in communication overhead with the LAN network, while maintaining lossless accuracy, thus enabling secure and scalable intelligent decision-making in AIoT environments.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"172 ","pages":"Article 103687"},"PeriodicalIF":4.1,"publicationDate":"2026-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145927095","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards privacy preservation in smart grids via controlled redactable signatures (Journal of Systems Architecture, Vol. 172, Article 103688)
Pub Date: 2026-01-02 | DOI: 10.1016/j.sysarc.2026.103688
Siyuan Shen , Xiaoying Jia , Min Luo , Zhiyan Xu , Zhichao Zhou
The smart grid provides a flexible interactive platform for energy stakeholders by sharing energy usage data to enhance management efficiency and service accuracy. However, such data are often highly sensitive and vulnerable to eavesdropping and tampering during transmission. Ensuring data authenticity, integrity and users’ privacy is therefore critical. Redactable signatures have emerged as a promising cryptographic primitive to address these concerns. Nonetheless, most existing redactable signature schemes lack fine-grained control over the redaction process, making them susceptible to unauthorized or malicious modifications. To address this issue, we propose an identity-based Controlled Redactable Signature Scheme (CRSS), enabling users to selectively disclose information under controlled conditions without revealing private information. We define a formal security model and prove that the proposed scheme achieves unforgeability, redaction controllability, privacy, and transparency. Furthermore, theoretical analysis and experimental evaluation demonstrate that our scheme offers superior efficiency and practicality compared to existing approaches.
{"title":"Towards privacy preservation in smart grids via controlled redactable signatures","authors":"Siyuan Shen , Xiaoying Jia , Min Luo , Zhiyan Xu , Zhichao Zhou","doi":"10.1016/j.sysarc.2026.103688","DOIUrl":"10.1016/j.sysarc.2026.103688","url":null,"abstract":"<div><div>The smart grid provides a flexible interactive platform for energy stakeholders by sharing energy usage data to enhance management efficiency and service accuracy. However, such data are often highly sensitive and vulnerable to eavesdropping and tampering during transmission. Ensuring data authenticity, integrity and users’ privacy is therefore critical. Redactable signatures have emerged as a promising cryptographic primitive to address these concerns. Nonetheless, most existing redactable signature schemes lack fine-grained control over the redaction process, making them susceptible to unauthorized or malicious modifications. To address this issue, we propose an identity-based Controlled Redactable Signature Scheme (CRSS), enabling users to selectively disclose information under controlled conditions without revealing private information. We define a formal security model and prove that the proposed scheme achieves unforgeability, redaction controllability, privacy, and transparency. Furthermore, theoretical analysis and experimental evaluation demonstrate that our scheme offers superior efficiency and practicality compared to existing approaches.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"172 ","pages":"Article 103688"},"PeriodicalIF":4.1,"publicationDate":"2026-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145927096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Covert feature-space adversarial perturbation using natural evolution strategies in distributed deep learning (Journal of Systems Architecture, Vol. 172, Article 103691)
Pub Date: 2026-01-02 | DOI: 10.1016/j.sysarc.2026.103691
Arash Golabi , Abdelkarim Erradi , Ahmed Bensaid , Abdulla Al-Ali , Uvais Qidwai
Distributed Deep Learning (DDL) partitions deep neural networks across multiple devices, enhancing efficiency in large-scale inference tasks. However, this segmentation exposes intermediate-layer feature maps to new security vulnerabilities, expanding the attack surface beyond traditional input-level threats. This work investigates an adaptation of Natural Evolution Strategies (NES), named NES with Random Uniform Perturbation (NES-RUP), for adversarial manipulation of intermediate-layer feature maps in horizontally distributed inference systems. Instead of Gaussian-based perturbation sampling, the proposed method uses uniformly distributed noise and targets only a subset of feature map channels. This design improves stealth: uniform noise avoids extreme outliers and confines perturbations to bounded ranges, keeping activations close to their clean values and reducing the likelihood of anomaly detection, while also aligning with the performance and privacy constraints of AIoT-enabled smart environments. Extensive experiments on VGG16, ResNet50, and DeiT-Tiny (a Vision Transformer) using the CIFAR-10 and Mini-ImageNet datasets demonstrate that the adapted NES method achieves high misclassification rates with minimal feature-level distortion, preserving the statistical characteristics of natural feature activations. Furthermore, it successfully bypasses common defenses such as low-pass filtering and feature map anomaly detection (e.g., PseudoNet), revealing critical vulnerabilities in collaborative inference. These findings underscore the need for dedicated defense strategies that address intermediate-layer threats in secure AIoT infrastructures.
{"title":"Covert feature-space adversarial perturbation using natural evolution strategies in distributed deep learning","authors":"Arash Golabi , Abdelkarim Erradi , Ahmed Bensaid , Abdulla Al-Ali , Uvais Qidwai","doi":"10.1016/j.sysarc.2026.103691","DOIUrl":"10.1016/j.sysarc.2026.103691","url":null,"abstract":"<div><div>Distributed Deep Learning (DDL) partitions deep neural networks across multiple devices, enhancing efficiency in large-scale inference tasks. However, this segmentation exposes intermediate-layer feature maps to new security vulnerabilities, expanding the attack surface beyond traditional input-level threats. This work investigates an adaptation of Natural Evolution Strategies (NES), named NES with Random Uniform Perturbation (NES-RUP), for adversarial manipulation of intermediate-layer feature maps in horizontally distributed inference systems. Instead of Gaussian-based perturbation sampling, the proposed method utilizes uniformly distributed noise and targets only a subset of feature map channels. This design improves stealth by using uniform noise distributions, which avoid extreme outliers and limit perturbations to bounded ranges, keeping activations closer to their clean values and thereby reducing anomaly detection likelihood, while also aligning with the performance and privacy constraints of AIoT-enabled smart environments. Extensive experiments on VGG16, ResNet50 and DeiT-Tiny (a Vision Transformer) using the CIFAR-10 and Mini-ImageNet datasets demonstrate that the adapted NES method achieves high misclassification rates with minimal feature-level distortion, preserving the statistical characteristics of natural feature activations. Furthermore, it successfully bypasses common defenses such as low-pass filtering and feature map anomaly detection (e.g., PseudoNet), revealing critical vulnerabilities in collaborative inference. These findings underscore the need for dedicated defense strategies that address intermediate-layer threats in secure AIoT infrastructures.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"172 ","pages":"Article 103691"},"PeriodicalIF":4.1,"publicationDate":"2026-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145927100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unmixing gradients: Uncovering persistent leakage in Federated Learning via Independent Component Analysis (Journal of Systems Architecture, Vol. 173, Article 103681)
Pub Date: 2025-12-31 | DOI: 10.1016/j.sysarc.2025.103681
Minhao Li , Le Wang , Zhaohua Li , Rongxin Hu , Tang Zhou , Binxing Fang
Federated Learning (FL), a cornerstone of privacy-preserving technology in the Artificial Intelligence of Things (AIoT) era, enables collaborative model training across edge devices via gradient sharing. However, Gradient Leakage Attacks (GLA) can reconstruct private data from these gradients, fundamentally undermining FL’s privacy guarantees. While recent Generation-based GLAs demonstrate significant promise in efficiency and quality, their core feature separation algorithms rely heavily on idealized properties of early-stage training. This dependency causes attack performance to decay rapidly as training progresses and fails in challenging scenarios, such as batches with duplicate labels. To overcome this, we reframe feature separation as a Blind Source Separation (BSS) problem, solved using Independent Component Analysis (ICA). Through a systematic analysis of the entire training lifecycle, we uncover the adversarial dynamics of key mathematical properties that govern BSS solvability. Based on these insights, we introduce a novel framework: ICA-driven Generative Attacks (ICA-GA). Extensive experiments show that ICA-GA significantly outperforms baselines throughout the training lifecycle and exhibits remarkable robustness against challenging conditions, including batches with full label duplication, FedAVG, and gradient compression. Furthermore, our incremental generator fine-tuning strategy reduces the marginal cost of continuous multi-round attacks by an order of magnitude, making such threats highly practical. Our work reveals that the privacy risk in FL is far more persistent and severe than previously understood. Our code is publicly available at https://github.com/liminhao-99/ICA-GA.
{"title":"Unmixing gradients: Uncovering persistent leakage in Federated Learning via Independent Component Analysis","authors":"Minhao Li , Le Wang , Zhaohua Li , Rongxin Hu , Tang Zhou , Binxing Fang","doi":"10.1016/j.sysarc.2025.103681","DOIUrl":"10.1016/j.sysarc.2025.103681","url":null,"abstract":"<div><div>Federated Learning (FL), a cornerstone of privacy-preserving technology in the Artificial Intelligence of Things (AIoT) era, enables collaborative model training across edge devices via gradient sharing. However, Gradient Leakage Attacks (GLA) can reconstruct private data from these gradients, fundamentally undermining FL’s privacy guarantees. While recent Generation-based GLAs demonstrate significant promise in efficiency and quality, their core feature separation algorithms rely heavily on idealized properties of early-stage training. This dependency causes attack performance to decay rapidly as training progresses and fails in challenging scenarios, such as batches with duplicate labels. To overcome this, we reframe feature separation as a Blind Source Separation (BSS) problem, solved using Independent Component Analysis (ICA). Through a systematic analysis of the entire training lifecycle, we uncover the adversarial dynamics of key mathematical properties that govern BSS solvability. Based on these insights, we introduce a novel framework: <strong>ICA-driven Generative Attacks (ICA-GA)</strong>. Extensive experiments show that ICA-GA significantly outperforms baselines throughout the training lifecycle and exhibits remarkable robustness against challenging conditions, including batches with full label duplication, FedAVG, and gradient compression. Furthermore, our incremental generator fine-tuning strategy reduces the marginal cost of continuous multi-round attacks by an order of magnitude, making such threats highly practical. Our work reveals that the privacy risk in FL is far more persistent and severe than previously understood. Our code is publicly available at <span><span>https://github.com/liminhao-99/ICA-GA</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"173 ","pages":"Article 103681"},"PeriodicalIF":4.1,"publicationDate":"2025-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145941384","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient column-wise N:M pruning on RISC-V CPU (Journal of Systems Architecture, Vol. 172, Article 103667)
Pub Date: 2025-12-26 | DOI: 10.1016/j.sysarc.2025.103667
Chi-Wei Chu, Ding-Yong Hong, Jan-Jan Wu
In deep learning frameworks, weight pruning is a widely used technique for improving computational efficiency by reducing the size of large models. This is especially critical for convolutional operators, which often act as performance bottlenecks in convolutional neural networks (CNNs). However, the effectiveness of pruning heavily depends on how it is implemented, as different methods can significantly impact both computational performance and memory footprint. In this work, we propose a column-wise N:M pruning strategy applied at the tile level and modify XNNPACK to enable efficient execution of pruned models on the RISC-V vector architecture. Additionally, we propose fusing the operations of im2col and data packing to minimize redundant memory accesses and memory overhead. To further optimize performance, we incorporate AITemplate’s profiling technique to identify the optimal implementation for each convolutional operator. Our proposed approach increases ResNet inference throughput by as much as 4×, and preserves ImageNet top-1 accuracy within 2.1% of the dense baseline. The code of our framework is publicly available at https://github.com/wewe5215/AI_template_RVV_backend
{"title":"Efficient column-wise N:M pruning on RISC-V CPU","authors":"Chi-Wei Chu, Ding-Yong Hong, Jan-Jan Wu","doi":"10.1016/j.sysarc.2025.103667","DOIUrl":"10.1016/j.sysarc.2025.103667","url":null,"abstract":"<div><div>In deep learning frameworks, weight pruning is a widely used technique for improving computational efficiency by reducing the size of large models. This is especially critical for convolutional operators, which often act as performance bottlenecks in convolutional neural networks (CNNs). However, the effectiveness of pruning heavily depends on how it is implemented, as different methods can significantly impact both computational performance and memory footprint. In this work, we propose a column-wise N:M pruning strategy applied at the tile level and modify XNNPACK to enable efficient execution of pruned models on the RISC-V vector architecture. Additionally, we propose fusing the operations of im2col and data packing to minimize redundant memory accesses and memory overhead. To further optimize performance, we incorporate AITemplate’s profiling technique to identify the optimal implementation for each convolutional operator. Our proposed approach effectively increases ResNet inference throughput by as much as <span><math><mrow><mn>4</mn><mo>×</mo></mrow></math></span>, and preserves ImageNet top-1 accuracy within 2.1% of the dense baseline. The code of our framework is publicly available at <span><span>https://github.com/wewe5215/AI_template_RVV_backend</span><svg><path></path></svg></span></div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"172 ","pages":"Article 103667"},"PeriodicalIF":4.1,"publicationDate":"2025-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145885132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}