
Journal of Systems Architecture: Latest Publications

HPA: Manipulating deep reinforcement learning via adversarial interaction
IF 4.1 | CAS Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2026-01-22 | DOI: 10.1016/j.sysarc.2026.103685
Kanghua Mo, Zhengxin Zhang, Yuanzhi Zhang, Yucheng Long, Zhengdao Li
Recent studies have demonstrated that policy manipulation attacks on deep reinforcement learning (DRL) systems can lead to the learning of abnormal policies by victim agents. However, existing work typically assumes that the attacker can manipulate multiple components of the training process, such as reward functions, environment dynamics, or state information. In IoT-enabled smart societies, where AI-driven systems operate in interconnected and data-sensitive environments, such assumptions raise serious concerns regarding security and privacy. This paper investigates a novel policy manipulation attack in competitive multi-agent reinforcement learning under significantly weaker assumptions, where the attacker only requires access to the victim’s training settings and, in some cases, the learned policy outputs during training. We propose the honeypot policy attack (HPA), in which an adversarial agent induces the victim to learn an attacker-specified target policy by deliberately taking suboptimal actions. To this end, we introduce a honeypot reward estimation mechanism that quantifies the amount of reward sacrifice required by the adversarial agent to influence the victim’s learning process, and adapts this sacrifice according to the degree of policy manipulation. Extensive experiments on three representative competitive games demonstrate that HPA is both effective and stealthy, exposing previously unexplored vulnerabilities in DRL-based systems deployed in IoT-driven smart environments. To the best of our knowledge, this work presents the first policy manipulation attack that does not rely on explicit tampering with internal components of DRL systems, but instead operates solely through admissible adversarial interactions, offering new insights into security challenges faced by emerging AIoT ecosystems.
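The honeypot reward estimation described above can be illustrated with a small sketch. Assuming a value-based adversary (the paper does not publish this interface; all function and variable names below are hypothetical), the adversary compares its best achievable return with the return of actions that steer the victim toward the attacker-specified target policy, and only sacrifices reward up to a budget scaled by the desired degree of manipulation.

```python
# Hypothetical sketch of a honeypot-style reward-sacrifice decision (names are
# illustrative, not the paper's API). The adversary gives up at most
# manipulation_degree * budget reward per step, and among the affordable
# actions picks the one that most strongly nudges the victim toward the
# attacker-specified target policy.
import numpy as np

def honeypot_action(adv_q_values, steering_scores, manipulation_degree, budget):
    """adv_q_values    : adversary's estimated return per action
    steering_scores    : how strongly each action pushes the victim toward the target policy
    manipulation_degree: float in [0, 1], how aggressively to manipulate
    budget             : maximum tolerated reward sacrifice per step"""
    best = adv_q_values.max()
    sacrifice = best - adv_q_values                      # reward given up by each action
    affordable = sacrifice <= manipulation_degree * budget   # greedy action is always affordable
    candidates = np.where(affordable, steering_scores, -np.inf)
    action = int(candidates.argmax())                    # strongest affordable nudge
    return action, float(sacrifice[action])

q = np.array([1.0, 0.6, 0.2])
steer = np.array([0.1, 0.9, 0.95])
print(honeypot_action(q, steer, manipulation_degree=0.5, budget=1.0))  # -> (1, 0.4)
```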
Citations: 0
GAS: A scheduling primitive dependency analysis-based cost model for tensor program optimization
IF 4.1 | CAS Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2026-01-21 | DOI: 10.1016/j.sysarc.2026.103721
Yonghua Hu, Anxing Xie, Yaohua Wang, Zhe Li, Zenghua Cheng, Junyang Tang
Automatically generating high-performance tensor programs has become a promising approach for deploying deep neural networks. A key challenge lies in designing an effective cost model to navigate the vast scheduling search space. Existing approaches typically fall into two categories, each with limitations: offline learning cost models rely on large pre-collected datasets, which may be incomplete or device-specific, and online learning cost models depend on handcrafted features, requiring substantial manual effort and expertise.
We propose GAS, a lightweight framework for generating tensor programs for deep learning applications. GAS reformulates feature extraction as a sequence-dependent analysis of scheduling primitives. Our cost model integrates three key factors to uncover performance-critical insights within scheduling sequences: (1) decision factors allocation, quantifying entropy and skewness of scheduling primitive factors to capture their dominance; (2) primitive contribution weights, measuring the relative impact of primitives on overall performance; and (3) structural semantic alignment, capturing correlations between scheduling primitive factors and hardware parallelism mechanisms. This approach reduces the complexity of handcrafted feature engineering and extensive pre-training datasets, significantly improving both efficiency and scalability. Experimental results on NVIDIA GPUs demonstrate that GAS achieves average speedups of 3.79× over AMOS and 2.22× over Ansor, while also consistently outperforming other state-of-the-art tensor compilers.
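As an illustration of the "decision factors allocation" component, the sketch below computes the entropy and skewness of a normalized tiling split, the kind of scheduling-primitive factor assignment the abstract describes. This is not the GAS implementation; the factor values and the exact skewness formula are illustrative assumptions.

```python
# Illustrative sketch: summarise a scheduling primitive's factor assignment,
# e.g. a tiling split such as [4, 8, 2, 16], by the Shannon entropy and the
# skewness of its normalised shares, so that dominant and balanced splits can
# be told apart by a cost model.
import math

def factor_features(factors):
    total = sum(factors)
    p = [f / total for f in factors]                       # normalised shares
    entropy = -sum(x * math.log(x) for x in p if x > 0)    # Shannon entropy of the shares
    mean = sum(p) / len(p)
    std = math.sqrt(sum((x - mean) ** 2 for x in p) / len(p))
    # Fisher skewness of the share distribution (0 for a perfectly balanced split).
    skew = 0.0 if std == 0 else sum((x - mean) ** 3 for x in p) / (len(p) * std ** 3)
    return entropy, skew

print(factor_features([4, 8, 2, 16]))   # one dominant factor -> lower entropy, positive skew
print(factor_features([8, 8, 8, 8]))    # balanced split      -> maximal entropy, zero skew
```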
Citations: 0
EdgeTrust-Shard: Hierarchical blockchain architecture for federated learning in cross-chain IoT ecosystems
IF 4.1 | CAS Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2026-01-21 | DOI: 10.1016/j.sysarc.2026.103701
Tuan-Dung Tran, Phuong-Dai Bui, Van-Hau Pham
Enabling AI-driven real-time distributed computing on the edge-cloud continuum requires overcoming a critical dependability challenge: resource-constrained IoT devices cannot participate in Byzantine-resilient federated learning due to a 1940-fold memory gap, with robust aggregation methods demanding 512MB–2GB while microcontrollers offer only 264KB SRAM. We present EdgeTrust-Shard, a novel system architecture designed for dependability, security, and scalability in edge AI. It enables real-time Byzantine-resilient federated learning on commodity microcontrollers by distributing computational complexity across the network topology. The framework’s contributions include optimal M = √N clustering for O(N) communication, a Multi-Factor Proof-of-Performance consensus mechanism providing quadratic Byzantine suppression with proven O(T^(-1/2)) convergence, and platform-optimized cryptography delivering a 3.4-fold speedup for real-time processing. A case study using a hybrid physical-simulation deployment demonstrates the system’s efficacy, achieving 93.9–94.7% accuracy across Byzantine attack scenarios at 30% adversary presence within a 140KB memory footprint on Raspberry Pi Pico nodes. By outperforming adapted state-of-the-art blockchain-FL systems like FedChain and BlockFL by up to 9.3 percentage points, EdgeTrust-Shard provides a critical security enhancement for the edge-cloud continuum, transforming passive IoT data sources into dependable participants in distributed trust computations for next-generation applications such as smart cities and industrial automation.
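A minimal sketch of the square-root clustering rule mentioned above: with N devices, roughly M = √N clusters are formed so that each round needs one report per device plus one upload per cluster head. The round-robin assignment and the message accounting are assumptions for illustration, not the paper's protocol.

```python
# Minimal sketch of sqrt(N) clustering for hierarchical aggregation.
# Function and variable names are illustrative.
import math

def sqrt_clustering(device_ids):
    n = len(device_ids)
    m = max(1, round(math.sqrt(n)))            # number of clusters, M = sqrt(N)
    clusters = [[] for _ in range(m)]
    for i, dev in enumerate(device_ids):
        clusters[i % m].append(dev)            # simple round-robin assignment
    # Messages per round: every device reports once to its cluster head,
    # every head reports once to the global aggregator.
    messages = n + m
    return clusters, messages

clusters, msgs = sqrt_clustering([f"pico-{i}" for i in range(100)])
print(len(clusters), msgs)                      # 10 clusters, 110 messages per round
```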
Citations: 0
GenMClass: Design and comparative analysis of genome classifier-on-chip platform
IF 4.1 | CAS Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2026-01-17 | DOI: 10.1016/j.sysarc.2026.103702
Daria Bromot, Yehuda Kra, Zuher Jahshan, Esteban Garzón, Adam Teman, Leonid Yavits
We propose GenMClass, a genome classification system-on-chip (SoC) implementing two different classification approaches and comprising two separate classification engines: GenDNN, a DNN accelerator that classifies DNA reads converted to images using a classification neural network, and ETCAM, a similarity-search-capable Error Tolerant Content Addressable Memory that classifies genomes by k-mer matching. Classification operations are controlled by an embedded RISC-V processor. The GenMClass classification platform was designed and manufactured in a commercial 65 nm process. We conduct a comparative analysis of ETCAM and GenDNN classification efficiency as well as their performance, silicon area, and power consumption using silicon measurements. The GenMClass SoC occupies 3.4 mm² and its total power consumption (assuming both GenDNN and ETCAM perform classification at the same time) is 144 mW. This allows GenMClass to be used as a portable classifier for pathogen surveillance during pandemics, food safety and environmental monitoring, and agricultural pathogen and antimicrobial resistance control, in the field or at points of care.
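The k-mer matching performed by the ETCAM engine can be sketched in plain software as set intersection between a read's k-mers and each reference genome's k-mers, with the read assigned to the class sharing the most k-mers. The sequences and k value below are toy data; the hardware similarity search with error tolerance is not modeled.

```python
# Toy sketch of k-mer based read classification (software only, no
# content-addressable memory and no error tolerance).
def kmers(seq, k=5):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def classify_read(read, references, k=5):
    read_kmers = kmers(read, k)
    # Count shared k-mers per reference class and return the best match.
    scores = {name: len(read_kmers & kmers(ref, k)) for name, ref in references.items()}
    return max(scores, key=scores.get), scores

references = {
    "pathogen_A": "ACGTACGTGGTTACGTAACG",
    "pathogen_B": "TTGGCCAATTGGCCAATTGG",
}
print(classify_read("ACGTACGTGGTTACGTA", references))   # -> pathogen_A with 13 shared k-mers
```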
Citations: 0
AIMD: AI-powered android malware detection for securing AIoT devices and networks using graph embedding and ensemble learning
IF 4.1 | CAS Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2026-01-16 | DOI: 10.1016/j.sysarc.2026.103707
Santosh K. Smmarwar, Rahul Priyadarshi, Pratik Angaitkar, Subodh Mishra, Rajkumar Singh Rathore
The rapid evolution of Artificial Intelligence of Things (AIoT) is accelerating the development of smart societies, where interconnected consumer electronics such as smartphones, IoT devices, smart meters, and surveillance systems play a crucial role in optimizing operational efficiency and service delivery. However, this hyper-connected digital ecosystem is increasingly vulnerable to sophisticated Android malware attacks that exploit system weaknesses, disrupt services, and compromise data privacy and integrity. These malware variants leverage advanced evasion techniques, including permission abuse, dynamic runtime manipulation, and memory-based obfuscation, rendering traditional detection methods ineffective. The key challenges in securing AIoT-driven smart societies include managing high-dimensional feature spaces, detecting dynamically evolving malware behaviours, and ensuring real-time classification performance. To address these issues, this paper proposes an AI-powered Android Malware Detection (AIMD) framework designed for AIoT-enabled smart society environments. The framework extracts multi-level features (permissions, intents, API calls, and obfuscated memory patterns) from Android APK files and employs graph embedding techniques (DeepWalk and Node2Vec) for dimensionality reduction. Feature selection is optimized using the Red Deer Algorithm (RDA), a metaheuristic approach, while classification is performed through an ensemble of machine learning models (Support Vector Machine, Decision Tree, Random Forest, Extra Trees) enhanced by bagging, boosting, stacking, and soft voting techniques. Experimental evaluations on CICInvesAndMal2019 and CICMalMem2022 datasets demonstrate the effectiveness of the proposed system, achieving malware detection accuracies of 98.78% and 99.99%, respectively. By integrating AI-driven malware detection into AIoT infrastructures, this research advances cybersecurity resilience, safeguarding smart societies against emerging threats in an increasingly connected world.
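A hedged sketch of the ensemble stage only, using scikit-learn's soft-voting combination of the four listed classifiers on synthetic data in place of the CICInvesAndMal2019 features; the graph-embedding and Red Deer feature-selection steps are not reproduced, and all parameters are illustrative.

```python
# Sketch of the soft-voting ensemble over SVM, decision tree, random forest
# and extra trees, trained on synthetic stand-in features.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=64, n_informative=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC(probability=True, random_state=0)),   # probabilities needed for soft voting
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("et", ExtraTreesClassifier(n_estimators=100, random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X_tr, y_tr)
print("benign/malware accuracy:", ensemble.score(X_te, y_te))
```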
Citations: 0
MECSim: A comprehensive simulation platform for multi-access edge computing
IF 4.1 | CAS Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2026-01-15 | DOI: 10.1016/j.sysarc.2026.103706
Akhirul Islam, Manojit Ghose
The rapid growth of CPU-intensive and latency-sensitive applications has intensified the need for efficient resource management within edge computing environments. While existing simulators such as iFogSim, EdgeCloudSim, and PureEdgeSim have contributed significantly to edge computing research, they lack comprehensive support for modeling modern hardware heterogeneity, energy-aware mechanisms, service providers’ economic models, dependent task modeling, and reliability-driven task management. This paper presents MECSim (multi-access edge computing simulator), an enhanced simulation framework that extends PureEdgeSim to enable realistic modeling of heterogeneous, cooperative, and fault-tolerant edge computing ecosystems. MECSim supports multi-data-center clusters along with dynamic voltage and frequency scaling (DVFS) capable user devices for energy-efficient operation. The framework further integrates dependent-task modeling, cost and profit evaluation for service providers, and reliability mechanisms via transient-failure simulation, caching, and task replication. We have implemented five state-of-the-art approaches, demonstrating the effectiveness of our simulation platform and building confidence in its practical utility to handle diverse system architectures. With its extensible architecture and comprehensive modeling capabilities, MECSim provides a promising platform for future research on energy-efficient, profit-driven, and fault-tolerant task offloading and scheduling in heterogeneous MEC environments. The results also demonstrate that MECSim achieves a 44.13% (on average) reduction in simulation time compared to EdgeCloudSim. In addition, we have conducted experiments using dispersion-aware metrics to quantify variability and stability across 50 independent runs, thereby enabling a more robust and reliable performance evaluation.
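The DVFS capability described above typically rests on a simple energy model: dynamic power scales as C * V^2 * f while execution time scales as 1/f, so lower voltage/frequency settings trade latency for energy. The sketch below applies such a model with made-up (uncalibrated) voltage/frequency pairs; it is not MECSim's internal model.

```python
# Sketch of a DVFS energy model for a simulated user device: classic CMOS
# dynamic power plus a static term, integrated over the task's execution time.
def task_energy(cycles, freq_hz, volt, capacitance=1e-9, static_power=0.05):
    time_s = cycles / freq_hz                         # execution time at this setting
    dyn_power = capacitance * volt ** 2 * freq_hz     # dynamic power ~ C * V^2 * f
    return time_s, (dyn_power + static_power) * time_s

dvfs_levels = [(1.0e9, 1.10), (0.6e9, 0.90), (0.3e9, 0.75)]   # illustrative (frequency, voltage)
for f, v in dvfs_levels:
    t, e = task_energy(cycles=3e8, freq_hz=f, volt=v)
    print(f"{f/1e9:.1f} GHz: {t*1e3:6.1f} ms, {e*1e3:6.2f} mJ")   # lower f -> slower but cheaper
```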
Citations: 0
Precision boundary modeling for area-efficient Block Floating Point accumulation
IF 4.1 | CAS Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2026-01-13 | DOI: 10.1016/j.sysarc.2026.103704
Jun He, Jing Feng, Xin Ju, Yasong Cao, Zhongdi Luo, Jianchao Yang, Jingkui Yang, Gang Li, Jian Cheng, Dong Chen, Mei Wen
Block Floating Point (BFP) is extensively employed for the low-precision quantization of deep network weights and activations to attain advantages in both hardware efficiency and performance. Nevertheless, when the precision of weights and activations is diminished to below 8 bits, the required high-precision floating-point accumulation becomes a dominant hardware bottleneck in the BFP processing element (PE). To address this challenge, we introduce a framework based on the Frobenius norm Retention Ratio (FnRR) to explore the precision boundaries for BFP accumulation, and extend it to a hierarchical chunk-based accumulation scheme. Comprehensive experiments across representative CNN and LLM models demonstrate that our predicted precision boundaries maintain performance closely matching FP32 baselines, while further precision reduction leads to substantial accuracy degradation, validating the effectiveness of our boundary determination. Guided by this analysis, we present a corresponding hardware for BFP computation. This design achieves 13.7%–25.2% improvements in area and power efficiency compared with FP32 accumulation under identical quantization settings, and delivers up to 10.3× area and 11.0× power reductions relative to conventional BFP implementations.
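One plausible reading of a Frobenius-norm retention ratio for BFP, shown below as an assumption rather than the paper's exact definition, is the ratio ||Q(X)||_F / ||X||_F between a block quantized with a shared exponent and a limited-width mantissa and the original block; values near 1 indicate the chosen precision retains most of the block's energy.

```python
# Illustrative sketch: a block of values shares one exponent and each value is
# stored as a signed mant_bits-bit integer times the block's LSB weight.
# FnRR here is the Frobenius-norm ratio of the quantised block to the original.
import numpy as np

def bfp_quantize(block, mant_bits):
    shared_exp = np.floor(np.log2(np.max(np.abs(block)) + 1e-30)) + 1   # shared block exponent
    scale = 2.0 ** (shared_exp - (mant_bits - 1))                       # LSB weight of the block
    q_int = np.clip(np.round(block / scale),
                    -(2 ** (mant_bits - 1)), 2 ** (mant_bits - 1) - 1)  # signed mantissa range
    return q_int * scale

def fnrr(block, mant_bits):
    return np.linalg.norm(bfp_quantize(block, mant_bits)) / np.linalg.norm(block)

rng = np.random.default_rng(0)
block = rng.standard_normal(64).astype(np.float32)
for bits in (3, 5, 8):
    print(bits, "mantissa bits -> FnRR", round(float(fnrr(block, bits)), 4))
```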
Citations: 0
Peak-memory-aware partitioning and scheduling for multi-tenant DNN model inference
IF 4.1 | CAS Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2026-01-09 | DOI: 10.1016/j.sysarc.2026.103696
Jaeho Lee, Ju Min Lee, Haeeun Jeong, Hyunho Kwon, Youngsok Kim, Yongjun Park, Hanjun Kim
As Deep Neural Networks (DNNs) are widely used in various applications, multiple DNN inference models start to run on a single GPU. The simultaneous execution of multiple DNN models can overwhelm the GPU memory with increasing model size, leading to unexpected out-of-memory (OOM) errors. To avoid OOM errors, existing systems attempt to schedule models at either model-level or layer-level granularity. However, model-level scheduling schemes utilize memory inefficiently because they preallocate memory based on the model’s peak memory demand, and layer-level scheduling schemes suffer from high scheduling overhead due to overly fine-grained scheduling units. This work proposes a new peak-memory-aware DNN model partitioning compiler and scheduler, called Quilt. The Quilt compiler partitions a DNN model into multiple tasks based on their peak memory usage, and the Quilt scheduler orchestrates the tasks of multiple models without OOM errors. Additionally, the compiler generates a memory pool for tensors shared between partitioned tasks, reducing CPU–GPU communication overhead when consecutively executing the tasks. Compared to the model-level and layer-level scheduling schemes, Quilt reduces overall latency by 25.4% and 37.7%, respectively, while preventing OOM errors. Moreover, Quilt achieves up to 10.8% faster overall inference latency than the state-of-the-art Triton inference server for 6 DNN models.
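The peak-memory-aware partitioning idea can be sketched as a greedy pass that groups consecutive layers into tasks whose memory demand stays under a per-task budget, so a scheduler can preallocate per-task peaks instead of whole-model peaks. The memory model (sum of per-layer working sets) and all numbers are simplifying assumptions, not Quilt's actual compiler pass.

```python
# Hypothetical sketch of peak-memory-aware partitioning: close the current
# task whenever adding the next layer would exceed the per-task memory budget.
def partition_by_peak_memory(layer_mem_mb, budget_mb):
    tasks, current, current_mem = [], [], 0.0
    for name, mem in layer_mem_mb:
        if current and current_mem + mem > budget_mb:
            tasks.append((current, current_mem))      # close the current task
            current, current_mem = [], 0.0
        current.append(name)
        current_mem += mem
    if current:
        tasks.append((current, current_mem))
    return tasks

layers = [("conv1", 120), ("conv2", 300), ("conv3", 80), ("fc1", 700), ("fc2", 60)]
for names, peak in partition_by_peak_memory(layers, budget_mb=800):
    print(names, "task peak:", peak, "MB")            # two tasks with peaks 500 MB and 760 MB
```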
Citations: 0
A PCM-based hybrid online learning architecture with adaptive threshold sign-based backpropagation and pulse-aware conductance control
IF 4.1 | CAS Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2026-01-08 | DOI: 10.1016/j.sysarc.2026.103683
Zhenhao Jiao, Xiaogang Chen, Tao Hong, Shunfen Li, Xi Li, Weibang Dai, Chengcai Tu, Zhitang Song
The increasing demand for low-latency, energy-efficient online learning in edge devices has driven the exploration of neuromorphic computing and hybrid analog–digital architectures. In this work, we propose a phase-change memory (PCM)-based hybrid architecture for in-situ online learning, which integrates parallel analog matrix–vector multiplication with adaptive digital control. The system features two key innovations: (1) an adaptive threshold sign-based backpropagation (ATSBP) algorithm that dynamically adjusts quantization thresholds for activations and error signals based on real-time feedback from mini-batch statistics, and (2) a pulse-aware conductance control scheme that enables precise conductance tuning of PCM devices using experimentally calibrated pulse-conductance mappings. These mechanisms jointly reduce unnecessary write operations and enhance robustness against device nonidealities such as nonlinearity and drift. Through systematic validation, we demonstrate that our hybrid architecture significantly improves convergence stability and energy efficiency in online learning scenarios, without sacrificing classification accuracy. The proposed system highlights a promising pathway toward scalable, hardware-friendly neuromorphic learning on edge platforms.
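The adaptive threshold sign-based update can be sketched as follows: a threshold derived from mini-batch gradient statistics suppresses small gradients (avoiding PCM writes), and the surviving gradients contribute only their sign, so each written weight receives a fixed-size potentiation or depression pulse. The threshold rule and names below are illustrative assumptions, not the paper's exact algorithm.

```python
# Hedged sketch of an adaptive-threshold, sign-based weight update.
import numpy as np

def atsbp_update(weights, grads, lr, scale=1.0):
    threshold = scale * np.mean(np.abs(grads))        # threshold adapted per mini-batch
    mask = np.abs(grads) >= threshold                 # only these gradients trigger device writes
    pulses = np.sign(grads) * mask                    # -1, 0, or +1 per weight
    writes = int(mask.sum())                          # writes issued this step
    return weights - lr * pulses, writes

rng = np.random.default_rng(1)
w = rng.standard_normal(8)
g = rng.standard_normal(8) * np.array([1, 0.01, 1, 0.02, 1, 0.05, 1, 0.01])
w_new, writes = atsbp_update(w, g, lr=0.01)
print("writes this step:", writes, "of", g.size)       # small gradients are skipped entirely
```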
Citations: 0
FP8ApproxLib: An FPGA-based approximate multiplier library for 8-bit floating point
IF 4.1 | CAS Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2026-01-08 | DOI: 10.1016/j.sysarc.2026.103686
Ruiqi Chen, Yangxintong Lyu, Han Bao, Shidi Tang, Jindong Li, Yanxiang Zhu, Ming Ling, Bruno da Silva
The 8-bit floating-point (FP8) data format has been increasingly adopted in neural network (NN) computations due to its superior dynamic range compared to traditional 8-bit integer (INT8). Nevertheless, the heavy reliance on multiplication in neural network workloads leads to considerable energy consumption, even with FP8, particularly in the context of FPGA-based deployments. To this end, this paper presents FP8ApproxLib, an FPGA-based approximate multiplier library for FP8. Firstly, we conduct a bit-level analysis of the prior approximation method and introduce improvements to reduce the resulting computational error. Based on these, we implement a fine-grained optimized design on mainstream FPGAs (Altera and AMD) using primitives and templates combined with physical layout constraints. Moreover, an automated tool is developed to support user configuration and generate HDL code. We then evaluate the accuracy and hardware efficiency of the FP8 approximate multipliers. The results show that our proposed method achieves an average error reduction of 53.15% (36.74%–72.82%) compared to the prior FP8 approximation method. Moreover, compared to prior 8-bit approximate multipliers, our FP8 designs exhibit the lowest resource utilization. Finally, we integrate the design into the inference phase of three representative NN models (CNN, LLM, and Diffusion), demonstrating its excellent power efficiency. This is the first FP8 approximate multiplier design with architecture-aware fine-grained optimization and deployment for modern FPGA platforms, which can serve as a benchmark for future designs and comparisons of FPGA-based low-precision floating-point approximate multipliers. The code of this work is available in our GitLab.
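For context, the family of techniques floating-point approximate multipliers typically build on replaces the exact mantissa product with a cheaper mantissa addition (Mitchell-style). The sketch below shows that approximation on ordinary floats; it is not the paper's FPGA design and ignores FP8 encoding details such as E4M3 bit fields, saturation, and subnormals.

```python
# Toy sketch of a Mitchell-style mantissa-addition approximation to
# floating-point multiplication, to illustrate the accuracy/cost trade-off.
import math

def approx_fp_mul(a, b):
    sign = math.copysign(1.0, a) * math.copysign(1.0, b)
    ma, ea = math.frexp(abs(a))          # a = ma * 2**ea with ma in [0.5, 1)
    mb, eb = math.frexp(abs(b))
    # Rewrite each operand as (1 + f) * 2**(e-1) with f in [0, 1), then use
    # (1 + fa) * (1 + fb) ~= 1 + fa + fb   (drops the fa*fb cross term).
    fa, fb = 2 * ma - 1, 2 * mb - 1
    f = fa + fb
    exp = (ea - 1) + (eb - 1)
    if f >= 1.0:                          # carry into the exponent
        f, exp = f - 1.0, exp + 1
    return sign * (1.0 + f) * 2.0 ** exp

for x, y in [(1.5, 2.75), (-0.375, 5.0), (3.25, 3.25)]:
    exact, approx = x * y, approx_fp_mul(x, y)
    print(f"{x} * {y}: exact {exact}, approx {approx}, rel.err {(approx - exact) / exact:+.3f}")
```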
Citations: 0