Pub Date: 2025-10-16 | DOI: 10.1109/LES.2025.3598202
Julian Göppert;Axel Sikora
Industrial cyber-physical systems (ICPS) face rising cyberattacks, requiring secure credential management even in resource-constrained embedded systems. Standards specifying field-level communication of ICPS (e.g., PROFINET or OPC UA) define protocol-specific credential management processes, yet lack formal security verification. We propose a generic model capturing initial security onboarding and automated credential provisioning. Using ProVerif, an automatic symbolic protocol verifier, we formalize certificate-based authentication under a Dolev-Yao adversary, verifying private key secrecy, component authentication, and mutual authentication with the operator domain. Robustness checks confirm resilience against key leakage and highlight the vulnerabilities of the trust-on-first-use concept proposed by the standards. Our model offers the first formal guarantees for secure credential management in ICPS.
{"title":"Formal Modeling and Verification of Generic Credential Management Processes for Industrial Cyber–Physical Systems","authors":"Julian Göppert;Axel Sikora","doi":"10.1109/LES.2025.3598202","DOIUrl":"https://doi.org/10.1109/LES.2025.3598202","url":null,"abstract":"Industrial cyber-physical systems (ICPS) face rising cyberattacks, requiring secure credential management also in resource-constrained embedded systems. Standards specifying field level communication of ICPS (e.g., PROFINET or OPC UA) define protocol-specific credential management processes, yet lack formal security verification. We propose a generic model capturing initial security onboarding and automated credential provisioning. Using ProVerif, an automatic symbolic protocol verifier, we formalize certificate-based authentication under a Dolev-Yao adversary, verifying private key secrecy, component authentication, and mutual authentication with the operator domain. Robustness checks confirm resilience against key leakage and highlight the vulnerabilities of the trust on first use concept proposed by the standards. Our model offers the first formal guarantees for secure credential management in ICPS.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 5","pages":"349-352"},"PeriodicalIF":2.0,"publicationDate":"2025-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352114","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-16 | DOI: 10.1109/LES.2025.3600618
Bo Zhang;Yinkang Gao;Caixu Zhao;Xi Li
Ensuring predictable and repeatable behavior in concurrent real-time systems requires dataflow determinism—that is, each consumer task instance must always read data from the same producer instance. While the logical execution time (LET) model enforces this property, its software implementations typically rely on timed I/O or multibuffering protocols. These approaches introduce software complexity, execution overhead, and priority inversion, resulting in increased and unstable task response times, thereby degrading overall schedulability. We propose time-semantic memory instruction (TSMI), a new instruction set extension that embeds logical timing into memory access operations. Unlike existing LET implementations, TSMI enforces dataflow determinism at the instruction level, eliminating the need for memory protocols or access ordering constraints. We develop a TSMI microarchitectural implementation that translates TSMI instructions into standard memory accesses and a programming model that not only captures LET semantics but also enables more expressive, per-access dataflow control. A cycle-accurate RISC-V simulator with TSMI achieves up to 95.36% worst-case response time (WCRT) and 98.88% response time variability (RTV) reduction compared to existing methods.
{"title":"Instruction-Level Support for Deterministic Dataflow in Real-Time Systems","authors":"Bo Zhang;Yinkang Gao;Caixu Zhao;Xi Li","doi":"10.1109/LES.2025.3600618","DOIUrl":"https://doi.org/10.1109/LES.2025.3600618","url":null,"abstract":"Ensuring predictable and repeatable behavior in concurrent real-time systems requires dataflow determinism—that is, each consumer task instance must always read data from the same producer instance. While the logical execution time (LET) model enforces this property, its software implementations typically rely on timed I/O or multibuffering protocols. These approaches introduce software complexity, execution overhead, and priority inversion, resulting in increased and unstable task response times, thereby degrading overall schedulability. We propose time-semantic memory instruction (TSMI), a new instruction set extension that embeds logical timing into memory access operations. Unlike existing LET implementations, TSMI enforces dataflow determinism at the instruction level, eliminating the need for memory protocols or access ordering constraints. We develop a TSMI microarchitectural implementation that translates TSMI instructions into standard memory accesses and a programming model that not only captures LET semantics but also enables more expressive, per-access dataflow control. A cycle-accurate RISC-V simulator with TSMI achieves up to 95.36% worst-case response time (WCRT) and 98.88% response time variability (RTV) reduction compared to existing methods.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 5","pages":"341-344"},"PeriodicalIF":2.0,"publicationDate":"2025-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-16 | DOI: 10.1109/LES.2025.3600611
Saeyeon Kim;Sunyoung Park;Nahyeon Kim;Jiyoung Lee;Ji-Hoon Kim
Recent advancements in DRAM technology have increased the complexity and variety of memory faults, necessitating efficient and programmable fault diagnosis, especially in AI and automotive systems where reliability is critical. This letter proposes a Nested Loop Analyzer (NLA) integrated into a RISC-V-based memory test platform to enhance both efficiency and programmability in run-time memory testing. By leveraging Loop Control Flow Analysis and Basic Block Identification, the NLA eliminates complex loop control in pattern generation and reduces pattern buffer overhead between the Pattern Generator (PG) and the DRAM physical layer (PHY). Additionally, integrating memory testing within the RISC-V system-on-chip (SoC) environment enables seamless development and integration of memory testing with general application tasks. The proposed approach provides a high-programmability, run-time DRAM test pattern generation platform with efficient hardware usage, reduced buffer requirements, and seamless RISC-V integration.
{"title":"RISC-V Integrated Nested Loop Analyzer for Runtime DRAM Test Pattern Generation","authors":"Saeyeon Kim;Sunyoung Park;Nahyeon Kim;Jiyoung Lee;Ji-Hoon Kim","doi":"10.1109/LES.2025.3600611","DOIUrl":"https://doi.org/10.1109/LES.2025.3600611","url":null,"abstract":"Recent advancements in DRAM technology have increased the complexity and variety of memory faults, necessitating efficient and programmable fault diagnosis, especially in AI and automotive systems where reliability is critical. This letter proposes a Nested Loop Analyzer (NLA) integrated into a RISC-V-based memory test platform to enhance both efficiency and programmability in run-time memory testing. By leveraging Loop Control Flow Analysis and Basic Block Identification, the NLA eliminates complex loop control in pattern generation and reduces pattern buffer overhead between the Pattern Generator (PG) and the DRAM physical layer (PHY). Additionally, integrating memory testing within the RISC-V system-on-chip (SoC) environment enables seamless development and integration of memory testing with general application tasks. The proposed approach provides a high-programmability, run-time DRAM test pattern generation platform with efficient hardware usage, reduced buffer requirements, and seamless RISC-V integration.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 5","pages":"333-336"},"PeriodicalIF":2.0,"publicationDate":"2025-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-16 | DOI: 10.1109/LES.2025.3600560
Suraj Meshram;Sanket Jaipuriar;Arnab Sarkar;Arijit Mondal
Modern cyber–physical systems (CPSs) and IoT-enabled smart factories rely on human–robot collaboration (HRC) to combine human intuition and robotic precision in real time. Balancing such HRC assembly lines, where each task may execute in human-only, robot-only, or collaborative modes, poses a combinatorial challenge that defies scalable mixed integer linear programming (MILP) and oversimplified heuristics. In this letter, we present IBSHRC, a proof-of-concept Iterative Beam Search framework designed for single-product, straight-line CPS assembly systems. IBSHRC leverages mode-aware initialization, binary-search cycle-time refinement, and efficient pruning to navigate vast scheduling spaces at the network edge. On benchmark instances up to 100 tasks, our method delivers near-optimal cycle times with up to $300\times$ speed-ups over MILP (subsecond runtimes), demonstrating its promise for real-time, IoT-driven industrial scheduling.
{"title":"An Efficient Iterative Beam Search for Human–Robot Collaborative Assembly Line Balancing","authors":"Suraj Meshram;Sanket Jaipuriar;Arnab Sarkar;Arijit Mondal","doi":"10.1109/LES.2025.3600560","DOIUrl":"https://doi.org/10.1109/LES.2025.3600560","url":null,"abstract":"Modern cyber–physical systems (CPSs) and IoT-enabled smart factories rely on human–robot collaboration (HRC) to combine human intuition and robotic precision in real time. Balancing such HRC assembly lines, where each task may execute in human-only, robot-only, or collaborative modes, poses a combinatorial challenge that defies scalable mixed integer linear programming (MILP) and oversimplified heuristics. In this letter, we present IBSHRC, a proof-of-concept Iterative Beam Search framework designed for single-product, straight-line CPS assembly systems. IBSHRC leverages mode-aware initialization, binary-search cycle-time refinement, and efficient pruning to navigate vast scheduling spaces at the network edge. On benchmark instances up to 100 tasks, our method delivers near-optimal cycle times with up to <inline-formula> <tex-math>$300times $ </tex-math></inline-formula> speed-ups over MILP (subsecond runtimes), demonstrating its promise for real-time, IoT-driven industrial scheduling.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 5","pages":"313-316"},"PeriodicalIF":2.0,"publicationDate":"2025-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145351945","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-16 | DOI: 10.1109/LES.2025.3599829
Mohamed Marwen Moslah;Ramzi Zouari;Ahmad Shahnejat Bushehri;Felipe Gohring de Magalhaes;Gabriela Nicolescu
Autonomous driving systems increasingly depend on multimodal sensor fusion (deep sensor fusion (DSF)), integrating data from cameras, radar, and LiDAR to improve environmental perception and decision-making. The integration of deep learning models into sensor fusion has significantly enhanced perception capabilities, but it also raises concerns about the robustness of these models when exposed to adversarial attacks. As prior research on the adversarial robustness of TransFuser — one of the most advanced end-to-end transformer-based DSF models for autonomous driving — has been limited to single-modality attacks targeting the camera sensor, this work extends the investigation to assess the robustness of TransFuser under various attack scenarios, including those involving the LiDAR modality. We employed the fast gradient sign method (FGSM) and projected gradient descent (PGD) to perform single-channel adversarial attacks on camera and LiDAR modalities separately, as well as the joint-channel attack. The experiments were conducted in the CARLA simulator using the Town05 Short urban environment, including 32 routes featuring diverse driving scenarios. The results clearly demonstrate the vulnerability of TransFuser to adversarial attacks where transformer-based sensor fusion is utilized, particularly under joint-channel attacks. Our experiments demonstrate that LiDAR-targeted single-channel attacks significantly degrade driving performance, reducing the driving score by 49.87% under FGSM attacks, and by 50.15% and 42.12% under joint FGSM and PGD attacks, respectively. This study informs the design of more robust and secure DSF architectures for end-to-end autonomous driving.
{"title":"Investigation of the Adversarial Robustness of End-to-End Deep Sensor Fusion Models","authors":"Mohamed Marwen Moslah;Ramzi Zouari;Ahmad Shahnejat Bushehri;Felipe Gohring de Magalhaes;Gabriela Nicolescu","doi":"10.1109/LES.2025.3599829","DOIUrl":"https://doi.org/10.1109/LES.2025.3599829","url":null,"abstract":"Autonomous driving systems increasingly depend on multimodal sensor fusion (deep sensor fusion (DSF)), integrating data from cameras, radar, and LiDAR to improve environmental perception and decision-making. The integration of deep learning models into sensor fusion has significantly enhanced perception capabilities, but it also raises concerns about the robustness of these models when exposed to adversarial attacks. As prior research on the adversarial robustness of TransFuser — one of the most advanced end-to-end transformer-based DSF models for autonomous driving — has been limited to single-modality attacks targeting the camera sensor, this work extends the investigation to assess the robustness of TransFuser under various attack scenarios, including those involving the LiDAR modality. We employed the fast gradient sign method (FGSM) and projected gradient descent (PGD) to perform single-channel adversarial attacks on camera and LiDAR modalities separately, as well as the joint-channel attack. The experiments were conducted in the CARLA simulator using the Town05 Short urban environment, including 32 routes featuring diverse driving scenarios. The results clearly demonstrate the vulnerability of TransFuser to adversarial attacks where transformer-based sensor fusion is utilized, particularly under joint-channel attacks. Our experiments demonstrate that LiDAR-targeted single-channel attacks significantly degrade driving performance, reducing the driving score by 49.87% under FGSM attacks, and by 50.15% and 42.12% under joint FGSM and PGD attacks, respectively. This study informs the design of more robust and secure DSF architectures for end-to-end autonomous driving.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 5","pages":"325-328"},"PeriodicalIF":2.0,"publicationDate":"2025-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145351966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-16 | DOI: 10.1109/LES.2025.3600592
I-Yang Chen;Kai-Wei Hou;Ya-Shu Chen
Resistive random-access memory (ReRAM)-based processing-in-memory (PIM) architectures offer high computational parallelism for accelerating neural networks. However, they suffer from high power consumption, primarily due to the extensive use of analog-to-digital converters (ADCs). In this work, we propose ReCEN, a configurable engine designed to reduce energy consumption by dynamically adjusting the operating frequency of ADCs through column sparsity exploration in neural networks. To further enhance energy efficiency, we exploit sparsity by introducing effective bias and discarding least significant bits during ReRAM weight programming. Experimental results demonstrate that, under equivalent computational resources, our proposed engine significantly reduces ADC power consumption, thereby improving overall energy efficiency.
{"title":"A Configurable ReRAM Engine for Energy-Efficient Sparse Neural Network Acceleration","authors":"I-Yang Chen;Kai-Wei Hou;Ya-Shu Chen","doi":"10.1109/LES.2025.3600592","DOIUrl":"https://doi.org/10.1109/LES.2025.3600592","url":null,"abstract":"Resistive random-access memory (ReRAM)-based processing-in-memory (PIM) architectures offer high computational parallelism for accelerating neural networks. However, they suffer from high power consumption, primarily due to the extensive use of analog-to-digital converters (ADCs). In this work, we propose ReCEN, a configurable engine designed to reduce energy consumption by dynamically adjusting the operating frequency of ADCs through column sparsity exploration in neural networks. To further enhance energy efficiency, we exploit sparsity by introducing effective bias and discarding least significant bits during ReRAM weight programming. Experimental results demonstrate that, under equivalent computational resources, our proposed engine significantly reduces ADC power consumption, thereby improving overall energy efficiency.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 5","pages":"305-308"},"PeriodicalIF":2.0,"publicationDate":"2025-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-16 | DOI: 10.1109/LES.2025.3600586
Nilotpola Sarma;Sujeet Narayan Kamble;Chandan Karfa
Power side channel attacks (PSCAs) are a significant threat to secure cryptographic processors. Masking is an algorithmic countermeasure against PSCAs. During masking, the insertion of registers at designated locations according to the masking scheme is pivotal to ensure that glitches in hardware do not affect PSCA security. Such an insertion should be followed by the proper insertion of balancing registers to ensure an equal number of registers in all parallel paths of the design. This is called register balancing (RB). An RB procedure should also ensure minimum latency to meet the timing constraints of cryptographic workloads. Previously, RB has been carried out using a retiming-based approach that involves finding a correct set of retiming labels by solving a set of constraints on these labels, with a time complexity of $O(V(V+E))$. This work presents a faster RB approach with a time complexity of $O(V+E)$. The new RB approach, used to generate masked Canright AES-256 and PRESENT S-boxes, demonstrated up to $28\times$ faster automated synthesis of masked hardware over existing approaches.
{"title":"Efficient Register-Balancing for Masked Hardware","authors":"Nilotpola Sarma;Sujeet Narayan Kamble;Chandan Karfa","doi":"10.1109/LES.2025.3600586","DOIUrl":"https://doi.org/10.1109/LES.2025.3600586","url":null,"abstract":"Power side channel attacks (PSCAs) are a significant threat to secure cryptographic processors. Masking is an algorithmic countermeasure against PSCAs. During masking, the insertion of registers at designated locations according to the masking scheme is pivotal to ensure glitches in hardware do not effect the PSCA security. Such an insertion should be followed by proper insertion of balancing registers to ensure equal number of registers in all parallel paths of the design. This is called register balancing (RB). An RB procedure should also ensure minimum latency to meet the time constraints of cryptographic workloads. Previously, RB has been carried using a retiming-based approach which involved finding a correct set of retiming labels via solving a set of constraints on these labels, with a time complexity of <inline-formula> <tex-math>$O(V(V+E))$ </tex-math></inline-formula>. This work presents a faster RB approach with a time complexity of <inline-formula> <tex-math>$O(V+E)$ </tex-math></inline-formula>. The new RB approach used to generate masked AES-256 Canright’s and PRESENT S-boxes demonstrated upto <inline-formula> <tex-math>$28times $ </tex-math></inline-formula> faster automated synthesis of masked hardware over existing approaches.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 5","pages":"301-304"},"PeriodicalIF":2.0,"publicationDate":"2025-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-16 | DOI: 10.1109/LES.2025.3600612
Ashiqur Rahaman Molla;Ram Mohan Kota;Jaishree Mayank;Arnab Sarkar;Arijit Mondal;Soumyajit Dey
Software-defined vehicles (SDVs) employ zonal architectures whose zones exchange periodic real-time traffic over a high-speed Time-Sensitive Ethernet (IEEE 802.1Q) backbone. Existing research has produced many deterministic routing schemes for such backbone Ethernet networks, yet has largely ignored the complementary problem of payload-level frame minimization: multiplexing dozens of small zonal messages—whether CAN, LIN, or any other in-zone protocol—into the fewest possible Ethernet frames so that bandwidth, switch buffers, and latency budgets are not squandered. This letter concentrates on that neglected frame-minimization facet and presents an optimization flow dedicated to reducing the number of Ethernet frames. We first formulate an exact Satisfiability Modulo Theories (SMT) model that finds the minimal set of multiplexed frames required across an entire hyper-period. Because SMT becomes intractable on large vehicle topologies, we then introduce a matrix-based heuristic aggregation (MaHA) algorithm that reproduces the SMT model's frame-minimization decisions to within a few percent while executing in milliseconds. Experiments on synthetic SDV workloads show that naive one-zonal-message-per-Ethernet-frame policies waste significant available payload capacity; our SMT model eliminates this waste completely, and the heuristic achieves closely similar savings with up to three orders of magnitude less run-time, making it a practical drop-in solution for next-generation SDV networks.
{"title":"Minimizing Backbone Ethernet Traffic for Enabling Interzonal Messages in Software-Defined Vehicles","authors":"Ashiqur Rahaman Molla;Ram Mohan Kota;Jaishree Mayank;Arnab Sarkar;Arijit Mondal;Soumyajit Dey","doi":"10.1109/LES.2025.3600612","DOIUrl":"https://doi.org/10.1109/LES.2025.3600612","url":null,"abstract":"software-defined vehicles (SDVs) employ zonal architectures whose zones exchange periodic real-time traffic over a high-speed Time-Sensitive Ethernet (IEEE 802.1Q) backbone. Existing research has produced many deterministic routing schemes for such backbone Ethernet networks, yet has largely ignored the complementary problem of payload-level frame minimization: multiplexing dozens of small zonal messages—whether CAN, LIN, or any other in-zone protocol—into the fewest possible Ethernet frames so that bandwidth, switch buffers and latency budgets are not squandered. This letter concentrates on that neglected frame-minimization facet and presents an optimization flow dedicated to reducing the number of Ethernet frames. We first formulate an exact Satisfiability Modulo Theories (SMTs) model that finds the minimal set of multiplexed frames required across an entire hyper-period. Because SMT becomes intractable on large vehicle topologies, we then introduce a matrix-based heuristic aggregation (MaHA) algorithm that reproduces the SMT’s frame-minimization decisions to within a few percent while executing in milliseconds. Experiments on synthetic SDV workloads show that naive one zonal message per Ethernet frame policies waste significant available payload capacity; our SMT eliminates this waste completely, and the heuristic achieves closely similar savings with up to three orders of magnitude less run-time, making it a practical drop-in solution for next-generation SDV networks.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 5","pages":"353-356"},"PeriodicalIF":2.0,"publicationDate":"2025-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-16 | DOI: 10.1109/LES.2025.3600043
Ali Asghar;Shahzad Bangash;Suleman Shah;Laiq Hasan;Salim Ullah;Siva Satyendra Sahoo;Akash Kumar
FPGAs provide customizable, low-power, real-time acceleration of ML models for embedded systems, making them ideal for edge applications like robotics and IoT. However, ML models are computationally intensive and rely heavily on multiplication operations, which dominate the overall resource and power consumption, especially in deep neural networks. Currently available open-source frameworks, such as hls4ml, FINN, and Tensil AI, facilitate FPGA-based implementation of ML algorithms but exclusively use accurate arithmetic operators, failing to exploit the inherent error resilience of ML models. Meanwhile, a large body of research in approximate computing has produced approximate multipliers that offer substantial reductions in area, power, and latency by sacrificing a small amount of accuracy. However, these approximate multipliers are not integrated into widely used hardware generation workflows, and no automated mechanism exists for incorporating them into ML model implementations at both the software and hardware levels. In this work, we extend the hls4ml framework to support the use of approximate multipliers. Our approach enables seamless evaluation of multiple approximate designs, allowing tradeoffs between resource usage and inference accuracy to be explored efficiently. Experimental results demonstrate up to 3.94% LUT savings and a 7.33% reduction in on-chip power, with an accuracy degradation of 1% compared to accurate designs.
{"title":"EMGAxO: Extending Machine Learning Hardware Generators With Approximate Operators","authors":"Ali Asghar;Shahzad Bangash;Suleman Shah;Laiq Hasan;Salim Ullah;Siva Satyendra Sahoo;Akash Kumar","doi":"10.1109/LES.2025.3600043","DOIUrl":"https://doi.org/10.1109/LES.2025.3600043","url":null,"abstract":"FPGAs provide customizable, low-power, and real-time ML Models acceleration for embedded systems, making them ideal for edge applications like robotics and IoT. However, ML models are computationally intensive and rely heavily on multiplication operations, which dominate the overall resource and power consumption, especially in deep neural networks. Currently available open-source frameworks, such as hls4ml, FINN and Tensil artificial intelligence (AI), facilitate FPGA-based implementation of ML algorithms but exclusively use accurate arithmetic operators, failing to exploit the inherent error resilience of ML models. Meanwhile, a large body of research in approximate computing has produced Approximate Multipliers that offer substantial reductions in area, power, and latency by sacrificing a small amount of accuracy. However, these Approximate Multipliers are not integrated into widely used hardware generation workflows, and no automated mechanism exists for incorporating them into ML model implementations at both software and hardware levels. In this work, we extend the hls4ml framework to support the use of Approximate Multipliers. Our approach enables seamless evaluation of multiple approximate designs, allowing tradeoffs between resource usage and inference accuracy to be explored efficiently. Experimental results demonstrate up to 3.94% LUTs savings and 7.33% reduction in On-Chip Power, with accuracy degradation of 1% compared to accurate designs.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 5","pages":"345-348"},"PeriodicalIF":2.0,"publicationDate":"2025-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352113","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-16 | DOI: 10.1109/LES.2025.3601057
Reza Jahadi;Phil Munz;Ehsan Atoofian
The attention mechanism has become the backbone of machine learning applications, expanding beyond natural language processing into domains such as computer vision and recommendation systems. We observe that implementing attention layers on GPUs with tensor cores (TCs) using matrix-multiply-and-accumulate (MMA) operations is suboptimal, as the attention layer incurs an excessively large memory footprint and significant computational complexity, especially with a higher number of input elements. In this work, we propose a hardware-software co-design approach to accelerate the execution of attention layers and reduce energy consumption on GPUs with TCs. Our proposed mechanism efficiently processes costly attention operations through a unique fusion mechanism, reducing the memory requirements of attention layers. Additionally, we revisit the design of TCs and offload certain non-MMA operations within the attention layer from standard CUDA cores to TCs. We extend the instruction set of GPUs to support the new operations performed on TCs. Our evaluations reveal that optimizing both hardware and software for attention layers results in a 13.4% performance improvement and an 18.3% reduction in energy-delay on average, compared to a software-only optimization approach.
{"title":"Fused Tensor Core: A Hardware–Software Co-Design for Efficient Execution of Attentions on GPUs","authors":"Reza Jahadi;Phil Munz;Ehsan Atoofian","doi":"10.1109/LES.2025.3601057","DOIUrl":"https://doi.org/10.1109/LES.2025.3601057","url":null,"abstract":"Attention mechanism has become the backbone of machine learning applications, expanding beyond natural language processing into domains, such as computer vision and recommendation systems. We observe that implementing attention layers on GPUs with tensor cores (TCs) using matrix-multiply and accumulate (MMA) operations is suboptimal as the attention layer incurs an excessively large-memory footprint and significant computational complexity, especially with a higher number of input elements. In this work, we propose a hardware-software co-design approach to accelerate the execution of attention layers and reduce energy consumption on GPUs with TCs. Our proposed mechanism efficiently processes costly attention operations through a unique fusion mechanism, reducing the memory requirements of attention layers. Additionally, we revisit the design of TCs and offload certain non-MMA operations within the attention layer from standard CUDA cores to TCs. We extend the instruction set of GPUs to support new operations performed on TCs. Our evaluations reveal that optimizing both hardware and software for attention layers results in a 13.4% performance improvement and an 18.3% reduction in energy-delay on average, compared to a software-only optimization approach.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 5","pages":"317-320"},"PeriodicalIF":2.0,"publicationDate":"2025-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}