Automatic Linux malware detection using binary inspection and runtime opcode tracing
Pub Date: 2025-12-12 | DOI: 10.1016/j.micpro.2025.105237 | Microprocessors and Microsystems, vol. 120, Article 105237
Martí Alonso, Andreu Gironés, Juan-José Costa, Enric Morancho, Stefano Di Carlo, Ramon Canal
The fast-paced evolution of cyberattacks against digital infrastructures requires new protection mechanisms to counter them. Malware, a class of attack ranging from viruses and worms to ransomware and spyware, has traditionally been detected using signature-based methods. These methods struggle against new malware variants, and machine learning tools look promising as an alternative. In this paper we present two methods to detect Linux malware using machine learning models: (1) a dynamic approach that traces the instructions (opcodes) executed by the application at run time; and (2) a static approach that inspects the application binary before execution. We evaluate (1) five machine learning models (Support Vector Machine, k-Nearest Neighbor, Naive Bayes, Decision Tree and Random Forest) and (2) a deep neural network based on a Long Short-Term Memory architecture with word embedding. We describe the methodology, the initial dataset preparation, the infrastructure used to obtain the traces of executed instructions, and the evaluation of the results for the different models. The results show that the dynamic approach with a Random Forest classifier achieves 90% accuracy or higher, while the static approach reaches 98% accuracy.
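The dynamic pipeline described above lends itself to a compact software illustration. The sketch below is a minimal, assumed setup, not the authors' actual pipeline or dataset: it classifies opcode traces with a Random Forest over opcode n-gram counts, and the traces, labels, and feature settings are toy placeholders.

```python
# Minimal sketch: opcode-trace malware classification with a Random Forest.
# The traces and labels below are toy placeholders, not the paper's dataset.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# Each sample is a space-separated sequence of executed opcodes (dynamic trace).
traces = [
    "push mov call mov add ret",        # benign-looking trace (toy)
    "xor xor xor jmp int syscall",      # malicious-looking trace (toy)
    "mov mov cmp jne call ret",
    "syscall xor jmp xor int syscall",
]
labels = [0, 1, 0, 1]  # 0 = benign, 1 = malware (toy labels)

# Opcode 1-gram and 2-gram counts as features, then a Random Forest classifier.
model = make_pipeline(
    CountVectorizer(analyzer="word", ngram_range=(1, 2)),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(traces, labels)
print(model.predict(["xor jmp int syscall syscall"]))  # -> [1] (toy prediction)
```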
SHAX: Evaluation of SVM hardware accelerator for detecting and preventing ROP on Xtensa
Pub Date: 2025-12-04 | DOI: 10.1016/j.micpro.2025.105236 | Microprocessors and Microsystems, vol. 120, Article 105236
Adebayo Omotosho, Sirine Ilahi, Ernesto Cristopher Villegas Castillo, Christian Hammer, Hans-Martin Bluethgen
Return-oriented programming (ROP) chains together sequences of instructions residing in executable pages of memory to compromise a program’s control flow. On embedded systems, ROP detection is intricate because such devices lack the resources to directly run sophisticated software-based detection techniques, which are memory- and CPU-intensive.
However, a Field Programmable Gate Array (FPGA) can enhance the capabilities of an embedded device to handle resource-intensive tasks. Hence, this paper presents the first performance evaluation of a Support Vector Machine (SVM) hardware accelerator for automatic ROP classification on Xtensa-based embedded devices using hardware performance counters (HPCs).
In addition to meeting security requirements, modern cyber–physical systems must exhibit high reliability against hardware failures to ensure correct functionality. To assess the reliability level of our proposed SVM architecture, we perform simulation-based fault injection at the RT-level. To improve the efficiency of this evaluation, we utilize a hybrid virtual prototype that integrates the RT-level model of the SVM accelerator with the Tensilica LX7 Instruction Set Simulator. This setup enables early-stage reliability assessment, helping to identify vulnerabilities and reduce the need for extensive fault injection campaigns during later stages of the design process.
Our evaluation results show that an SVM accelerator targeting an FPGA device can detect and prevent ROP attacks on an embedded processor with high accuracy in real time. In addition, we identify the locations of our SVM design most vulnerable to permanent faults, enabling the exploration of safety mechanisms that increase fault coverage in future work.
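As a rough software-level illustration of the classification task SHAX accelerates in hardware, the sketch below trains an SVM on hardware-performance-counter samples; the feature names (branch misses, return instructions, retired instructions) and the numbers are assumptions made for the example, not the paper's measured HPC events or dataset.

```python
# Minimal sketch: SVM classification of ROP-like vs. benign behaviour from
# hardware-performance-counter (HPC) samples. Feature names and numbers are
# illustrative assumptions, not the SHAX dataset.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Columns: [branch_misses, return_instructions, instructions_retired] per interval.
X = np.array([
    [120,  40, 100_000],   # benign sample (toy)
    [900, 300,  80_000],   # ROP-like sample (toy): many mispredicted returns
    [150,  45, 110_000],
    [950, 320,  75_000],
])
y = np.array([0, 1, 0, 1])  # 0 = benign, 1 = ROP chain detected

clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
clf.fit(X, y)
print(clf.predict([[870, 280, 78_000]]))  # -> [1]
```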
Hardware and software design of APEnetX: A custom high-speed interconnect for scientific computing
Pub Date: 2025-11-21 | DOI: 10.1016/j.micpro.2025.105224 | Microprocessors and Microsystems, vol. 120, Article 105224
Roberto Ammendola, Andrea Biagioni, Carlotta Chiarini, Paolo Cretaro, Ottorino Frezza, Francesca Lo Cicero, Alessandro Lonardo, Michele Martinelli, Pier Stanislao Paolucci, Elena Pastorelli, Pierpaolo Perticaroli, Luca Pontisso, Cristian Rossi, Francesco Simula, Piero Vicini
High-speed interconnects are critical for providing robust and highly efficient services to every user in a cluster. Several commercial offerings – many of which are now firmly established in the market – have arisen over the years, spanning the many possible trade-offs between cost, reconfigurability, performance, resiliency and support for a variety of processing architectures. On the other hand, custom interconnects may represent an appealing solution for applications requiring cost-effectiveness, customizability and flexibility.
In this regard, the APEnet project was started in 2003, focusing on the design of PCIe FPGA-based custom Network Interface Cards (NICs) for cluster interconnects with a 3D torus topology. In this work, we highlight the main features of APEnetX, the latest version of the APEnet NIC. Designed on the Xilinx Alveo U200 card, it implements Remote Direct Memory Access (RDMA) transactions using both Xilinx UltraScale+ IPs and custom hardware and software components to ensure efficient data transfer without the involvement of the host operating system. The software stack lets the user interface with the NIC directly via a low-level driver or through a plug-in for the OpenMPI stack, aligning our NIC with the application-layer standards of the HPC community. The APEnetX architecture integrates a Quality-of-Service (QoS) scheme to preserve a guaranteed level of performance during network congestion events. Finally, APEnetX is accompanied by an OMNeT++-based simulator that enables probing the performance of the network when its size is pushed to numbers of nodes otherwise unattainable for cost and/or practicality reasons.
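The abstract mentions a QoS scheme that preserves performance under congestion but does not detail it; the sketch below is a generic weighted round-robin arbiter used only to illustrate the idea of class-based servicing during congestion, and is not APEnetX's actual QoS logic. The class names and weights are assumptions for the example.

```python
# Generic illustration (not APEnetX's QoS implementation): a weighted round-robin
# arbiter that keeps higher-priority traffic classes serviced during congestion.
from collections import deque

def weighted_round_robin(queues, weights, rounds):
    """queues: {class_name: deque of packets}; weights: packets served per round."""
    served = []
    for _ in range(rounds):
        for name, quota in weights.items():
            q = queues[name]
            for _ in range(min(quota, len(q))):
                served.append((name, q.popleft()))
    return served

queues = {
    "latency_sensitive": deque(f"L{i}" for i in range(4)),
    "bulk_rdma":         deque(f"B{i}" for i in range(8)),
}
# Serve 2 latency-sensitive packets for every bulk packet per round (toy weights).
print(weighted_round_robin(queues, {"latency_sensitive": 2, "bulk_rdma": 1}, rounds=3))
```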
ALFA: Design of an accuracy-configurable and low-latency fault-tolerant adder
Pub Date: 2025-11-19 | DOI: 10.1016/j.micpro.2025.105226 | Microprocessors and Microsystems, vol. 120, Article 105226
Ioannis Tsounis, Dimitris Agiakatsikas, Mihalis Psarakis
Low-Latency Approximate Adders (LLAAs) are high-performance adder models that perform either approximate addition with configurable accuracy loss or accurate addition by integrating circuitry to detect and correct the expected approximation error. Due to their block-based structure, these adder models offer lower latency at the expense of configurable accuracy loss and area overhead. However, hardware accelerators employing such adders are susceptible to hardware (HW) faults, which can cause extra errors (i.e., HW errors) in addition to the expected approximation errors during their operation. In this work, we propose a novel Accuracy-Configurable, Low-latency and Fault-tolerant Adder, namely ALFA, that offers 100% fault coverage while taking the required accuracy level into account. Our approach takes advantage of the resemblance between HW errors and approximation errors to build a scheme based on selective Triple Modular Redundancy (TMR), which can detect and correct all errors that violate the accuracy threshold. The proposed ALFA model for approximate operation achieves significant performance gains with minimal area overhead compared to state-of-the-art Reduced Precision Redundancy (RPR) Ripple Carry Adders (RCAs) with the same level of fault tolerance. Furthermore, the accurate ALFA model outperforms the RCA with classical TMR in terms of performance.
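To make the block-based idea behind LLAAs concrete, the sketch below is a behavioural model of a generic accuracy-configurable block adder in which each block speculates its carry-in from the previous block alone and flags mispredictions. Block width and operand width are arbitrary assumptions, and the model illustrates the adder family, not ALFA's specific circuit.

```python
# Behavioural sketch of a block-based low-latency approximate adder (generic
# illustration in the spirit of LLAAs, not the exact ALFA micro-architecture).
# Each k-bit block speculates its carry-in from the previous block alone,
# so no carry ripples across more than one block boundary.

def block_approx_add(a, b, width=16, k=4):
    mask = (1 << k) - 1
    approx = 0
    mispredicted = False
    exact_carry = 0          # carry of the fully accurate addition, for checking
    spec_carry = 0           # carry speculated from the previous block only
    for i in range(0, width, k):
        blk_a = (a >> i) & mask
        blk_b = (b >> i) & mask
        # Approximate block sum uses the speculated carry-in.
        blk_sum = blk_a + blk_b + spec_carry
        approx |= (blk_sum & mask) << i
        # Error detection: speculation differs from the exact carry-in.
        if spec_carry != exact_carry:
            mispredicted = True
        # Next block's speculation comes from this block computed with carry-in 0.
        spec_carry = (blk_a + blk_b) >> k
        exact_carry = (blk_a + blk_b + exact_carry) >> k
    return approx, mispredicted

a, b = 0x3F2A, 0x00D7
approx, err = block_approx_add(a, b)
print(hex(approx), hex((a + b) & 0xFFFF), err)  # approximate vs. exact sum, error flag
```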
A runtime-adaptive transformer neural network accelerator on FPGAs
Pub Date: 2025-11-17 | DOI: 10.1016/j.micpro.2025.105223 | Microprocessors and Microsystems, vol. 120, Article 105223
Ehsan Kabir, Jason D. Bakos, David Andrews, Miaoqing Huang
Transformer neural networks (TNNs) excel in natural language processing (NLP), machine translation, and computer vision (CV) without relying on recurrent or convolutional layers. However, they have high computational and memory demands, particularly on resource-constrained devices like FPGAs. Moreover, transformer models vary in processing time across applications, requiring custom models with specific parameters. Designing custom accelerators for each model is complex and time-intensive. Some custom accelerators exist with no runtime adaptability, and they often rely on sparse matrices to reduce latency; however, hardware design becomes more challenging due to the need for application-specific sparsity patterns. This paper introduces ADAPTOR, a runtime-adaptive accelerator for dense matrix computations in transformer encoders and decoders on FPGAs. ADAPTOR enhances the utilization of processing elements and on-chip memory, increasing parallelism and reducing latency. It incorporates efficient matrix tiling to distribute resources across FPGA platforms and is fully quantized for computational efficiency and portability. Evaluations on Xilinx Alveo U55C data center cards and embedded platforms like the VC707 and ZCU102 show that our design is 1.2× and 2.87× more power-efficient than the NVIDIA K80 GPU and the i7-8700K CPU, respectively. Additionally, it achieves a speedup of 1.7× to 2.25× compared to some state-of-the-art FPGA-based accelerators.
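Two of the techniques the abstract highlights, dense matrix tiling and quantized arithmetic, can be shown in a few lines of software. The sketch below performs a tiled int8 matrix multiply with int32 accumulation, the kind of dense kernel such an accelerator maps onto its processing elements; tile size, matrix shapes, and precisions are assumptions for the example, not ADAPTOR's actual configuration.

```python
# Minimal sketch of tiled, quantized dense matrix multiplication.
import numpy as np

def tiled_matmul_int8(A, B, tile=32):
    """C = A @ B computed tile by tile with int32 accumulation."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=np.int32)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                # Each tile maps naturally onto a fixed array of processing elements.
                C[i:i+tile, j:j+tile] += (
                    A[i:i+tile, k:k+tile].astype(np.int32)
                    @ B[k:k+tile, j:j+tile].astype(np.int32)
                )
    return C

rng = np.random.default_rng(0)
A = rng.integers(-128, 127, size=(64, 96), dtype=np.int8)
B = rng.integers(-128, 127, size=(96, 48), dtype=np.int8)
# Tiled result matches the plain int32 matrix multiply.
assert np.array_equal(tiled_matmul_int8(A, B),
                      A.astype(np.int32) @ B.astype(np.int32))
```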
Cardiac arrhythmia classification system: An optimized HLS-based hardware implementation on PYNQ platform
Pub Date: 2025-11-17 | DOI: 10.1016/j.micpro.2025.105225 | Microprocessors and Microsystems, vol. 120, Article 105225
Soumyashree Mangaraj, Kamalakanta Mahapatra, Samit Ari
Electrocardiogram (ECG) analysis is a popular non-invasive technique for diagnosing cardiac abnormalities. Deep learning (DL) architectures, and their hardware deployment at the edge, are crucial for effective diagnosis in smart healthcare applications. Running such inference on resource-limited FPGA platforms poses a significant challenge given the intense mathematical computations of DL architectures. Existing FPGA-implemented convolutional neural network (CNN) architectures typically adopt sequential deep convolutional stacking, which demands repeated memory accesses to retrieve data, ultimately degrading throughput and adding latency. A hardware-efficient tri-branch CNN architecture is introduced for arrhythmia classification, which leverages the FPGA’s intrinsic parallelism and minimizes data-management overhead. The proposed CNN’s hardware architecture is implemented in a high-level synthesis (HLS) framework through three key optimizations: (i) a pool-conv-graded-quantized (PCGQ) module, (ii) an in-pool merged function module, and (iii) a skip-zero connection. These enhancements improve layer-level precision, reduce quantization error, lower latency, and optimize FPGA resource utilization. Implemented on a PYNQ-Z2 FPGA, the design utilizes 27.79% of LUTs, 12.24% of FFs, 50.45% of DSPs, 34.29% of BRAM, and delivers 347 GOPS throughput at 45 ms latency, validated in Vivado 2022.2. The proposed system is assessed using the MIT-BIH Arrhythmia Dataset in accordance with AAMI EC57 standards, and attains a classification accuracy of 97.98% across five types of ECG beats, highlighting its suitability for portable healthcare applications.
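A tri-branch CNN with parallel branches is straightforward to express in software; the sketch below is a hedged guess at such a topology for 5-class beat classification, with branch depths, kernel sizes, channel counts, and input length chosen for illustration rather than taken from the paper.

```python
# Hedged sketch of a tri-branch 1D CNN for 5-class ECG beat classification.
# Hyperparameters are illustrative assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class TriBranchECG(nn.Module):
    def __init__(self, num_classes=5, in_len=180):
        super().__init__()
        def branch(kernel):
            # Pool before convolving to keep each branch shallow and cheap.
            return nn.Sequential(
                nn.MaxPool1d(2),
                nn.Conv1d(1, 8, kernel, padding=kernel // 2),
                nn.ReLU(),
                nn.AdaptiveAvgPool1d(16),
            )
        # Three parallel branches with different receptive fields run concurrently.
        self.b1, self.b2, self.b3 = branch(3), branch(7), branch(15)
        self.fc = nn.Linear(3 * 8 * 16, num_classes)

    def forward(self, x):                      # x: (batch, 1, in_len)
        feats = [b(x) for b in (self.b1, self.b2, self.b3)]
        return self.fc(torch.cat(feats, dim=1).flatten(1))

model = TriBranchECG()
beat = torch.randn(4, 1, 180)                  # 4 toy single-lead beat segments
print(model(beat).shape)                       # -> torch.Size([4, 5])
```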
A CGRA frontend for bandwidth utilization in HiPReP
Pub Date: 2025-11-08 | DOI: 10.1016/j.micpro.2025.105220 | Microprocessors and Microsystems, vol. 119, Article 105220
Philipp Käsgen, Markus Weinhardt, Christian Hochberger
When dealing with multiple data consumers and producers in a highly parallel accelerator architecture, the challenge arises of how to coordinate their requests to memory. An example of such an accelerator is a coarse-grained reconfigurable array (CGRA). CGRAs consist of multiple processing elements (PEs) which can consume and produce data. On the one hand, the resulting load and store requests to memory need to be orchestrated so that the CGRA does not deadlock when connected to a cache hierarchy that responds to memory requests out of request order. On the other hand, multiple consumers and producers open up the possibility of making better use of the available memory bandwidth so that the cache is constantly busy. We call the unit that addresses these challenges and opportunities the frontend (FE).
We propose a synthesizable FE for the HiPReP CGRA which enables integration with a RISC-V based host system. Based on an example application, we showcase a methodology to match the number of consumers and producers (i.e., PEs) with the memory hierarchy so that the CGRA can efficiently harness the available L1 data cache bandwidth, reaching 99.6% of the theoretical peak bandwidth in a synthetic benchmark and enabling a speedup of up to 21.9x over an out-of-order processor for dense matrix-matrix multiplications. Moreover, we explore the FE design, the impact of different numbers of PEs, memory access patterns, and synthesis results, and compare the accelerator runtime with the runtime on the host itself as a baseline.
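The ordering problem the frontend must solve can be illustrated with a toy reorder buffer: loads are tagged at issue, the cache answers out of request order, and data are delivered to the consumer in issue order so nothing stalls forever. This is a generic illustration of the problem under assumed tagging, not HiPReP's FE implementation.

```python
# Toy reorder buffer: deliver out-of-order cache responses in request order.
import random

def issue_and_deliver(num_requests, seed=0):
    rng = random.Random(seed)
    # Tag every request with its issue index.
    outstanding = list(range(num_requests))
    responses = outstanding[:]
    rng.shuffle(responses)                 # cache answers out of request order

    reorder_buffer = {}
    next_expected = 0
    delivered = []
    for tag in responses:
        reorder_buffer[tag] = f"data_{tag}"
        # Drain the buffer in request order as soon as the head tag is present.
        while next_expected in reorder_buffer:
            delivered.append(reorder_buffer.pop(next_expected))
            next_expected += 1
    return delivered

print(issue_and_deliver(8))   # always ['data_0', ..., 'data_7'], in issue order
```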
Machine learning for predicting digital block layout feasibility in Analog-On-Top designs
Pub Date: 2025-11-04 | DOI: 10.1016/j.micpro.2025.105221 | Microprocessors and Microsystems, vol. 119, Article 105221
Francesco Daghero, Gabriele Faraone, Eugenio Serianni, Nicola Di Carolo, Giovanna Antonella Franchino, Michelangelo Grosso, Daniele Jahier Pagliari
The Analog-On-Top (AoT) Mixed-Signal (AMS) design flow is a time-consuming process, heavily reliant on expert knowledge and manual iteration. A critical step involves reserving top-level layout regions for digital blocks, which typically requires several back-and-forth exchanges between analog and digital teams due to the complex interplay of design constraints that affect the digital area requirements. Existing automated approaches often fail to generalize, as they are benchmarked on overly simplistic designs that lack real-world complexity. In this work, we frame the area adequacy check as a binary classification task and propose a Machine Learning (ML) solution to predict whether the reserved area for a digital block is sufficient. We conduct an extensive evaluation across multiple ML models on a dataset of production-level designs, achieving up to 94.38% F1 score with a Random Forest. Finally, we apply ensemble techniques to improve performance further, reaching 95.35% F1 with a majority-vote ensemble.
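The classification setup described above maps directly onto standard scikit-learn components. The sketch below reproduces the idea (a binary "area adequate?" predictor with a Random Forest and a hard-voting ensemble) on synthetic data; the features and the non-Random-Forest ensemble members are assumptions, not the production-design dataset or the paper's exact ensemble.

```python
# Sketch: binary area-adequacy prediction with a Random Forest and a
# majority-vote ensemble, on synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier, VotingClassifier,
                              GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Synthetic stand-in for per-block features (e.g. cell area, utilization, pins).
X, y = make_classification(n_samples=500, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("RF F1:", round(f1_score(y_te, rf.predict(X_te)), 4))

# Majority-vote (hard voting) ensemble over heterogeneous models.
ensemble = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
                ("gb", GradientBoostingClassifier(random_state=0)),
                ("lr", LogisticRegression(max_iter=1000))],
    voting="hard",
).fit(X_tr, y_tr)
print("Ensemble F1:", round(f1_score(y_te, ensemble.predict(X_te)), 4))
```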
FORTALESA: Fault-tolerant reconfigurable systolic array for DNN inference
Pub Date: 2025-10-29 | DOI: 10.1016/j.micpro.2025.105222 | Microprocessors and Microsystems, vol. 119, Article 105222
Natalia Cherezova, Artur Jutman, Maksim Jenihhin
The emergence of Deep Neural Networks (DNNs) in mission- and safety-critical applications brings their reliability to the forefront. The high performance demands of DNNs require the use of specialized hardware accelerators. The systolic array architecture is widely used in DNN accelerators due to its parallelism and regular structure. This work presents a run-time reconfigurable systolic array architecture with three execution modes and four implementation options. All four implementations are evaluated in terms of resource utilization, throughput, and fault tolerance improvement. The proposed architecture is used to enhance the reliability of DNN inference on a systolic array through heterogeneous mapping of different network layers to different execution modes. The approach is supported by a novel reliability assessment method based on fault propagation analysis, which is used to explore the appropriate execution mode-to-layer mapping for DNN inference. The proposed architecture efficiently protects the registers and MAC units of systolic array PEs from transient and permanent faults. The reconfigurability feature enables a speedup of up to 3×, depending on layer vulnerability. Furthermore, it requires 6× fewer resources compared to static redundancy and 2.5× fewer resources compared to the previously proposed solution for transient faults.
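One redundant execution mode such an architecture can offer, triplicated execution with majority voting around a processing element's multiply-accumulate, is easy to model functionally. The sketch below is a toy model of that generic mechanism and its fault-masking behaviour, not FORTALESA's RTL or its specific three modes; the fault injection is purely illustrative.

```python
# Toy functional model of a triplicated (TMR-style) execution mode for one
# processing element: the same MAC runs three times and is majority-voted,
# so a single faulty copy is masked.
from collections import Counter

def mac(a, b, acc, fault=None):
    """One multiply-accumulate; 'fault' optionally corrupts the result (toy model)."""
    result = acc + a * b
    return result ^ fault if fault is not None else result

def voted_mac(a, b, acc, faults=(None, None, None)):
    """Triplicated execution mode: run the MAC three times and majority-vote."""
    copies = [mac(a, b, acc, f) for f in faults]
    value, votes = Counter(copies).most_common(1)[0]
    return value, votes >= 2   # voted value and whether a majority agreed

print(voted_mac(3, 4, 10))                          # (22, True)  fault-free
print(voted_mac(3, 4, 10, faults=(None, 8, None)))  # (22, True)  single fault masked
```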
Power/accuracy-aware dynamic workload optimization combining application autotuning and runtime resource management on homogeneous architectures
Pub Date: 2025-10-20 | DOI: 10.1016/j.micpro.2025.105219 | Microprocessors and Microsystems, vol. 119, Article 105219
Roberto Rocco, Francesco Gianchino, Antonio Miele, Gianluca Palermo
Nowadays, most computing systems experience highly dynamic workloads, with performance-demanding applications entering and leaving the system unpredictably. Ensuring their performance guarantees has led to the design of adaptive mechanisms, including (i) application autotuners, able to optimize algorithmic parameters (e.g., frame resolution in a video processing application), and (ii) runtime resource managers that distribute computing resources among the running applications and tune architectural knobs (e.g., frequency scaling). Past work investigates the two directions separately, acting on a limited set of control knobs and objective functions; instead, this work proposes a combined framework that integrates these two complementary approaches in a single two-level governor acting on the overall hardware/software stack. The resource manager incorporates a policy for computing resource distribution and architectural knobs to guarantee the required performance of each application while limiting the side effects on result quality and minimizing system power consumption. Meanwhile, the autotuner manages the applications’ software knobs, ensuring result quality and performance-constraint satisfaction while hiding application details from the controller. Experimental evaluation carried out on a homogeneous architecture for workstation machines demonstrates that the proposed framework is stable and can save more than 72% of the power consumed by one-layer solutions.
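The division of labour between the two levels can be sketched as a simple control loop: a resource manager shifts cores toward lagging applications, and each application's autotuner lowers a software knob only when hardware knobs alone do not meet the target. Everything below (the policy, the resolution knob, and the crude throughput model) is a hypothetical illustration, not the paper's governor.

```python
# Hypothetical two-level control loop: resource manager + per-app autotuner.
def resource_manager(apps, total_cores):
    # Give a larger core share to applications missing their target (toy policy).
    lagging = [a for a in apps if a["throughput"] < a["target"]]
    total_share = sum(2 if a in lagging else 1 for a in apps)
    for app in apps:
        share = 2 if app in lagging else 1
        app["cores"] = max(1, total_cores * share // total_share)

def autotuner(app):
    # Last resort: trade quality for speed by lowering the resolution knob.
    if app["throughput"] < app["target"] and app["resolution"] > 360:
        app["resolution"] -= 120

apps = [
    {"name": "video", "target": 30, "throughput": 22, "cores": 2, "resolution": 1080},
    {"name": "batch", "target": 10, "throughput": 12, "cores": 2, "resolution": 1080},
]
for _ in range(3):                      # a few governor epochs (toy simulation)
    resource_manager(apps, total_cores=8)
    for app in apps:
        autotuner(app)
        # Crude model: throughput scales with cores and inversely with resolution.
        app["throughput"] = app["cores"] * 11 * 1080 / app["resolution"]
print(apps)
```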