Pub Date: 2026-01-24 | DOI: 10.1016/j.micpro.2026.105251
Premalatha R, Jayanthi K B, Rajasekaran C, Sureshkumar R
Vision Transformer (ViT) models have demonstrated excellent performance in medical image processing, but their deployment in resource-constrained settings is limited by high computational complexity and memory requirements. Although Low-Rank Adaptation (LoRA) enables parameter-efficient tuning of ViT models, its use on real-time clinical datasets and edge devices remains largely unexplored. Using a real-time lung infection dataset, this research assesses ViT-LoRA's effectiveness in real-world medical imaging scenarios and investigates its generalisation on a public COVID-19 CT dataset. Four ViT fine-tuning procedures are thoroughly compared: LoRA-based tuning (ViT-LoRA), adapter-based tuning (ViT-APT), partial fine-tuning (ViT-PFT), and full fine-tuning (ViT-FFT). ViT-LoRA attains a testing accuracy of 98.50% with only 2.104 million trainable parameters, yielding a significantly reduced memory footprint of 24.08 MB. The optimized ViT-LoRA model was deployed on an NVIDIA Jetson Nano and evaluated on 30 test images, averaging 3.44 seconds per test image for real-time edge-based medical imaging applications.
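LoRA's parameter savings come from replacing each full weight update with a rank-r factorization that is trained while the pretrained weight stays frozen. A rough, illustrative NumPy sketch (function name, dimensions, and scaling are ours, not from the paper):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """Forward pass of a LoRA-adapted linear layer.

    W     : frozen pretrained weight, shape (d_out, d_in)
    A, B  : trainable low-rank factors, shapes (r, d_in) and (d_out, r)
    alpha : scaling hyperparameter; effective update is (alpha / r) * B @ A
    """
    r = A.shape[0]
    delta_W = (alpha / r) * (B @ A)  # rank-r update to the frozen weight
    return x @ (W + delta_W).T

# Trainable-parameter count: full d_out x d_in update vs. r * (d_in + d_out)
d_in, d_out, r = 768, 768, 8
full = d_in * d_out            # parameters a full fine-tune would train
lora = r * (d_in + d_out)      # parameters LoRA trains instead
```

With A initialized to zero (the usual LoRA convention), the adapted layer starts out exactly equal to the pretrained one, which is why fine-tuning can begin from the frozen model's behavior.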
Title: ViT-LoRA: Optimized vision transformer for efficient edge computing in medical imaging
Published in: Microprocessors and Microsystems, vol. 121, Article 105251
Pub Date: 2026-01-21 | DOI: 10.1016/j.micpro.2026.105250
Rubén Nieto, Laura de Diego-Otón, Miguel Tapiador, Víctor M. Navarro, Santiago Murano, Álvaro Hernández, Jesús Ureña
Non-Intrusive Load Monitoring (NILM) systems disaggregate the individual consumption of different appliances from aggregate electrical measurements, for applications such as improving energy efficiency at home. In other contexts, NILM techniques also help promote independent living for the elderly, as they enable the inference and monitoring of behavior through analysis of energy consumption and identification of appliance usage patterns. To achieve this, aggregated voltage and current signals are collected at the entrance of the house using a NILM sensor system. This analysis often involves sending the collected data to the cloud for further processing, which can result in significant bandwidth usage, especially when a high sampling rate is employed. In this work, a System-on-Chip (SoC) architecture based on an FPGA (Field-Programmable Gate Array) is proposed for NILM processing performed fully at the edge, targeting Ambient Intelligence for Independent Living (AIIL) of the elderly. Voltage and current data are acquired at 4 kSPS (kilo-samples per second); on/off switching events of appliances are detected, delimiting a window of 4096 samples around both signals. These windows are processed by a Convolutional Neural Network (CNN) that performs the load identification. Unlike prior works that primarily focus on algorithmic enhancements, this study introduces a complete hardware/software design of an FPGA-based SoC architecture and its real-time validation. The proposed architecture achieves an inference latency of 56 ms and a classification accuracy of 84.7% for fourteen classes (ON/OFF states of seven appliances), while reducing bandwidth usage by transmitting only the final identification instead of raw signals. These results demonstrate the feasibility of real-time NILM implementations at the edge with competitive performance.
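The event-driven windowing described above (detect an on/off switching, keep 4096 samples around it) can be sketched as follows; the function names, thresholding rule, and edge-padding policy are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

FS = 4000       # 4 kSPS sampling rate, as stated in the abstract
WINDOW = 4096   # samples kept around each detected event

def detect_events(power, threshold):
    """Indices where the power signal crosses the threshold (on/off switching)."""
    above = power > threshold
    return np.flatnonzero(np.diff(above.astype(int)) != 0) + 1

def extract_window(signal, event_idx, window=WINDOW):
    """Slice a fixed-length window centred on an event, zero-padding at the edges."""
    half = window // 2
    start, stop = event_idx - half, event_idx + half
    out = np.zeros(window, dtype=signal.dtype)
    lo, hi = max(start, 0), min(stop, len(signal))
    out[lo - start:lo - start + (hi - lo)] = signal[lo:hi]
    return out
```

Only these fixed-length windows reach the CNN, which is what makes the bandwidth saving possible: raw 4 kSPS streams never need to leave the device.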
Title: Edge computing System-on-Chip architecture for a Non-Intrusive Load Monitoring sensor in ambient intelligence applications
Published in: Microprocessors and Microsystems, vol. 121, Article 105250
Pub Date: 2025-12-31 | DOI: 10.1016/j.micpro.2025.105242
Haodong Zhao, Junna Shang
Currently, high-precision GNSS receivers are expensive, and the cost of deploying them in mobile communication networks is extremely high. To reduce the cost of building positioning and timing capabilities into mobile communication networks, the ordinary GNSS receivers already present in the network are formed into a self-differential enhanced iterative network that achieves high-precision positioning in local areas. Building on this high-precision positioning, the various delay errors in the 1PPS (one pulse per second) signal are corrected using differential information to solve for the precise time of the local clock, thereby improving timing accuracy. In engineering applications, the self-differential enhanced iterative network algorithm is embedded into the antenna parameter sensors commonly used in mobile communication networks. The improved antenna parameter sensor gains high-precision positioning and timing functions on top of its original attitude and direction measurement functions: its positioning accuracy can reach the millimeter level, and its timing accuracy can reach 20 nanoseconds.
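The paper's self-differential enhanced iterative network is not detailed in the abstract, but the underlying differential principle is standard: a reference receiver at a known position measures per-satellite range errors that nearby receivers share (clock, ionospheric, and tropospheric delays) and can subtract. A toy sketch under that assumption, with illustrative names throughout:

```python
def differential_correction(measured_ranges, reference_measured, reference_true):
    """Single-difference correction: the reference station's known position gives
    the true range to each satellite; the measured-minus-true residual captures
    the shared errors, which the nearby receiver subtracts from its own ranges."""
    corrections = {sv: reference_measured[sv] - reference_true[sv]
                   for sv in reference_measured}
    return {sv: r - corrections[sv] for sv, r in measured_ranges.items()}
```

The same residuals, applied to the timing of the 1PPS edge rather than to ranges, are what allow the local clock to be steered to tens of nanoseconds.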
Title: High-precision positioning and timing method of GNSS receiver for mobile communication networks
Published in: Microprocessors and Microsystems, vol. 121, Article 105242
Pub Date: 2025-12-30 | DOI: 10.1016/j.micpro.2025.105243
Eduardo Ortega, Agustín Martínez, Antonio Oliva, Fernando Sanz, Óscar Rodríguez, Manuel Prieto, Pablo Parra, Antonio da Silva, Sebastián Sánchez
The burgeoning interest within the space community in digital beamforming is largely attributable to the superior flexibility that satellites with active antenna systems offer for a wide range of applications, notably communication services. This paper presents the analysis and practical implementation of a Digital Beamforming and Digital Down-Conversion (DDC) chain, leveraging a high-speed Analog-to-Digital Converter (ADC) certified for space applications alongside a high-performance Field-Programmable Gate Array (FPGA). The design strategy optimizes resource efficiency and minimizes power consumption by placing the beamformer ahead of the complex down-conversion operation: demodulation and low-pass filtering are applied exclusively to the aggregated beam channel, markedly reducing the digital signal processing resources required relative to traditional, more resource-intensive digital beamforming and DDC architectures. For experimental validation, an evaluation board integrating a high-speed ADC and an FPGA was used, and the design's efficacy was verified by applying various RF input signals to the digital beamforming receiver system. The ADC provides high-resolution signal acquisition, while the FPGA supplies the computational flexibility and speed needed for real-time digital signal processing. The findings underscore the potential of this design to significantly enhance the efficiency and performance of digital beamforming systems in space applications.
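The resource-saving ordering (beamform first, then one shared down-conversion instead of one per antenna channel) can be illustrated with a small NumPy sketch; the mixing frequency, filter prototype, and weights below are placeholder values, not the paper's design:

```python
import numpy as np

def beamform_then_ddc(channels, weights, f_if, fs, taps=64):
    """Sum the weighted antenna channels first, then run ONE digital
    down-conversion (NCO mix + low-pass filter) on the combined beam,
    so the mixer and filter are instantiated once instead of per channel."""
    beam = np.tensordot(weights, channels, axes=1)        # weighted sum across antennas
    n = np.arange(beam.size)
    mixed = beam * np.exp(-2j * np.pi * f_if / fs * n)    # complex NCO mix to baseband
    lpf = np.sinc((np.arange(taps) - (taps - 1) / 2) / 4) # crude low-pass prototype
    lpf /= lpf.sum()                                      # unity DC gain
    return np.convolve(mixed, lpf, mode="same")
```

Mixing a real IF tone down produces a DC term plus an image at twice the IF frequency; the low-pass filter keeps only the DC (baseband) component of the combined beam.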
Title: A digital beamforming receiver architecture implemented on a FPGA for space applications
Published in: Microprocessors and Microsystems, vol. 121, Article 105243
Pub Date: 2025-12-25 | DOI: 10.1016/j.micpro.2025.105241
Cameron Vogeli, Daniel Llamocca
We present scalable and generalized hardware designs for k × k median filters based on separability of sorting networks, processing 4 pixels at a time. The fully customized (performance, bit-width) hardware architectures allow design space exploration to establish trade-offs between processing time and resource usage. Results are presented in terms of resources, processing cycles, and throughput. The architectures are truly scalable: hardware resources and processing time grow linearly (and progressively less steeply) as k grows. As far as we are aware, there are no competing works that use separability for k > 5. The proposed architectures, validated on modern FPGAs for k = 3, 5, 7, 9, 11, are expected to serve as building blocks in a variety of image processing applications.
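The separability idea (replace one k × k median with two 1-D passes) can be sketched in software before committing it to sorting-network hardware; note that the separable result only approximates the true 2-D median, which is the accuracy trade-off such designs accept:

```python
import numpy as np

def separable_median(img, k=3):
    """Approximate a k x k median with two 1-D median passes (rows, then
    columns). Separability is what lets hardware use short sorting networks
    of k inputs instead of one large network with k*k inputs."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")   # replicate borders
    H, W = img.shape
    tmp = np.zeros_like(padded)
    for i in range(padded.shape[0]):         # horizontal pass over every row
        for j in range(W):
            tmp[i, j + pad] = np.median(padded[i, j:j + k])
    out = np.empty_like(img)
    for i in range(H):                       # vertical pass on row-filtered data
        for j in range(W):
            out[i, j] = np.median(tmp[i:i + k, j + pad])
    return out
```

A single-pixel impulse is removed just as it would be by the exact 2-D median, which is why the separable form works well for impulse-noise suppression.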
Title: Scalable hardware designs for median filters based on separable sorting networks
Published in: Microprocessors and Microsystems, vol. 121, Article 105241
Pub Date: 2025-12-12 | DOI: 10.1016/j.micpro.2025.105237
Martí Alonso, Andreu Gironés, Juan-José Costa, Enric Morancho, Stefano Di Carlo, Ramon Canal
The fast-paced evolution of cyberattacks on digital infrastructures requires new protection mechanisms to counter them. Malware attacks, a class of cyberattack ranging from viruses and worms to ransomware and spyware, have traditionally been detected using signature-based methods. With ever-new malware variants, however, this approach is no longer sufficient, and machine learning tools look promising. In this paper we present two methods to detect Linux malware using machine learning models: (1) a dynamic approach that traces the instructions (opcodes) an application executes at runtime; and (2) a static approach that inspects the binary application files before execution. We evaluate (1) five machine learning models (Support Vector Machine, k-Nearest Neighbor, Naive Bayes, Decision Tree, and Random Forest) and (2) a deep neural network using a Long Short-Term Memory architecture with word embedding. We describe the methodology, the initial dataset preparation, the infrastructure used to obtain the traces of executed instructions, and the evaluation results for the different models. The results show that the dynamic approach with a Random Forest classifier reaches 90% accuracy or higher, while the static approach achieves 98% accuracy.
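For the classical models, a runtime opcode trace must first become a fixed-length feature vector. One common, simple choice (ours for illustration; the paper may use different features) is a normalized opcode histogram over a fixed vocabulary:

```python
from collections import Counter

def opcode_features(trace, vocabulary):
    """Turn a runtime opcode trace into a fixed-length frequency vector,
    the kind of feature classical ML models (SVM, k-NN, RF, ...) consume."""
    counts = Counter(trace)
    total = max(len(trace), 1)          # avoid division by zero on empty traces
    return [counts[op] / total for op in vocabulary]

vocab = ["mov", "push", "pop", "call", "ret", "jmp", "xor"]  # illustrative vocabulary
benign = ["mov", "mov", "call", "ret", "push", "pop"]        # made-up example traces
shady  = ["pop", "ret", "pop", "ret", "pop", "ret"]
```

Because the vector length is fixed by the vocabulary rather than the trace length, traces of any duration map to the same feature space.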
Title: Automatic linux malware detection using binary inspection and runtime opcode tracing
Published in: Microprocessors and Microsystems, vol. 120, Article 105237
Pub Date: 2025-12-04 | DOI: 10.1016/j.micpro.2025.105236
Adebayo Omotosho, Sirine Ilahi, Ernesto Cristopher Villegas Castillo, Christian Hammer, Hans-Martin Bluethgen
Return-oriented programming (ROP) chains together sequences of instructions residing in executable pages of memory to compromise a program's control flow. On embedded systems, ROP detection is intricate because such devices lack the resources to directly run sophisticated software-based detection techniques, which are memory- and CPU-intensive.
However, a Field Programmable Gate Array (FPGA) can enhance the capabilities of an embedded device to handle resource-intensive tasks. Hence, this paper presents the first performance evaluation of a Support Vector Machine (SVM) hardware accelerator for automatic ROP classification on Xtensa-embedded devices using hardware performance counters (HPCs).
In addition to meeting security requirements, modern cyber–physical systems must exhibit high reliability against hardware failures to ensure correct functionality. To assess the reliability level of our proposed SVM architecture, we perform simulation-based fault injection at the RT-level. To improve the efficiency of this evaluation, we utilize a hybrid virtual prototype that integrates the RT-level model of the SVM accelerator with the Tensilica LX7 Instruction Set Simulator. This setup enables early-stage reliability assessment, helping to identify vulnerabilities and reduce the need for extensive fault injection campaigns during later stages of the design process.
Our evaluation results show that an SVM accelerator targeting an FPGA device can detect and prevent ROP attacks on an embedded processor with high accuracy in real time. In addition, we identify the locations of our SVM design most vulnerable to permanent faults, enabling the exploration of safety mechanisms that increase fault coverage in future work.
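A linear SVM decision function is attractive for hardware because inference reduces to one multiply-accumulate per feature plus a comparison. A generic sketch (the weights, HPC feature layout, and class labels are illustrative, not SHAX's):

```python
def svm_classify(hpc_sample, weights, bias):
    """Linear SVM decision function sign(w . x + b) over a vector of hardware
    performance counter (HPC) readings -- one MAC per feature, one comparator."""
    score = sum(w * x for w, x in zip(weights, hpc_sample)) + bias
    return 1 if score >= 0 else -1   # +1: ROP-like behaviour, -1: benign
```

The fixed, data-independent structure of this computation is also what makes RT-level fault injection tractable: every fault site sits in a small, regular datapath.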
Title: SHAX: Evaluation of SVM hardware accelerator for detecting and preventing ROP on Xtensa
Published in: Microprocessors and Microsystems, vol. 120, Article 105236
Pub Date: 2025-11-21 | DOI: 10.1016/j.micpro.2025.105224
Roberto Ammendola, Andrea Biagioni, Carlotta Chiarini, Paolo Cretaro, Ottorino Frezza, Francesca Lo Cicero, Alessandro Lonardo, Michele Martinelli, Pier Stanislao Paolucci, Elena Pastorelli, Pierpaolo Perticaroli, Luca Pontisso, Cristian Rossi, Francesco Simula, Piero Vicini
High-speed interconnects are critical to providing robust and highly efficient services to every user in a cluster. Several commercial offerings – many now firmly established in the market – have arisen over the years, spanning the many possible tradeoffs between cost, reconfigurability, performance, resiliency, and support for a variety of processing architectures. Custom interconnects, on the other hand, may represent an appealing solution for applications requiring cost-effectiveness, customizability, and flexibility.
In this regard, the APEnet project was started in 2003, focusing on the design of PCIe FPGA-based custom Network Interface Cards (NICs) for cluster interconnects with a 3D torus topology. In this work, we highlight the main features of APEnetX, the latest version of the APEnet NIC. Designed on the Xilinx Alveo U200 card, it implements Remote Direct Memory Access (RDMA) transactions using both Xilinx UltraScale+ IPs and custom hardware and software components to ensure efficient data transfer without involving the host operating system. The software stack lets the user interface with the NIC directly via a low-level driver or through a plug-in for the OpenMPI stack, aligning our NIC with the application-layer standards of the HPC community. The APEnetX architecture integrates a Quality-of-Service (QoS) scheme to enforce some level of performance during network congestion events. Finally, APEnetX is accompanied by an OMNeT++-based simulator which enables probing the performance of the network when its size is pushed to numbers of nodes otherwise unattainable for cost and/or practicality reasons.
Title: Hardware and software design of APEnetX: A custom high-speed interconnect for scientific computing
Published in: Microprocessors and Microsystems, vol. 120, Article 105224
Pub Date : 2025-11-19DOI: 10.1016/j.micpro.2025.105226
Ioannis Tsounis, Dimitris Agiakatsikas, Mihalis Psarakis
Low-Latency Approximate Adders (LLAAs) are high-performance adder models that perform either approximate addition with configurable accuracy loss or accurate addition by integrating dedicated circuitry to detect and correct the expected approximation error. Owing to their block-based structure, these adder models offer lower latency at the expense of configurable accuracy loss and area overhead. However, hardware accelerators employing such adders are susceptible to hardware (HW) faults, which can cause extra errors (i.e., HW errors) in addition to the expected approximation errors during operation. In this work, we propose a novel Accuracy-Configurable, Low-Latency and Fault-Tolerant Adder, named ALFA, which offers 100% fault coverage while taking the required accuracy level into account. Our approach exploits the resemblance between HW errors and approximation errors to build a scheme based on selective Triple Modular Redundancy (TMR), which can detect and correct all errors that violate the accuracy threshold. The proposed ALFA model for approximate operation achieves significant performance gains with minimal area overhead compared to state-of-the-art Reduced Precision Redundancy (RPR) Ripple Carry Adders (RCAs) with the same level of fault tolerance. Furthermore, the accurate ALFA model outperforms the RCA with classical TMR in terms of performance.
{"title":"ALFA: Design of an accuracy-configurable and low-latency fault-tolerant adder","authors":"Ioannis Tsounis, Dimitris Agiakatsikas, Mihalis Psarakis","doi":"10.1016/j.micpro.2025.105226","DOIUrl":"10.1016/j.micpro.2025.105226","url":null,"abstract":"<div><div>Low-Latency Approximate Adders (LLAAs) are high-performance adder models that perform either approximate addition with configurable accuracy-loss or accurate addition by integrating proper circuitry to detect and correct the expected approximation error. Due to their block-based structure, these adder models offer lower latency at the expense of configurable accuracy loss and area overhead. However, hardware accelerators employing such adders are susceptible to hardware (HW) faults, which can cause extra errors (i.e., HW errors) in addition to the expected approximation errors during their operation. In this work, we propose a novel Accuracy Configurable Low-latency and Fault-tolerant Adder, namely ALFA, that offers 100% fault coverage taking into consideration the required accuracy level. Our approach takes advantage of the resemblance between the HW errors and the approximation errors to build a scheme based on selective Triple Modular Redundancy (TMR), which can detect and correct all errors that violate the accuracy threshold. The proposed ALFA model for approximate operation achieves significant performance gains with minimum area overhead compared to the state-of-the-art Reduced Precision Redundancy (RPR) Ripple Carry Adders (RCA) with the same level of fault-tolerance. 
Furthermore, the accurate ALFA model outperforms the RCA with classical TMR in terms of performance.</div></div>","PeriodicalId":49815,"journal":{"name":"Microprocessors and Microsystems","volume":"120 ","pages":"Article 105226"},"PeriodicalIF":2.6,"publicationDate":"2025-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145618617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
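As an illustration of the block-based LLAA principle described above, the following Python sketch (a hypothetical behavioral model, not the ALFA hardware) adds operands block by block with a speculated carry-in of zero and records the blocks where the exact carry chain contradicts that speculation; those are the "expected approximation errors" that a correction stage detects and repairs in accurate mode:

```python
# Hypothetical behavioral model of a block-based low-latency approximate
# adder; names and the 4-bit block size are illustrative assumptions.

def approx_block_add(a, b, width=16, block=4):
    """Return (approximate sum, list of blocks whose speculated carry-in of 0
    disagreed with the exact carry chain)."""
    mask = (1 << block) - 1
    result, carry, errors = 0, 0, []
    for i in range(0, width, block):
        ai = (a >> i) & mask
        bi = (b >> i) & mask
        if carry:                      # true carry-in differs from speculated 0
            errors.append(i // block)  # block needing correction
        result |= ((ai + bi) & mask) << i   # approximate block sum, carry-in = 0
        carry = (ai + bi + carry) >> block  # exact carry chain, detection only
    return result, errors

# 8 + 8 = 16 generates a carry out of block 0, so block 1 is flagged.
print(approx_block_add(8, 8))  # -> (0, [1])
```

Because every block adds in parallel, the critical path is one block width rather than the full adder width; the flagged blocks are exactly where correction logic (or, in ALFA's fault-tolerant setting, selective TMR) must intervene to meet a given accuracy threshold.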
Pub Date : 2025-11-17DOI: 10.1016/j.micpro.2025.105223
Ehsan Kabir , Jason D. Bakos , David Andrews , Miaoqing Huang
Transformer neural networks (TNN) excel in natural language processing (NLP), machine translation, and computer vision (CV) without relying on recurrent or convolutional layers. However, they have high computational and memory demands, particularly on resource-constrained devices like FPGAs. Moreover, transformer models vary in processing time across applications, requiring custom models with specific parameters. Designing custom accelerators for each model is complex and time-intensive. Some custom accelerators exist without runtime adaptability, and they often rely on sparse matrices to reduce latency. However, such hardware designs become more challenging due to the need for application-specific sparsity patterns. This paper introduces ADAPTOR, a runtime-adaptive accelerator for dense matrix computations in transformer encoders and decoders on FPGAs. ADAPTOR improves the utilization of processing elements and on-chip memory, enhancing parallelism and reducing latency. It incorporates efficient matrix tiling to distribute resources across FPGA platforms and is fully quantized for computational efficiency and portability. Evaluations on Xilinx Alveo U55C data center cards and embedded platforms like the VC707 and ZCU102 show that our design is 1.2× and 2.87× more power efficient than the NVIDIA K80 GPU and the i7-8700K CPU, respectively. Additionally, it achieves a speedup of 1.7× to 2.25× compared to some state-of-the-art FPGA-based accelerators.
{"title":"A runtime-adaptive transformer neural network accelerator on FPGAs","authors":"Ehsan Kabir , Jason D. Bakos , David Andrews , Miaoqing Huang","doi":"10.1016/j.micpro.2025.105223","DOIUrl":"10.1016/j.micpro.2025.105223","url":null,"abstract":"<div><div>Transformer neural networks (TNN) excel in natural language processing (NLP), machine translation, and computer vision (CV) without relying on recurrent or convolutional layers. However, they have high computational and memory demands, particularly on resource constrained devices like FPGAs. Moreover, transformer models vary in processing time across applications, requiring custom models with specific parameters. Designing custom accelerators for each model is complex and time-intensive. Some custom accelerators exist with no runtime adaptability, and they often rely on sparse matrices to reduce latency. However, hardware designs become more challenging due to the need for application-specific sparsity patterns. This paper introduces ADAPTOR, a runtime-adaptive accelerator for dense matrix computations in transformer encoders and decoders on FPGAs. ADAPTOR enhances the utilization of processing elements and on-chip memory, enhancing parallelism and reducing latency. It incorporates efficient matrix tiling to distribute resources across FPGA platforms and is fully quantized for computational efficiency and portability. Evaluations on Xilinx Alveo U55C data center cards and embedded platforms like VC707 and ZCU102 show that our design is 1.2<span><math><mo>×</mo></math></span> and 2.87<span><math><mo>×</mo></math></span> more power efficient than the NVIDIA K80 GPU and the i7-8700K CPU respectively. 
Additionally, it achieves a speedup of 1.7 to 2.25<span><math><mo>×</mo></math></span> compared to some state-of-the-art FPGA-based accelerators.</div></div>","PeriodicalId":49815,"journal":{"name":"Microprocessors and Microsystems","volume":"120 ","pages":"Article 105223"},"PeriodicalIF":2.6,"publicationDate":"2025-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145571199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
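The dense, quantized, tiled matrix computation at the heart of such an accelerator can be sketched in a few lines of Python. This is a hypothetical software analogue, not ADAPTOR's RTL: the tile size stands in for the processing-element array and on-chip buffer dimensions that the design distributes per FPGA platform, and int8 inputs with int32 accumulation model the full quantization the abstract mentions:

```python
# Hypothetical sketch of tiled int8 matrix multiply with int32 accumulation;
# the tile parameter is an illustrative stand-in for PE-array dimensions.
import numpy as np

def tiled_matmul_int8(A, B, tile=4):
    """C = A @ B with int8 inputs, accumulated in int32 tile by tile."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=np.int32)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):  # accumulate partial products per tile
                C[i:i+tile, j:j+tile] += (
                    A[i:i+tile, k:k+tile].astype(np.int32)
                    @ B[k:k+tile, j:j+tile].astype(np.int32)
                )
    return C

rng = np.random.default_rng(0)
A = rng.integers(-128, 127, (8, 8), dtype=np.int8)
B = rng.integers(-128, 127, (8, 8), dtype=np.int8)
assert np.array_equal(tiled_matmul_int8(A, B), A.astype(np.int32) @ B.astype(np.int32))
```

Processing one output tile at a time keeps the working set inside on-chip memory, and widening to int32 before accumulating avoids the overflow that repeated int8 products would cause.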