
Latest publications in Microprocessors and Microsystems

Automatic linux malware detection using binary inspection and runtime opcode tracing
IF 2.6 · CAS Tier 4 (Computer Science) · Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-12-12 · DOI: 10.1016/j.micpro.2025.105237
Martí Alonso , Andreu Gironés , Juan-José Costa , Enric Morancho , Stefano Di Carlo , Ramon Canal
The fast-paced evolution of cyberattacks against digital infrastructures requires new protection mechanisms to counter them. Malware attacks, a class of cyberattack ranging from viruses and worms to ransomware and spyware, have traditionally been detected using signature-based methods. With new malware variants, however, this approach is no longer sufficient, and machine learning tools look promising. In this paper we present two methods to detect Linux malware using machine learning models: (1) a dynamic approach that traces the instructions (opcodes) an application executes at runtime; and (2) a static approach that inspects the binary application files before execution. We evaluate (1) five machine learning models (Support Vector Machine, k-Nearest Neighbor, Naive Bayes, Decision Tree and Random Forest) and (2) a deep neural network using a Long Short-Term Memory architecture with word embeddings. We describe the methodology, the initial dataset preparation, the infrastructure used to obtain the traces of executed instructions, and the evaluation of the results for the different models. The results show that the dynamic approach with a Random Forest classifier reaches 90% accuracy or higher, while the static approach reaches 98% accuracy.
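To make the dynamic approach concrete: before any of the five classifiers can be trained, the opcode trace must be turned into a fixed-size feature vector. A minimal sketch of one common featurization, n-gram frequency counting, in Python (the trace format, bigram size, and vocabulary below are illustrative assumptions, not taken from the paper):

```python
from collections import Counter

def opcode_ngrams(trace, n=2):
    """Count overlapping opcode n-grams in a trace of executed instructions."""
    return Counter(tuple(trace[i:i + n]) for i in range(len(trace) - n + 1))

def to_feature_vector(counts, vocabulary):
    """Project n-gram counts onto a fixed vocabulary, normalized to frequencies."""
    total = sum(counts.values()) or 1
    return [counts.get(gram, 0) / total for gram in vocabulary]

# Hypothetical opcode trace captured at runtime.
trace = ["mov", "add", "mov", "add", "jmp", "mov", "add"]
counts = opcode_ngrams(trace, n=2)
vocab = [("mov", "add"), ("add", "mov"), ("add", "jmp"), ("jmp", "mov")]
features = to_feature_vector(counts, vocab)  # ready for an RF/SVM-style classifier
```

The resulting frequency vector is what a Random Forest or SVM would consume, one vector per traced execution.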
Microprocessors and Microsystems, Volume 120, Article 105237.
Citations: 0
SHAX: Evaluation of SVM hardware accelerator for detecting and preventing ROP on Xtensa
IF 2.6 · CAS Tier 4 (Computer Science) · Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-12-04 · DOI: 10.1016/j.micpro.2025.105236
Adebayo Omotosho , Sirine Ilahi , Ernesto Cristopher Villegas Castillo , Christian Hammer , Hans-Martin Bluethgen
Return-oriented programming (ROP) chains together sequences of instructions residing in executable pages of memory to compromise a program's control flow. On embedded systems, ROP detection is intricate because such devices lack the resources to run sophisticated software-based detection techniques directly, as these are memory- and CPU-intensive.
However, a Field Programmable Gate Array (FPGA) can enhance the capabilities of an embedded device to handle resource-intensive tasks. Hence, this paper presents the first performance evaluation of a Support Vector Machine (SVM) hardware accelerator for automatic ROP classification on Xtensa-embedded devices using hardware performance counters (HPCs).
In addition to meeting security requirements, modern cyber–physical systems must exhibit high reliability against hardware failures to ensure correct functionality. To assess the reliability level of our proposed SVM architecture, we perform simulation-based fault injection at the RT-level. To improve the efficiency of this evaluation, we utilize a hybrid virtual prototype that integrates the RT-level model of the SVM accelerator with the Tensilica LX7 Instruction Set Simulator. This setup enables early-stage reliability assessment, helping to identify vulnerabilities and reduce the need for extensive fault injection campaigns during later stages of the design process.
Our evaluation results show that an SVM accelerator targeting an FPGA device can detect and prevent ROP attacks on an embedded processor in real time with high accuracy. In addition, we identify the locations of our SVM design most vulnerable to permanent faults, enabling the exploration of safety mechanisms that increase fault coverage in future work.
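At inference time, ROP classification from HPC readings reduces to evaluating the trained SVM's decision function on a feature vector, which is what such an accelerator implements in hardware. A sketch of the linear case in Python (the counter set, weights, and bias are invented for illustration; the real model is trained on Xtensa HPC data):

```python
def svm_decision(x, w, b):
    """Linear SVM decision value: <w, x> + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify_window(hpc_sample, w, b):
    """Label one window of hardware-performance-counter readings."""
    return "rop" if svm_decision(hpc_sample, w, b) > 0 else "benign"

# Invented weights: ROP gadget chains inflate return and misprediction rates.
w = [0.8, 0.6, -0.2]   # [return_rate, branch_mispredictions, icache_hit_rate]
b = -1.0
```

The hardware version evaluates the same dot product with fixed-point multipliers, which is also why fault injection into the multiply-accumulate datapath is the natural reliability concern.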
Microprocessors and Microsystems, Volume 120, Article 105236.
Citations: 0
Hardware and software design of APEnetX: A custom high-speed interconnect for scientific computing
IF 2.6 · CAS Tier 4 (Computer Science) · Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-11-21 · DOI: 10.1016/j.micpro.2025.105224
Roberto Ammendola , Andrea Biagioni , Carlotta Chiarini , Paolo Cretaro , Ottorino Frezza , Francesca Lo Cicero , Alessandro Lonardo , Michele Martinelli , Pier Stanislao Paolucci , Elena Pastorelli , Pierpaolo Perticaroli , Luca Pontisso , Cristian Rossi , Francesco Simula , Piero Vicini
High-speed interconnects are critical for providing robust and highly efficient services to every user in a cluster. Several commercial offerings – many of them now firmly established in the market – have arisen over the years, spanning the many possible tradeoffs between cost, reconfigurability, performance, resiliency and support for a variety of processing architectures. Custom interconnects, on the other hand, may represent an appealing solution for applications requiring cost-effectiveness, customizability and flexibility.
In this regard, the APEnet project was started in 2003, focusing on the design of PCIe FPGA-based custom Network Interface Cards (NICs) for cluster interconnects with a 3D torus topology. In this work, we highlight the main features of APEnetX, the latest version of the APEnet NIC. Designed on the Xilinx Alveo U200 card, it implements Remote Direct Memory Access (RDMA) transactions using both Xilinx Ultrascale+ IPs and custom hardware and software components to ensure efficient data transfer without involving the host operating system. The software stack lets the user interface with the NIC directly via a low-level driver or through a plug-in for the OpenMPI stack, aligning our NIC with the application-layer standards of the HPC community. The APEnetX architecture integrates a Quality-of-Service (QoS) scheme in order to enforce some level of performance during network congestion events. Finally, APEnetX is accompanied by an Omnet++ based simulator that enables probing the performance of the network when its size is pushed to node counts otherwise unattainable for cost and/or practicality reasons.
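For intuition on the 3D torus topology mentioned above: every node has exactly six neighbors, reached by a ±1 step along each axis with wraparound at the edges. A minimal sketch of the coordinate arithmetic (grid size chosen arbitrarily):

```python
def torus_neighbors(coord, dims):
    """Return the six neighbors of `coord` in a 3D torus of size `dims`."""
    x, y, z = coord
    dx, dy, dz = dims
    return [
        ((x + 1) % dx, y, z), ((x - 1) % dx, y, z),  # +/- step along X, wrapping
        (x, (y + 1) % dy, z), (x, (y - 1) % dy, z),  # +/- step along Y
        (x, y, (z + 1) % dz), (x, y, (z - 1) % dz),  # +/- step along Z
    ]

neighbors = torus_neighbors((0, 0, 0), (4, 4, 4))
```

The wraparound is what keeps the maximum hop count low and the per-node link count constant as the cluster grows.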
Microprocessors and Microsystems, Volume 120, Article 105224.
Citations: 0
ALFA: Design of an accuracy-configurable and low-latency fault-tolerant adder
IF 2.6 · CAS Tier 4 (Computer Science) · Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-11-19 · DOI: 10.1016/j.micpro.2025.105226
Ioannis Tsounis, Dimitris Agiakatsikas, Mihalis Psarakis
Low-Latency Approximate Adders (LLAAs) are high-performance adder models that perform either approximate addition with configurable accuracy loss or accurate addition by integrating circuitry to detect and correct the expected approximation error. Owing to their block-based structure, these adder models offer lower latency at the expense of configurable accuracy loss and area overhead. However, hardware accelerators employing such adders are susceptible to hardware (HW) faults, which can cause extra errors (i.e., HW errors) on top of the expected approximation errors during operation. In this work, we propose ALFA, a novel Accuracy-configurable, Low-latency and Fault-tolerant Adder that offers 100% fault coverage while taking the required accuracy level into account. Our approach exploits the resemblance between HW errors and approximation errors to build a scheme based on selective Triple Modular Redundancy (TMR) that can detect and correct all errors violating the accuracy threshold. For approximate operation, the proposed ALFA model achieves significant performance gains with minimal area overhead compared to state-of-the-art Reduced Precision Redundancy (RPR) Ripple Carry Adders (RCAs) with the same level of fault tolerance. Furthermore, the accurate ALFA model outperforms the RCA with classical TMR in terms of performance.
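To illustrate the error behavior the abstract describes (not the ALFA hardware itself): in a block-based adder, cutting carry propagation at block boundaries buys latency at the cost of occasional large errors, and a selective-correction scheme replaces the approximate sum only when that error exceeds the accuracy threshold. A software model of the idea in Python (block size, width, and threshold are illustrative assumptions):

```python
def approx_block_add(a, b, block_bits=8, width=32):
    """Block-based approximate add: carries are dropped at block boundaries,
    so each block adds independently and latency no longer grows with width."""
    mask = (1 << block_bits) - 1
    result = 0
    for shift in range(0, width, block_bits):
        s = ((a >> shift) & mask) + ((b >> shift) & mask)  # carry-in forced to 0
        result |= (s & mask) << shift
    return result

def selective_correct_add(a, b, threshold, block_bits=8, width=32):
    """Software model of selective correction: keep the approximate sum unless
    its error exceeds the accuracy threshold. (Hardware detects the violation
    with redundant circuitry instead of computing the exact sum as done here.)"""
    approx = approx_block_add(a, b, block_bits, width)
    exact = (a + b) & ((1 << width) - 1)
    return exact if abs(exact - approx) > threshold else approx
```

Adding 0xFF and 0x01 drops the carry out of the low block, producing 0 instead of 0x100, which is exactly the class of error the correction path must catch.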
Microprocessors and Microsystems, Volume 120, Article 105226.
Citations: 0
A runtime-adaptive transformer neural network accelerator on FPGAs
IF 2.6 · CAS Tier 4 (Computer Science) · Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-11-17 · DOI: 10.1016/j.micpro.2025.105223
Ehsan Kabir , Jason D. Bakos , David Andrews , Miaoqing Huang
Transformer neural networks (TNNs) excel in natural language processing (NLP), machine translation, and computer vision (CV) without relying on recurrent or convolutional layers. However, they have high computational and memory demands, particularly on resource-constrained devices such as FPGAs. Moreover, transformer models vary in processing time across applications, requiring custom models with specific parameters. Designing a custom accelerator for each model is complex and time-intensive. Some existing custom accelerators offer no runtime adaptability, and they often rely on sparse matrices to reduce latency; the need for application-specific sparsity patterns, however, makes the hardware design more challenging. This paper introduces ADAPTOR, a runtime-adaptive accelerator for dense matrix computations in transformer encoders and decoders on FPGAs. ADAPTOR improves the utilization of processing elements and on-chip memory, increasing parallelism and reducing latency. It incorporates efficient matrix tiling to distribute resources across FPGA platforms and is fully quantized for computational efficiency and portability. Evaluations on Xilinx Alveo U55C data center cards and embedded platforms like the VC707 and ZCU102 show that our design is 1.2× and 2.87× more power-efficient than the NVIDIA K80 GPU and the i7-8700K CPU, respectively. Additionally, it achieves a speedup of 1.7× to 2.25× over some state-of-the-art FPGA-based accelerators.
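The matrix-tiling idea mentioned above can be shown in a few lines: process the operands in fixed-size sub-blocks so that each block, once loaded into fast on-chip memory, is reused across many multiply-accumulates. A plain-Python sketch of blocked matrix multiply (tile size arbitrary; ADAPTOR's actual tiling is hardware-specific):

```python
def tiled_matmul(A, B, tile=2):
    """Blocked matrix multiply; the result equals the naive triple loop."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # iterate over output tiles
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):  # accumulate one A/B tile pair
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        for kk in range(k0, min(k0 + tile, k)):
                            C[i][j] += A[i][kk] * B[kk][j]
    return C
```

On an FPGA, each (i0, j0, k0) tile maps to a burst of on-chip BRAM reads feeding the processing-element array, which is where the bandwidth and parallelism gains come from.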
Microprocessors and Microsystems, Volume 120, Article 105223.
Citations: 0
Cardiac arrhythmia classification system: An optimized HLS-based hardware implementation on PYNQ platform
IF 2.6 · CAS Tier 4 (Computer Science) · Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-11-17 · DOI: 10.1016/j.micpro.2025.105225
Soumyashree Mangaraj, Kamalakanta Mahapatra, Samit Ari
Electrocardiogram (ECG) analysis is a popular non-invasive technique for diagnosing cardiac abnormalities. Deep learning (DL) architectures, and their hardware deployment at the edge, are crucial for effective diagnosis in smart healthcare applications. Running such inference on resource-limited FPGA platforms poses a significant challenge given the intense mathematical computations of DL architectures. Existing FPGA implementations of convolutional neural network (CNN) architectures typically adopt sequential deep convolutional stacking, which demands repeated memory accesses to retrieve data, ultimately degrading throughput and adding latency. A hardware-efficient tri-branch CNN architecture is introduced for arrhythmia classification, which leverages the FPGA's intrinsic parallel architecture and minimizes data-management overhead. The proposed CNN's hardware architecture is implemented in a high-level synthesis (HLS) framework through three key optimizations: (i) a pool-conv-graded-quantized (PCGQ) module, (ii) an in-pool merged function module, and (iii) a skip-zero connection. These enhancements improve layer-level precision, reduce quantization error, lower latency, and optimize FPGA resource utilization. Implemented on a PYNQ-Z2 FPGA, the design utilizes 27.79% of LUTs, 12.24% of FFs, 50.45% of DSPs, and 34.29% of BRAM, and delivers 347 GOPS throughput at 45 ms latency, validated in Vivado 2022.2. The proposed system is assessed on the MIT-BIH Arrhythmia Dataset in accordance with AAMI EC57 standards and attains a classification accuracy of 97.98% across five types of ECG beats, highlighting its suitability for portable healthcare applications.
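Quantization, which the PCGQ module applies per layer, can be sketched generically: map float weights to low-bit integers with a shared scale and accept a bounded rounding error. A minimal Python sketch of uniform symmetric quantization (a textbook scheme, not the paper's graded-quantization design; assumes at least one nonzero weight):

```python
def quantize(weights, bits=8):
    """Uniform symmetric quantization: floats -> signed ints with one scale."""
    qmax = (1 << (bits - 1)) - 1            # e.g. 127 for int8
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate floats; error is bounded by scale / 2 per weight."""
    return [qi * scale for qi in q]

q, scale = quantize([0.5, -1.0, 0.25])
recovered = dequantize(q, scale)
```

The per-layer ("graded") variant chooses a different bit width or scale per layer, trading precision where a layer can tolerate it for DSP and BRAM savings.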
Microprocessors and Microsystems, Volume 120, Article 105225.
Citations: 0
A CGRA frontend for bandwidth utilization in HiPReP
IF 2.6 · CAS Tier 4 (Computer Science) · Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-11-08 · DOI: 10.1016/j.micpro.2025.105220
Philipp Käsgen , Markus Weinhardt , Christian Hochberger
When dealing with multiple data consumers and producers in a highly parallel accelerator architecture, the challenge arises of how to coordinate requests to memory. An example of such an accelerator is a coarse-grained reconfigurable array (CGRA). CGRAs consist of multiple processing elements (PEs) that can consume and produce data. On the one hand, the resulting load and store requests to memory need to be orchestrated so that the CGRA does not deadlock when connected to a cache hierarchy that answers memory requests out of request order. On the other hand, multiple consumers and producers open up the possibility of making better use of the available memory bandwidth, keeping the cache constantly busy. We call the unit that addresses these challenges and opportunities the frontend (FE).
We propose a synthesizable FE for the HiPReP CGRA that enables integration with a RISC-V based host system. Based on an example application, we showcase a methodology for matching the number of consumers and producers (i.e., PEs) to the memory hierarchy so that the CGRA can efficiently harness the available L1 data cache bandwidth, reaching 99.6% of the theoretical peak bandwidth in a synthetic benchmark and achieving a speedup of up to 21.9x over an out-of-order processor for dense matrix-matrix multiplications. Moreover, we explore the FE design, the impact of different numbers of PEs, memory access patterns, and synthesis results, and compare the accelerator runtime with the runtime on the host itself as a baseline.
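The deadlock concern in the first paragraph stems from the cache answering out of request order; the standard remedy, tagging requests and draining responses in issue order, can be sketched as a small reorder buffer (a generic illustration, not HiPReP's actual FE microarchitecture):

```python
from collections import deque

class ReorderBuffer:
    """Hand responses downstream strictly in request order, even when the
    cache hierarchy completes them out of order."""

    def __init__(self):
        self.next_tag = 0
        self.pending = deque()  # tags in issue order
        self.done = {}          # tag -> data, completed but not yet drained

    def issue(self):
        """Allocate a tag for a new load request."""
        tag = self.next_tag
        self.next_tag += 1
        self.pending.append(tag)
        return tag

    def complete(self, tag, data):
        """Record a (possibly out-of-order) response from the cache."""
        self.done[tag] = data

    def drain(self):
        """Release every response whose turn has come, in issue order."""
        out = []
        while self.pending and self.pending[0] in self.done:
            out.append(self.done.pop(self.pending.popleft()))
        return out

rob = ReorderBuffer()
tags = [rob.issue() for _ in range(3)]
rob.complete(tags[1], "b")  # completes early, must wait for tags[0]
```

Because a PE only ever sees its data in issue order, no PE can block forever waiting on a response that was silently delivered ahead of its turn.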
Microprocessors and Microsystems, Volume 119, Article 105220.
Citations: 0
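The consumer/producer matching methodology described in the abstract above amounts to keeping enough memory requests in flight to cover the L1 access latency. A minimal sketch of that sizing argument, based on Little's law; the bandwidth, latency, request size, and per-PE request depth below are invented illustration values, not figures from the paper:

```python
import math

def pes_to_saturate_l1(peak_bw_bytes_per_cycle: float,
                       l1_latency_cycles: float,
                       request_bytes: int,
                       requests_in_flight_per_pe: int) -> int:
    """Return the minimum PE count whose combined outstanding requests
    can keep the L1 data cache busy every cycle (Little's law:
    outstanding = bandwidth * latency / request size)."""
    # Requests that must be in flight to hide the L1 access latency.
    outstanding = peak_bw_bytes_per_cycle * l1_latency_cycles / request_bytes
    # Round up to whole PEs, each sustaining a fixed number of requests.
    return math.ceil(outstanding / requests_in_flight_per_pe)

# Example: 64 B/cycle peak bandwidth, 4-cycle L1 latency, 8-byte
# requests, 4 outstanding requests per PE -> 8 PEs.
print(pes_to_saturate_l1(64, 4, 8, 4))  # -> 8
```

The design intuition is that bandwidth utilization is limited by outstanding requests, so either more PEs or deeper per-PE request queues can saturate the cache.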
Machine learning for predicting digital block layout feasibility in Analog-On-Top designs
IF 2.6 | CAS Tier 4 (Computer Science) | Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-11-04 | DOI: 10.1016/j.micpro.2025.105221
Francesco Daghero, Gabriele Faraone, Eugenio Serianni, Nicola Di Carolo, Giovanna Antonella Franchino, Michelangelo Grosso, Daniele Jahier Pagliari
The Analog-On-Top (AoT) Mixed-Signal (AMS) design flow is a time-consuming process, heavily reliant on expert knowledge and manual iteration. A critical step involves reserving top-level layout regions for digital blocks, which typically requires several back-and-forth exchanges between analog and digital teams due to the complex interplay of design constraints that affect the digital area requirements. Existing automated approaches often fail to generalize, as they are benchmarked on overly simplistic designs that lack real-world complexity. In this work, we frame the area adequacy check as a binary classification task and propose a Machine Learning (ML) solution to predict whether the reserved area for a digital block is sufficient. We conduct an extensive evaluation across multiple ML models on a dataset of production-level designs, achieving up to 94.38% F1 score with a Random Forest. Finally, we apply ensemble techniques to improve performance further, reaching 95.35% F1 with a majority-vote ensemble.
Microprocessors and Microsystems, Volume 119, Article 105221.
Citations: 0
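The majority-vote ensemble used in the final step of the work above can be sketched in a few lines of Python. The per-model predictions here are invented stand-ins; in the paper the votes come from trained classifiers such as a Random Forest:

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model label predictions by majority vote.
    `predictions_per_model` is a list of equal-length prediction
    lists, one list per model."""
    ensemble = []
    for sample_preds in zip(*predictions_per_model):
        # Most common label wins; ties resolve to the first-seen label.
        label, _count = Counter(sample_preds).most_common(1)[0]
        ensemble.append(label)
    return ensemble

# Three hypothetical models classifying five layout instances
# (1 = reserved area sufficient, 0 = insufficient).
model_a = [1, 0, 1, 1, 0]
model_b = [1, 1, 1, 0, 0]
model_c = [0, 0, 1, 1, 1]
print(majority_vote([model_a, model_b, model_c]))  # -> [1, 0, 1, 1, 0]
```

Hard voting of this kind only needs the discrete labels; a confidence-weighted (soft) vote would instead average per-class probabilities.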
FORTALESA: Fault-tolerant reconfigurable systolic array for DNN inference
IF 2.6 | CAS Tier 4 (Computer Science) | Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-10-29 | DOI: 10.1016/j.micpro.2025.105222
Natalia Cherezova, Artur Jutman, Maksim Jenihhin
The emergence of Deep Neural Networks (DNNs) in mission- and safety-critical applications brings their reliability to the front. High performance demands of DNNs require the use of specialized hardware accelerators. Systolic array architecture is widely used in DNN accelerators due to its parallelism and regular structure. This work presents a run-time reconfigurable systolic array architecture with three execution modes and four implementation options. All four implementations are evaluated in terms of resource utilization, throughput, and fault tolerance improvement. The proposed architecture is used for reliability enhancement of DNN inference on systolic array through heterogeneous mapping of different network layers to different execution modes. The approach is supported by a novel reliability assessment method based on fault propagation analysis. It is used for the exploration of the appropriate execution mode-layer mapping for DNN inference. The proposed architecture efficiently protects registers and MAC units of systolic array PEs from transient and permanent faults. The reconfigurability feature enables a speedup of up to 3×, depending on layer vulnerability. Furthermore, it requires 6× fewer resources compared to static redundancy and 2.5× fewer resources compared to the previously proposed solution for transient faults.
Microprocessors and Microsystems, Volume 119, Article 105222.
Citations: 0
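The redundancy principle behind the fault-tolerant execution modes above, computing a result on multiple replicas and voting on the outcome, can be illustrated with a generic triple-modular-redundancy (TMR) sketch. The `mac_tmr` helper and its fault-injection parameter are hypothetical illustrations, not the paper's hardware design:

```python
def tmr_vote(a, b, c):
    """Majority-vote over three redundant results. A single faulty
    replica is out-voted; if all three disagree, the fault is
    detected but not correctable."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise ValueError("uncorrectable: all three replicas disagree")

def mac_tmr(acc, x, w, fault=None):
    """Run one multiply-accumulate on three replicas; `fault`, if
    given, overwrites one replica to emulate a transient error."""
    results = [acc + x * w for _ in range(3)]
    if fault is not None:
        results[0] = fault  # corrupt a single replica
    return tmr_vote(*results)

# A transient fault in one replica is masked by the vote: 10 + 3*2 = 16.
print(mac_tmr(10, 3, 2, fault=999))  # -> 16
```

A duplication-only mode would detect the same fault (the two copies disagree) but could not tell which copy is correct, which is the usual area/coverage trade-off between redundancy modes.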
Power/accuracy-aware dynamic workload optimization combining application autotuning and runtime resource management on homogeneous architectures
IF 2.6 | CAS Tier 4 (Computer Science) | Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-10-20 | DOI: 10.1016/j.micpro.2025.105219
Roberto Rocco, Francesco Gianchino, Antonio Miele, Gianluca Palermo
Nowadays, most computing systems experience highly dynamic workloads with performance-demanding applications entering and leaving the system with an unpredictable trend. Ensuring their performance guarantees led to the design of adaptive mechanisms, including (i) application autotuners, able to optimize algorithmic parameters (e.g., frame resolution in a video processing application), and (ii) runtime resource management to distribute computing resources among the running applications and tune architectural knobs (e.g., frequency scaling). Past work investigates the two directions separately, acting on a limited set of control knobs and objective functions; instead, this work proposes a combined framework to integrate these two complementary approaches in a single two-level governor acting on the overall hardware/software stack. The resource manager incorporates a policy for computing resource distribution and architectural knobs to guarantee the required performance of each application while limiting the side effect on results quality and minimizing system power consumption. Meanwhile, the autotuner manages the applications’ software knobs, ensuring results’ quality and performance constraint satisfaction while hiding application details from the controller. Experimental evaluation carried out on a homogeneous architecture for workstation machines demonstrates that the proposed framework is stable and can save more than 72% of the power consumed by one-layer solutions.
Microprocessors and Microsystems, Volume 119, Article 105219.
Citations: 0
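The division of labor between the two levels described above, a resource manager acting on architectural knobs and an autotuner acting on application knobs, can be caricatured with a toy feedback loop. The throughput model, knob ranges, and step sizes below are invented for illustration only:

```python
def two_level_governor(target_fps, steps=20):
    """Toy two-level control loop: the resource manager scales the
    clock frequency first (hardware knob); only once the frequency
    range is exhausted does the autotuner lower the frame resolution
    (software knob), trading result quality for throughput.
    Invented throughput model: fps = freq_ghz * 30 / resolution_scale."""
    freq_ghz, resolution_scale = 1.0, 1.0  # initial knob settings
    fps = freq_ghz * 30 / resolution_scale
    for _ in range(steps):
        fps = freq_ghz * 30 / resolution_scale
        if fps >= target_fps:
            break                           # performance constraint met
        if freq_ghz < 3.0:                  # level 1: resource manager
            freq_ghz = min(3.0, freq_ghz + 0.5)
        else:                               # level 2: autotuner
            resolution_scale = max(0.25, resolution_scale - 0.25)
    return freq_ghz, resolution_scale, fps

# Meeting 60 fps needs only the hardware knob: 2.0 GHz at full resolution.
print(two_level_governor(60))  # -> (2.0, 1.0, 60.0)
```

Ordering the knobs this way mirrors the paper's goal of limiting the side effect on result quality: the quality-degrading software knob is touched only after the architectural knob alone can no longer meet the performance target.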