When dealing with multiple data consumers and producers in a highly parallel accelerator architecture, the challenge arises of how to coordinate requests to memory. An example of such an accelerator is a coarse-grained reconfigurable array (CGRA). CGRAs consist of multiple processing elements (PEs) that can consume and produce data. On the one hand, the resulting load and store requests to memory need to be orchestrated such that the CGRA does not deadlock when connected to a cache hierarchy that responds to memory requests out of request order. On the other hand, multiple consumers and producers open up the possibility of making better use of the available memory bandwidth, keeping the cache constantly busy. We call the unit that addresses these challenges and opportunities the frontend (FE).
We propose a synthesizable FE for the HiPReP CGRA that enables integration with a RISC-V-based host system. Based on an example application, we showcase a methodology to match the number of consumers and producers (i.e., PEs) with the memory hierarchy such that the CGRA can efficiently harness the available L1 data cache bandwidth, reaching 99.6% of the theoretical peak bandwidth in a synthetic benchmark and enabling a speedup of up to 21.9x over an out-of-order processor for dense matrix-matrix multiplications. Moreover, we explore the FE design, the impact of different numbers of PEs and memory access patterns, and synthesis results, and we compare the accelerator runtime with the runtime on the host itself as a baseline.