Pub Date: 2024-11-01 DOI: 10.1016/j.micpro.2024.105120
V. Hurbungs, T.P. Fowdur, V. Bassoo
Edge machine learning brings intelligence to low-power devices at the periphery of a network. By running machine learning algorithms on the Edge, classification can be performed faster without the need to transmit large data volumes across a network. However, on-device training is often not feasible since Edge devices have limited computing and storage resources. Improved, Scalable, Efficient, and Fast classifieR (iSEFR) is a classifier that performs both training and testing on low-power devices using linearly separable balanced datasets. The novelty of this work is the improvement of iSEFR accuracy by fine-tuning the algorithm with datasets having an uneven class distribution. Three adaptive linear function transformation techniques are proposed to improve the decision threshold, which takes the form of a linear function. Experiments using stratified sampling with 5-fold cross-validation demonstrate that one of the proposed techniques significantly improved F1-score, Recall and Matthews Correlation Coefficient (MCC) by an average of 23 %, 35 % and 21 %, respectively, compared to iSEFR. Further evaluation of this technique in a Fog environment using highly imbalanced datasets such as credit card fraud, network intrusion and diabetic retinopathy also showed significant increases of 38 %, 44 % and 30 % in F1-score, Recall and MCC, respectively, with a Precision of 97 %. The adaptive binary classifier maintained the time complexity of iSEFR without altering the class imbalance.
{"title":"An adaptive binary classifier for highly imbalanced datasets on the Edge","authors":"V. Hurbungs, T.P. Fowdur, V. Bassoo","doi":"10.1016/j.micpro.2024.105120","journal":{"name":"Microprocessors and Microsystems","volume":"111","pages":"Article 105120"},"publicationDate":"2024-11-01","publicationTypes":"Journal Article"}
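The abstract reports F1-score, Recall, Precision and MCC rather than plain accuracy. A minimal sketch of how these metrics follow from confusion-matrix counts is given below; the function name `binary_metrics` and the sample counts are hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute common binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # MCC uses all four cells, so it stays informative under class imbalance.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}


# Hypothetical imbalanced case: 10 positives among 1000 samples,
# and a classifier that always predicts the negative class.
always_negative = binary_metrics(tp=0, fp=0, fn=10, tn=990)

# Hypothetical classifier that finds 8 of the 10 positives with 2 false alarms.
better = binary_metrics(tp=8, fp=2, fn=2, tn=988)

print(always_negative)
print(better)
```

In the first case accuracy is high although no positive sample is detected, while Recall, F1 and MCC are all zero, which is why such metrics are preferred for highly imbalanced data.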
Pub Date: 2024-11-01 DOI: 10.1016/j.micpro.2024.105107
Petr Dobiáš, Thomas Garbay, Bertrand Granado, Khalil Hachicha, Andrea Pinna
Convolutional neural networks (CNNs) are progressively deployed on embedded systems, which is challenging because their computational and energy requirements need to be satisfied by devices with limited resources and power supplies. For instance, they can be implemented in the Internet of Things or edge computing, i.e., in applications using low-power and low-performance microcontroller units (MCUs). Monocore MCUs are not tailored to respond to the computational and energy requirements of CNNs due to their limited resources, but a multicore MCU can overcome these limitations. This paper presents an empirical study analysing three algorithms for scheduling CNNs on embedded systems at two different levels (neuron and layer levels) and evaluates their performance in terms of makespan and energy consumption using six neural networks, both in general and in the case of CubeSats. The results show that the SNN algorithm outperforms the other two algorithms (STD and STS) and that scheduling at the layer level significantly reduces the energy consumption. Therefore, embedded systems based on multicore MCUs are suitable for executing CNNs, and they can be used, for example, on board small satellites called CubeSats.
{"title":"Algorithms for scheduling CNNs on multicore MCUs at the neuron and layer levels","authors":"Petr Dobiáš, Thomas Garbay, Bertrand Granado, Khalil Hachicha, Andrea Pinna","doi":"10.1016/j.micpro.2024.105107","journal":{"name":"Microprocessors and Microsystems","volume":"111","pages":"Article 105107"},"publicationDate":"2024-11-01","publicationTypes":"Journal Article"}
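To make the notion of makespan concrete, the following sketch schedules independent tasks on several cores with a generic longest-processing-time-first greedy heuristic. It is not the STD, STS or SNN algorithm from the paper, whose details the abstract does not give. The task costs, the function names and the assumption that layers run one after another with a barrier in between are all hypothetical.

```python
import heapq


def lpt_makespan(task_costs, n_cores):
    """Greedy LPT list scheduling: give each task to the least-loaded core.

    Returns the makespan, i.e. the finishing time of the busiest core.
    """
    loads = [0] * n_cores
    heapq.heapify(loads)
    for cost in sorted(task_costs, reverse=True):
        lightest = heapq.heappop(loads)      # core that becomes free first
        heapq.heappush(loads, lightest + cost)
    return max(loads)


def network_makespan(layers, n_cores):
    """Assumed model: layers execute sequentially, tasks within a layer in parallel."""
    return sum(lpt_makespan(layer, n_cores) for layer in layers)


# Hypothetical per-task costs (arbitrary time units).
single_layer = lpt_makespan([5, 4, 3, 3, 2, 1], n_cores=2)
whole_network = network_makespan([[2, 2, 2, 2], [3, 3], [1]], n_cores=2)

print(single_layer, whole_network)
```

The last layer in the example contains a single task, so one core idles there; this kind of imbalance is what a scheduler at either granularity has to trade off against scheduling overhead.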
Pub Date: 2024-11-01 DOI: 10.1016/j.micpro.2024.105119
Yahya Jan, Lech Jóźwiak
This paper presents the results of our analysis of the main problems that have to be solved in the design of highly parallel high-performance accelerators for Deep Neural Networks (DNNs) used in low-power Cyber–Physical System (CPS) and Internet of Things (IoT) devices, in application areas such as smart automotive, health and smart services in social networks (Facebook, Instagram, X/Twitter, etc.). Our analysis demonstrates that to arrive at a high-quality DNN accelerator architecture, complex mutual trade-offs have to be resolved among the accelerator micro- and macro-architecture and the corresponding memory and communication architectures, as well as among performance, power consumption and area. Therefore, we developed a multi-processor accelerator design methodology involving an automatic design-space exploration (DSE) framework that enables very efficient construction and analysis of DNN accelerator architectures, as well as adequate trade-off exploitation. To satisfy the low-power demands of IoT devices, we extend our quality-driven model-based multi-processor accelerator design methodology with novel power optimization techniques at the processor and memory exploration stages. Our proposed power optimization techniques at the processor exploration stage achieve up to a 66.5% reduction in power consumption, while our proposed data reuse techniques avoid up to 85.92% of redundant memory accesses, thereby reducing the accelerator’s power consumption as required for low-power IoT applications. Currently, we are beginning to apply this methodology with the proposed power optimization techniques to the design of low-power DNN accelerators for IoT applications.
{"title":"Quality-driven design of deep neural network hardware accelerators for low power CPS and IoT applications","authors":"Yahya Jan, Lech Jóźwiak","doi":"10.1016/j.micpro.2024.105119","journal":{"name":"Microprocessors and Microsystems","volume":"111","pages":"Article 105119"},"publicationDate":"2024-11-01","publicationTypes":"Journal Article"}
Recently, a new Side-Channel Analysis (SCA)-based attack, namely the Optical Probing (OP) attack, has been shown to bypass the protection mechanisms implemented on a chip, allowing unauthorized access to confidential information such as stored security keys or Intellectual Property (IP). Several existing countermeasures against the OP attack require changes in the chip’s fabrication process, i.e., chip fabrication using OP-resistant materials, resulting in increased fabrication costs. Other countermeasures are implemented at the layout level. These countermeasures suffer from a significant drop in performance due to the use of custom logic cells. Additionally, available layout-level techniques against OP require designing the layout of the logic cell library from scratch, which is a time-consuming process. In this work, we mitigate these limitations and propose a methodology to design high-performance OP-attack-resistant circuits. The methodology is two-fold. Firstly, we design a high-performance Low optical Leakage-Dual Rail Logic (LoL-DRL) cell library based on a standard CMOS logic cell library. Hence, no complete redesign of the layout is required. Secondly, we propose a streamlined synthesis technique to synthesize OP-attack-resistant circuits from the original circuit’s netlist. Thus, our method seamlessly integrates into the existing synthesis flow. On top of that, we analyzed the optical leakage information of several logic cells from both the standard logic cell library and our proposed LoL-DRL logic cell library against the OP attack. We used a metric called Optical Leakage Value (OLV) to report the robustness of a logic cell against the OP attack. Furthermore, as a case study, we applied our design methodology to an open-source RISC-V core to design the first OP-attack-resistant RISC-V core, named Lo-RISK.
Our approach minimizes any adverse impact on performance yet incurs significant costs in terms of both area and power consumption, which is acceptable for an OP-secure end product. On average, our proposed LoL-DRL logic cell library exhibits 2× less information leakage through OP compared to the standard CMOS logic cell library. Our approach to designing OP-resistant circuits results in 2× the area and a 1.36× power increase while operating at the same frequency, in comparison to a circuit designed using a standard CMOS logic cell library.
{"title":"Lower the RISC: Designing optical-probing-attack-resistant cores","authors":"Sajjad Parvin, Sallar Ahmadi-Pour, Chandan Kumar Jha, Frank Sill Torres, Rolf Drechsler","doi":"10.1016/j.micpro.2024.105121","journal":{"name":"Microprocessors and Microsystems","volume":"111","pages":"Article 105121"},"publicationDate":"2024-11-01","publicationTypes":"Journal Article"}
Pub Date: 2024-11-01 DOI: 10.1016/j.micpro.2024.105118
Ghassem Jaberipur, Saeid Gorgin, Jeong-A. Lee
Serial binary multiplication is frequently used in many digital applications. In particular, left-to-right (aka online) manipulation of operands promotes the real-time generation of product digits for immediate utilization in subsequent online computations (e.g., successive layers of a neural network). In left-to-right arithmetic operations, where a residual is maintained for digit selection, utilization of a redundant number system for the representation of outputs is mandatory, while the input operands and the residual may be redundant or non-redundant. However, when the input data paths are narrow (e.g., eight bits as in BFloat16), conventional non-redundant representations of inputs and residual provide some advantages. For example, they allow immediate and costless sign detection of the residual, which is necessary for the next digit selection, a property not shared by redundant numbers. Nevertheless, digit selection, as practiced in previous realizations with both redundant and non-redundant inputs and/or residual, is slow and rather complex. Therefore, in this paper, we offer an imprecise but faster digit selection scheme, with the required correction in the next cycle. Analytical evaluations and synthesis of the proposed circuits on an FPGA platform show a 30 % speedup and lower cost with respect to both cases, i.e., with redundant and with non-redundant inputs and residual.
{"title":"Low-cost constant time signed digit selection for most significant bit first multiplication","authors":"Ghassem Jaberipur, Saeid Gorgin, Jeong-A. Lee","doi":"10.1016/j.micpro.2024.105118","journal":{"name":"Microprocessors and Microsystems","volume":"111","pages":"Article 105118"},"publicationDate":"2024-11-01","publicationTypes":"Journal Article"}
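The role of the residual and of the redundant output digits can be illustrated with a small textbook-style sketch of most-significant-digit-first multiplication. It is not the imprecise selection scheme proposed in the paper. The assumptions are: one operand `y` is fully available with |y| < 1, the other arrives digit by digit (digits in {-1, 0, 1}), exact selection by comparison with ±1/2 is used, and the output represents x·y/2. All names are hypothetical.

```python
from fractions import Fraction


def msb_first_multiply(x_digits, y):
    """Serial MSB-first multiplication with signed-digit output.

    x_digits: digits of x, most significant first, x = sum(d_i * 2**-i), i >= 1.
    y: Fraction with |y| < 1, available in full.
    Returns (p_digits, residual) with
        x * y / 2 == sum(p_j * 2**-j) + residual * 2**-n,  n = len(x_digits).
    """
    half = Fraction(1, 2)
    w = Fraction(0)                      # residual, kept within [-1/2, 1/2]
    p_digits = []
    for d in x_digits:
        v = 2 * w + d * y * half         # shift residual, absorb the new input digit
        if v >= half:
            p = 1
        elif v <= -half:
            p = -1
        else:
            p = 0
        w = v - p                        # new residual after emitting digit p
        p_digits.append(p)
    return p_digits, w


def sd_value(digits):
    """Value of an MSB-first signed-digit fraction."""
    return sum(Fraction(d, 2 ** (i + 1)) for i, d in enumerate(digits))


x_digits = [1, 0, 1, 1]                  # x = 11/16
y = Fraction(3, 4)
p_digits, residual = msb_first_multiply(x_digits, y)
reconstructed = sd_value(p_digits) + residual / 2 ** len(x_digits)
print(p_digits, residual, reconstructed)
```

Each output digit is committed before the remaining input digits are known, which is why the output must allow the digit -1: a later digit may need to correct an earlier overestimate.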
GPU architectures have become popular for executing general-purpose programs. In particular, they are some of the most efficient architectures for machine learning applications, which are among the most popular and demanding applications nowadays.
This paper presents SIMIL (SIMple Issue Logic for GPUs), an architectural modification to the issue stage that replaces scoreboards with a Dependence Matrix to track dependencies among instructions and avoid data hazards. We show that a Dependence Matrix is more effective in the presence of repetitive use of source operands, which is common in many applications. Besides, a Dependence Matrix with minor extensions can also support a simplistic out-of-order issue. Evaluations on an NVIDIA Tesla V100-like GPU show that SIMIL provides a speed-up of up to 2.39× in some machine learning programs and 1.31× on average for various benchmarks, while it reduces energy consumption by 12.81%, with only 1.5% area overhead. We also show that SIMIL outperforms a recently proposed approach for out-of-order issue that uses register renaming.
{"title":"SIMIL: SIMple Issue Logic for GPUs","authors":"Rodrigo Huerta, José-Lorenzo Cruz, Jose-Maria Arnau, Antonio González","doi":"10.1016/j.micpro.2024.105105","journal":{"name":"Microprocessors and Microsystems","volume":"111","pages":"Article 105105"},"publicationDate":"2024-10-09","publicationTypes":"Journal Article"}
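The general idea of tracking dependencies with a matrix can be sketched as follows. This is a generic dependence-matrix model and not SIMIL's actual hardware design, which the abstract does not detail. Row i holds one bit per slot j that is set while instruction i still waits on instruction j; issue and completion are collapsed into a single `complete` step for brevity. The class, method names and the three-instruction example are hypothetical.

```python
class DependenceMatrix:
    """Toy dependence matrix: bit j of rows[i] set means slot i waits on slot j."""

    def __init__(self, n_slots):
        self.rows = [0] * n_slots
        self.valid = [False] * n_slots

    def insert(self, slot, producers):
        """Place an instruction in `slot`, depending on the in-flight `producers`."""
        mask = 0
        for p in producers:
            if self.valid[p]:            # completed producers impose no dependence
                mask |= 1 << p
        self.rows[slot] = mask
        self.valid[slot] = True

    def ready(self):
        """Slots whose row is all zeros have no outstanding dependencies."""
        return [i for i, v in enumerate(self.valid) if v and self.rows[i] == 0]

    def complete(self, slot):
        """Clear column `slot` in every row and free the slot."""
        clear = ~(1 << slot)
        self.rows = [r & clear for r in self.rows]
        self.valid[slot] = False


# Hypothetical sequence: slot 0 loads r1, slot 1 computes r2 from r1,
# slot 2 computes r4 from r1 and r2.
dm = DependenceMatrix(4)
dm.insert(0, [])
dm.insert(1, [0])
dm.insert(2, [0, 1])

order = []
while dm.ready():
    slot = dm.ready()[0]
    order.append(slot)
    dm.complete(slot)
print(order)
```

Readiness is a per-row check that does not depend on how many times a source register is reused, which is one intuition for why repeated source operands are handled naturally in this representation.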
Pub Date: 2024-10-09 DOI: 10.1016/j.micpro.2024.105106
Francisco J. Iñiguez-Lomeli, Carlos H. Garcia-Capulin, Horacio Rostro-Gonzalez
Ellipse detection techniques are often developed and validated in software environments, neglecting the computational efficiency and resource constraints prevalent in embedded systems. Programmable logic devices, notably Field Programmable Gate Arrays (FPGAs), have emerged as indispensable assets for enhancing performance and accelerating various processing applications. In terms of computational efficiency, hardware implementations have the flexibility to tailor the required arithmetic to each application using fixed-point representation. This approach enables faster computations while maintaining adequate accuracy, resulting in reduced resource and energy consumption compared to software applications that rely on higher clock speeds. Additionally, hardware solutions provide portability and are suitable for resource-constrained and battery-powered applications. This study introduces a novel hardware architecture in the form of an intellectual property core that harnesses the capabilities of a genetic algorithm to detect single and multiple ellipses in digital images. In general, genetic algorithms have been shown to give better results than traditional methods such as the Hough Transform and Random Sample Consensus, particularly in terms of accuracy, flexibility, and robustness. Our genetic algorithm randomly takes five edge points from the tested image as parameters, creating an individual that is treated as a candidate ellipse. The fitness evaluation function determines whether the candidate ellipse truly exists in the image space. The core is designed using Vitis High-Level Synthesis (HLS), a powerful tool that converts C or C++ functions into Register-Transfer Level (RTL) code, including VHDL and Verilog.
The implementation and testing of the ellipse detection system were carried out on the PYNQ-Z1, a cost-effective development board housing the Xilinx Zynq-7000 System-on-Chip (SoC). PYNQ, an open-source framework, seamlessly integrates programmable logic with a dual-core ARM Cortex-A9 processor, offering the flexibility of Python programming for the onboard SoC processor. The experimental results, obtained by processing synthetic and real images (some of them noisy) with the developed ellipse detection system, highlight the intellectual property core’s suitability for resource-constrained embedded systems. Notably, it achieves remarkable performance and accuracy rates, exceeding 99% in most cases. This research aims to contribute to the advancement of hardware-accelerated ellipse detection, catering to the demanding requirements of real-time applications while minimizing resource consumption.
{"title":"A hardware architecture for single and multiple ellipse detection using genetic algorithms and high-level synthesis tools","authors":"Francisco J. Iñiguez-Lomeli, Carlos H. Garcia-Capulin, Horacio Rostro-Gonzalez","doi":"10.1016/j.micpro.2024.105106","journal":{"name":"Microprocessors and Microsystems","volume":"111","pages":"Article 105106"},"publicationDate":"2024-10-09","publicationTypes":"Journal Article"}
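Five points are enough to determine a conic, which is why an individual built from five edge points can encode a candidate ellipse. The sketch below shows only that geometric step and is not the paper's genetic algorithm or fitness function. It assumes the conic does not pass through the origin, so that it can be written as A·x² + B·xy + C·y² + D·x + E·y = 1, and it uses exact rational arithmetic instead of the fixed-point arithmetic a hardware core would use. Function names and sample points are hypothetical.

```python
from fractions import Fraction


def conic_through_points(points):
    """Solve A x^2 + B xy + C y^2 + D x + E y = 1 for five (x, y) points.

    Returns (A, B, C, D, E) as Fractions, or None if the system is singular.
    """
    rows = []
    for x, y in points:
        x, y = Fraction(x), Fraction(y)
        rows.append([x * x, x * y, y * y, x, y, Fraction(1)])
    n = 5
    # Gauss-Jordan elimination on the augmented 5x6 matrix.
    for col in range(n):
        pivot = next((r for r in range(col, n) if rows[r][col] != 0), None)
        if pivot is None:
            return None
        rows[col], rows[pivot] = rows[pivot], rows[col]
        inv = 1 / rows[col][col]
        rows[col] = [v * inv for v in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return tuple(rows[i][n] for i in range(n))


def is_ellipse(coeffs):
    """A non-degenerate conic is an ellipse when B^2 - 4AC < 0."""
    a, b, c, _, _ = coeffs
    return b * b - 4 * a * c < 0


# Five points on the ellipse x^2/4 + y^2 = 1.
sample = [(2, 0), (-2, 0), (0, 1), (0, -1), (Fraction(6, 5), Fraction(4, 5))]
coeffs = conic_through_points(sample)
print(coeffs, is_ellipse(coeffs))
```

A candidate obtained this way still has to be checked against the remaining edge pixels, which is the job the abstract assigns to the fitness evaluation function.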
Pub Date : 2024-10-01 DOI: 10.1016/j.micpro.2024.105104
Federico Favaro, Ernesto Dufrechou, Juan P. Oliver, Pablo Ezzatti
Sparse Matrix-Vector Multiplication (SpMV) is an essential operation in scientific and engineering fields, with applications in areas like finite element analysis, image processing, and machine learning. To address the need for faster and more energy-efficient computing, this paper investigates the acceleration of SpMV through Field-Programmable Gate Arrays (FPGAs), leveraging High-Level Synthesis (HLS) for design simplicity. Our study focuses on the AMD-Xilinx Alveo U280 FPGA, assessing the performance of the SpMV kernel from Vitis Libraries, which is the state of the art in SpMV acceleration on FPGAs. We explore kernel modifications, a transition to single precision, and varying partition sizes, demonstrating the impact of these changes on execution time. Furthermore, we investigate matrix preprocessing techniques, including Reverse Cuthill-McKee (RCM) reordering and a hybrid sparse storage format, to enhance efficiency. Our findings reveal that the performance of FPGA-accelerated SpMV is influenced by matrix characteristics, by smaller partition sizes, and by specific preprocessing techniques, which deliver notable performance improvements. By selecting the best results from these experiments, we achieved execution time improvements of up to 3.2×. This study advances the understanding of FPGA-accelerated SpMV, providing insights into key factors that impact performance and potential avenues for further improvement.
Title: Tuning high-level synthesis SpMV kernels in Alveo FPGAs. Microprocessors and Microsystems, vol. 110, Article 105104 (published 2024-10-01).
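As a reminder of what an SpMV kernel computes, the following is a minimal reference sketch in plain Python. It uses the common Compressed Sparse Row (CSR) layout purely for illustration; the storage format of the Vitis Libraries kernel and the hybrid format studied in the paper are not described in the abstract and are not reproduced here. The function names are hypothetical. The `bandwidth` helper reports the quantity that bandwidth-reducing reorderings such as RCM try to shrink.

```python
def dense_to_csr(dense):
    """Convert a dense row-major matrix into CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)
                indices.append(col)
        # indptr[i + 1] marks where row i ends inside data/indices.
        indptr.append(len(data))
    return data, indices, indptr


def spmv_csr(data, indices, indptr, x):
    """Compute y = A @ x, touching only the stored non-zero entries of A."""
    y = []
    for row in range(len(indptr) - 1):
        acc = 0
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]
        y.append(acc)
    return y


def bandwidth(indices, indptr):
    """Largest |row - col| over the stored entries of the matrix."""
    widest = 0
    for row in range(len(indptr) - 1):
        for k in range(indptr[row], indptr[row + 1]):
            widest = max(widest, abs(row - indices[k]))
    return widest


matrix = [
    [4, 0, 1],
    [0, 3, 0],
    [2, 0, 5],
]
data, indices, indptr = dense_to_csr(matrix)
result = spmv_csr(data, indices, indptr, [1, 2, 3])
```

The irregular, data-dependent access `x[indices[k]]` is the part that makes SpMV hard to accelerate. It is a plausible reason why matrix characteristics and reordering matter for performance, although the abstract itself does not state the mechanism.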
Pub Date : 2024-09-19 DOI: 10.1016/j.micpro.2024.105103
Francesco Cosimi, Antonio Arena, Sergio Saponara, Paolo Gai
This manuscript addresses the main safety issues of a mixed-criticality system running multiple concurrent tasks. Our concerns are guaranteeing Freedom from Interference between concurrent partitions and respecting the Worst-Case Execution Time of tasks. We are also interested in evaluating resource budgeting and in studying system behavior when random hardware failures occur. In this paper we present a set of Safety LOg PEripherals (SLOPE): Performance Monitoring Unit (PMU), Execution Tracing Unit (ETU), Error Management Unit (EMU), Time Management Unit (TMU) and Data Log Unit (DLU); we then propose an implementation of SLOPE on a single-core RISC-V architecture. These peripherals collect software and hardware information about execution and can, if needed, trigger recovery actions to mitigate potentially dangerous misbehavior. We show results of the hardware implementation and of the software testing of the units with a dedicated software library. For the PMU, we standardized the software layer according to the embedded Performance Application Programming Interface (ePAPI) and compared its functionality with a bare-metal use of the library. To test the ETU, we compared the hardware simulation results with software ones to understand whether overflow may occur in the internal hardware buffers during tracing. In conclusion, the designed devices introduce new instruments for system investigation on RISC-V technologies and can generate an execution profile for safety-related tasks.
Title: SLOPE: Safety LOg PEripherals implementation and software drivers for a safe RISC-V microcontroller unit. Microprocessors and Microsystems, vol. 110, Article 105103 (published 2024-09-19).
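The question of whether a trace buffer can overflow lends itself to a small software model. The sketch below is a hypothetical toy simulation in Python, not the ETU design or the authors' test setup. The buffer depth, the fixed drain period and the function name `first_overflow_cycle` are all assumptions made for illustration.

```python
def first_overflow_cycle(event_cycles, depth, drain_period):
    """Return the first cycle at which a bounded trace FIFO overflows, or None.

    Assumed model: at most one trace event arrives per listed cycle, and one
    entry is drained (for example, written to a log) every drain_period cycles.
    """
    pending = set(event_cycles)
    if not pending:
        return None
    occupancy = 0
    for cycle in range(max(pending) + 1):
        # Drain first, so a slot freed in this cycle can be reused immediately.
        if cycle % drain_period == 0 and occupancy > 0:
            occupancy -= 1
        if cycle in pending:
            if occupancy == depth:
                return cycle  # no free slot: this event would be lost
            occupancy += 1
    return None


# A burst of five back-to-back events against a slow drain.
shallow = first_overflow_cycle([1, 2, 3, 4, 5], depth=2, drain_period=4)
deep = first_overflow_cycle([1, 2, 3, 4, 5], depth=4, drain_period=4)
```

Comparing such a software prediction against hardware simulation traces is one way to size a buffer for the worst expected burst of events.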