Pub Date: 2025-02-05, DOI: 10.1109/LES.2025.3538827
Jimin Lee;Soonhoi Ha
The rapid advancement of deep learning (DL) models has led to a pressing need for efficient on-device DL solutions, particularly for edge devices with limited resources. Processing-in-memory (PIM) is considered a promising technology for addressing the worsening memory wall problem by integrating processing capabilities directly into memory modules. This letter evaluates the potential of Samsung PIM technology for enhancing the performance of on-device language inference. We assess the impact of PIM on the inference stage of three transformer models (Gemma, Qwen2, and TinyBERT), demonstrating an average 1.92x speed-up in end-to-end latency over the CPU by offloading all linear layers to PIM. Notably, Qwen2, whose characteristics are favorable to PIM, achieves a 1.25x speed-up in end-to-end latency over the GPU. Our findings emphasize the importance of understanding model characteristics for effective PIM deployment. The results demonstrate the PIM solution’s efficiency in enabling on-device language models and its potential for edge deployment.
{"title":"Empowering Edge Devices With Processing-in-Memory for On-Device Language Inference","authors":"Jimin Lee;Soonhoi Ha","doi":"10.1109/LES.2025.3538827","DOIUrl":"https://doi.org/10.1109/LES.2025.3538827","url":null,"abstract":"The rapid advancement of deep learning (DL) models has led to a pressing need for efficient on-device DL solutions, particularly for edge devices with limited resources. processing-in-memory (PIM) technology is considered a promising technology to address the worsening memory wall problem by integrating processing capabilities directly into memory modules. This letter evaluates the potential of Samsung PIM technology in enhancing the performance of on-device language inference. We assess the impact of PIM on the inference stage of three transformer models, Gemma, Qwen2, and TinyBERT demonstrating an average 1.92x speed-up in end-to-end latency compared to CPU by offloading all linear layers to PIM. Notably, Qwen2, which has characteristics favorable to PIM, achieves a 1.25x speed-up in end-to-end latency compared to GPU. Our findings emphasize the importance of understanding model characteristics for effective PIM deployment. The results demonstrate the PIM solution’s efficiency in enabling on-device language models and its edge deployment potential.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 4","pages":"244-247"},"PeriodicalIF":2.0,"publicationDate":"2025-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144843066","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-02-04, DOI: 10.1109/LES.2025.3538470
Monalisa Das;Babita Jajodia
The demand for large integer polynomial multiplications has become increasingly significant in modern cryptographic algorithms. The practical implementation of such multipliers is a field of research focused on optimizing hardware designs for space and time complexity. In this letter, the authors propose an efficient polynomial multiplier based on a hybrid recursive Karatsuba multiplication (HRKM) algorithm. The overall performance of the proposed design is evaluated using the area-time-product (ATP) metric. The proposed architecture is implemented on a Virtex-7 FPGA device using the Xilinx ISE platform. Hardware implementation results show that the proposed HRKM architecture achieves ATP reductions of 67.885%, 70.128%, and 65.869% for 128, 256, and 512 bits, respectively, compared to hybrid (nonrecursive) Karatsuba multiplications.
{"title":"Hybrid Recursive Karatsuba Multiplications on FPGAs","authors":"Monalisa Das;Babita Jajodia","doi":"10.1109/LES.2025.3538470","DOIUrl":"https://doi.org/10.1109/LES.2025.3538470","url":null,"abstract":"The demand for large integer polynomial multiplications has become increasingly significant in modern cryptographic algorithms. The practical implementation of such multipliers presents a field of research focused on optimizing hardware design concerning space and time complexity. In this letter, the authors propose an efficient polynomial multiplier based on a hybrid recursive Karatsuba multiplication (HRKM) algorithm. The overall performance of the proposed design is evaluated using the area-time-product (ATP) metric. The hardware implementation of the proposed architecture is carried out on a Virtex-7 FPGA device using the Xilinx ISE platform. Hardware implementation results show that the proposed HRKM architecture shows ATP reduction of 67.885%, 70.128%, and 65.869% for 128, 256, and 512 bits, respectively, in comparison to Hybrid Karatsuba (nonrecursive) multiplications.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 4","pages":"240-243"},"PeriodicalIF":2.0,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144842961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-02-04, DOI: 10.1109/LES.2025.3538552
Faeze S. Banitaba;Sercan Aygun;Mehran Shoushtari Moghadam;Amirhossein Jalilvand;Bingzhe Li;M. Hassan Najafi
Deep learning excels by utilizing vast datasets and sophisticated training algorithms, achieving superior performance across many machine learning challenges compared to traditional methods. However, deep neural networks (DNNs) are not flawless; they are particularly susceptible to adversarial samples during the inference phase. These inputs are deliberately designed by attackers to cause DNNs to make incorrect classifications, exploiting the networks’ vulnerabilities. This letter proposes a novel perspective on fortifying neural network (NN) defenses against adversarial attacks. We enhance NN security by employing an emerging model of computation, namely, stochastic computing (SC). We show that strengthening an NN with SC counteracts the adverse effects of these attacks on the NN output and adds a vital defense layer. Our evaluation results reveal that SC notably increases NN robustness and decreases susceptibility to interference, creating secure, reliable NN systems. The proposed method improves accuracy and reduces hardware footprint and energy consumption by up to 85%, 88%, and 95%, respectively.
{"title":"Adversarial Attack Bypass by Stochastic Computing","authors":"Faeze S. Banitaba;Sercan Aygun;Mehran Shoushtari Moghadam;Amirhossein Jalilvand;Bingzhe Li;M. Hassan Najafi","doi":"10.1109/LES.2025.3538552","DOIUrl":"https://doi.org/10.1109/LES.2025.3538552","url":null,"abstract":"Deep learning excels by utilizing vast datasets and sophisticated training algorithms. It achieves superior performance across many machine learning challenges compared to traditional methods. However, deep neural networks (DNNs) are not flawless; they are particularly susceptible to adversarial samples during the inference phase. These inputs area deliberately designed by attackers to cause DNNs to make incorrect classifications, exploiting the networks’ vulnerabilities. This letter proposes a novel perspective to fortify the neural network (NN) defense against adversarial attacks. We enhance the NN security by employing an emerging model of computation, namely, stochastic computing (SC). We show that strengthening NN with SC counteracts the adverse effects of these attacks on an NN output and adds a vital defense layer. Our evaluation results reveal that SC notably increases NN robustness and decreases susceptibility to interference, creating secure, reliable NN systems. 
The proposed method improves accuracy and reduces hardware footprint and energy consumption by up to 85%, 88%, and 95%, respectively.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 4","pages":"234-239"},"PeriodicalIF":2.0,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144843086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
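For readers unfamiliar with SC, the core primitive is tiny: a unipolar value in [0, 1] is encoded as the fraction of 1s in a random bitstream, and multiplication becomes a bitwise AND of two independent streams. This minimal sketch (stream length and seed are arbitrary choices, not the letter's configuration) shows that primitive, whose inherent noise tolerance underlies the claimed robustness.

```python
# Minimal stochastic computing (SC) sketch: unipolar values in [0, 1] are
# encoded as random bitstreams; multiplication reduces to a bitwise AND.
# Stream length and seed are illustrative choices.
import random

def to_stream(p, n, rng):
    """Encode probability p as an n-bit stream with P(bit = 1) = p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def sc_multiply(a, b, n=4096, seed=0):
    rng = random.Random(seed)
    sa = to_stream(a, n, rng)
    sb = to_stream(b, n, rng)
    # AND gate: P(sa AND sb) = a * b for independent streams.
    return sum(x & y for x, y in zip(sa, sb)) / n

est = sc_multiply(0.5, 0.8)
print(est)  # a noisy estimate of 0.4; accuracy grows with stream length n
```

The estimate's standard deviation shrinks as 1/sqrt(n), so longer streams trade latency for precision, a typical SC design knob.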
Pub Date: 2025-02-03, DOI: 10.1109/LES.2025.3538159
Swati;Shantanu Banarjee;Pinalkumar Engineer
Convolutional neural networks (CNNs) are the epitome of artificial intelligence (AI)-based applications, with the computationally intensive convolution operation at the core of the architecture. Accelerating CNN-based applications requires several algorithm-level manipulations and optimizations for resource-constrained devices. In this work, we propose a template-based methodology for CNN acceleration on field programmable gate array (FPGA) hardware by designing reusable cores for individual layers such as convolution, pooling, and dense layers. We explored various optimization techniques, including data reuse and design space exploration, to arrive at the best hardware-design strategy. We verified our methodology for LeNet-5 with a $5\times 5$ kernel and a custom CNN with a $3\times 3$ kernel for classification. The hardware-system design was validated on a Xilinx XC7Z020 FPGA. Our proposed methodology achieves 2.9 GOP/s, outperforming an existing implementation by $1.28\times$.
{"title":"A Template-Based Methodology for Efficient DNNs Inference on FPGA Devices With HW-SW Co-Design","authors":"Swati;Shantanu Banarjee;Pinalkumar Engineer","doi":"10.1109/LES.2025.3538159","DOIUrl":"https://doi.org/10.1109/LES.2025.3538159","url":null,"abstract":"Convolutional neural networks (CNNs) are the epitome of artificial intelligence (AI)-based applications. The computationally intensive convolution operation is the core of the entire architecture. Acceleration of CNN-based applications requires several algorithmic level manipulations and optimization for resource-constrained devices. In this work, we have proposed a template-based methodology for CNN acceleration on field programmable gate arrays (FPGA) hardware by designing reusable cores for individual layers like convolution, pooling, and dense layers. We explored various optimization techniques to achieve the best-hardware-designing strategy with data reuse and design space exploration. We have verified our methodology for LeNet-5 with kernel <inline-formula> <tex-math>$5times 5$ </tex-math></inline-formula> and a custom CNN with kernel <inline-formula> <tex-math>$3times 3$ </tex-math></inline-formula> for classification. The hardware-system design was validated on FPGA Xilinx XC7Z020 FPGA. 
Our proposed methodology achieves 2.9 GOPS/s performance outperforming existing implementation by <inline-formula> <tex-math>$1.28times $ </tex-math></inline-formula>.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 4","pages":"280-283"},"PeriodicalIF":2.0,"publicationDate":"2025-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144843065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
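As a functional reference for what a reusable convolution core computes, here is a direct sliding-window 2-D convolution, the operation behind the 5x5 (LeNet-5) and 3x3 kernels mentioned above. This is a plain software sketch, not the letter's HLS template; the inner MAC loops are the part an FPGA design would unroll and pipeline.

```python
# Direct 2-D convolution (valid padding, stride 1): a software reference for
# a reusable convolution-layer core. Pure Python for clarity, not speed.
def conv2d(img, kernel):
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(img) - kh + 1, len(img[0]) - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for r in range(oh):
        for c in range(ow):
            acc = 0.0
            for i in range(kh):          # MAC loops: the candidates for
                for j in range(kw):      # unrolling/pipelining in hardware
                    acc += img[r + i][c + j] * kernel[i][j]
            out[r][c] = acc
    return out

# 3x3 averaging kernel over a 4x4 ramp image -> 2x2 output of local means
img = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
k = [[1 / 9] * 3 for _ in range(3)]
print(conv2d(img, k))
```

Data-reuse optimizations exploit the fact that adjacent output pixels share most of their input window, which is why line buffers pay off on FPGAs.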
Pub Date: 2025-02-03, DOI: 10.1109/LES.2025.3538013
Dahoon Jeong;Yooshin Kim;Donghoon Shin
Critical infrastructure (CI) is essential for societal and economic stability, making it a prime target for cyber threats. Traditional anomaly detection models like LSTM and Transformers require substantial computational resources, which are often unavailable in CI environments. Cloud computing offers on-demand resources but introduces privacy concerns due to the need to transmit sensitive data to cloud servers. Homomorphic encryption (HE) enables secure processing of encrypted data but is computationally intensive, particularly due to operations like bootstrapping. This letter proposes a bootstrapping-free lightweight anomaly detection model optimized for homomorphically encrypted data, leveraging CI’s operational characteristics. The model employs a two-stage data separation process and introduces state vectors for normal operation detection, forming an allowlist anomaly detection approach. Experimental results on the SWaT and WADI datasets demonstrate the model’s competitive performance and efficiency, with significantly reduced training times while maintaining robust security.
{"title":"Privacy-Preserving Anomaly Detection With Homomorphic Encryption for Industrial Control Systems in Critical Infrastructure","authors":"Dahoon Jeong;Yooshin Kim;Donghoon Shin","doi":"10.1109/LES.2025.3538013","DOIUrl":"https://doi.org/10.1109/LES.2025.3538013","url":null,"abstract":"Critical infrastructure (CI) is essential for societal and economic stability, making it a prime target for cyber threats. Traditional anomaly detection models like LSTM and Transformers require substantial computational resources, which are often unavailable in CI environments. Cloud computing offers on-demand resources but introduces privacy concerns due to the need to transmit sensitive data to cloud servers. Homomorphic encryption (HE) enables secure processing of encrypted data but is computationally intensive, particularly due to operations like bootstrapping. This letter proposes a bootstrapping-free lightweight anomaly detection model optimized for homomorphically encrypted data, leveraging CI’s operational characteristics. The model employs a two-stage data separation process and introduces state-vectors for normal operation detection, forming a allowlist anomaly detection approach. Experimental results on the SWaT and WADI datasets demonstrate the model’s competitive performance and efficiency, with significantly reduced training times while maintaining robust security.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 4","pages":"276-279"},"PeriodicalIF":2.0,"publicationDate":"2025-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144842984","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this letter, a microcontroller-based smart fault handling system (FHS) is proposed that can sense and manage a thermal runaway (TR) event early, in a real-time battery management system (BMS), through an online internal resistance (IR) computation method. TR is one of the critical issues that occur in the overcharging region of lithium-ion (Li-ion) batteries (LIBs) used in electric vehicles (EVs) and battery energy storage systems (BESSs). A proper subsystem is therefore urgently required in the BMS to detect a TR event early enough to automatically protect the battery modules from critical accidents such as fire and explosion. The developed smart FHS utilizes an efficient, cost-effective, and reliable online IR sensing-based early TR sensing (ETRS) system, which detects the TR event ~3.9 min before the TR onset point (outperforming other detection methods) and shuts down the charging mechanism. Additionally, this system sends an IoT-based short message service (SMS) alert notification to users, allowing them to take the necessary preventive steps.
{"title":"Online Internal Resistance Computation-Based Early Sensing of Thermal Runaway for Smart Fault Handling System (FHS) of Li-Ion Batteries","authors":"Abhijit Dey;Supratik Mondal;Biswajit Chakraborty;Sovan Dalai;Kesab Bhattacharya","doi":"10.1109/LES.2025.3535836","DOIUrl":"https://doi.org/10.1109/LES.2025.3535836","url":null,"abstract":"In this letter, a microcontroller-based smart fault handling system (FHS) is proposed which is capable of early sensing and managing the thermal runaway (TR) event in real-time battery management system (BMS) through online internal resistance (IR) computation method. In overcharging region of lithium-ion (Li-ion) batteries (LIBs), TR is one of the critical issues which occur when used in electric vehicles (EVs) and battery energy storage systems (BESSs). Therefore, a proper subsystem is utmost required in the BMS for detecting the TR event quite early, which will automatically prevent the battery modules from critical accidents like fire, explosion, etc. The developed smart FHS utilizes an efficient, cost-effective, and reliable online IR sensing-based early TR sensing (ETRS) system which detects the TR event ~3.9 min prior to the TR onset point (outperforming the other detection methods) and shuts down the charging mechanism. 
Additionally, this system sends an IoT-based short message service (SMS) alert notification to the users allowing them to take necessary preventive steps.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 4","pages":"284-287"},"PeriodicalIF":2.0,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144843087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
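The basic IR computation behind such a scheme is a voltage-over-current ratio across a current step, IR = dV / dI, with an alarm when IR rises well above its baseline. The numbers and the 1.5x trip ratio below are hypothetical illustrations, not the letter's calibration.

```python
# Illustrative online internal-resistance (IR) estimate across a current
# step, plus a simple trip rule. All thresholds/samples here are assumed
# values for illustration, not the letter's calibration.
def internal_resistance(v_before, i_before, v_after, i_after):
    di = i_after - i_before
    if di == 0:
        raise ValueError("need a current step to estimate IR")
    return (v_after - v_before) / di      # IR = dV / dI (ohms)

def tr_alarm(ir_ohms, baseline_ohms, trip_ratio=1.5):
    # Flag possible thermal-runaway onset when IR rises well above baseline.
    return ir_ohms > trip_ratio * baseline_ohms

ir = internal_resistance(4.10, 1.0, 4.25, 2.0)   # ~0.15 ohm
print(tr_alarm(ir, baseline_ohms=0.05))
```

In a real BMS the before/after samples would come from the charger's own current steps, so the estimate is "online": no dedicated excitation hardware is needed.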
Pub Date: 2025-01-27, DOI: 10.1109/LES.2025.3534237
Abhishek Yadav;Vyom Kumar Gupta;Binod Kumar
This work designs and implements a custom hardware accelerator for single-object classification from drone imagery for surveillance applications. A lightweight attention-based convolutional neural network (CNN) is developed and translated into a hardware implementation as an IP core. This accelerator is implemented in programmable logic (PL) and further optimized by incorporating buffers through high-level synthesis (HLS). The optimized PL is integrated with a processing system (PS), i.e., a ZYNQ UltraScale+ MPSoC, enabling a hardware/software co-design paradigm for the one-versus-rest classification task. The system architecture is tested through the PYNQ overlay process. Experimental results on the targeted ZCU104 embedded FPGA board show that the architecture is lightweight (1.97 MB) with 8.9 million trainable parameters. It performs 53.8 giga MAC operations, achieves an inference time of 1.05 ms and a throughput of 947.2 frames per second (FPS), consumes 5.65 W at 100 MHz, delivers an efficiency of 9.52 GOPs/W, and dissipates 0.006 J of energy per inference. Code and associated files are available at https://shorturl.at/iX0jw.
{"title":"Lightweight Surveillance Image Classification Through Hardware-Software Co-Design","authors":"Abhishek Yadav;Vyom Kumar Gupta;Binod Kumar","doi":"10.1109/LES.2025.3534237","DOIUrl":"https://doi.org/10.1109/LES.2025.3534237","url":null,"abstract":"This work designs and implements a custom hardware accelerator for single object classification from drone imagery, for surveillance applications. A lightweight attention-based convolutional neural network (CNN) is developed and translated into hardware implementation as an IP/core. This accelerator is implemented as programmable logic (PL) and further optimized with buffer incorporation through high-level synthesis (HLS). The optimized PL is integrated with a processing system (PS), i.e., ZYNQ UltraScale+ MPSoC, enabling a hardware/software co-design paradigm for enhancing the one-versus-rest classification task. The system architecture is tested through the PYNQ overlay process. Experimental results show the architecture is lightweight (1.97 MB) and requires 8.9 million trainable parameters, on the targeted ZCU104 embedded FPGA board. It performs 53.8 Giga MAC operations, achieves an inference time of 1.05 ms, and a throughput of 947.2 frames per second (FPS). It consumes 5.65 W of power at 100 MHz of frequency, shows an efficiency of 9.52 GOPs/W, and dissipates 0.006 J of energy per inference. 
Codes and subsequent files are available at <uri>https://shorturl.at/iX0jw</uri>.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 4","pages":"222-225"},"PeriodicalIF":2.0,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144842963","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Malware identification and classification is an active field of research. A popular approach is to classify malware binaries using visual analysis, by converting malware binaries into images, which reveal different class-specific patterns. To develop a highly accurate multiclass malware classifier, in this letter, we propose Kol-4-Gen, a set of four novel deep learning models based on the Kolmogorov-Arnold Network (KAN) with trainable activation functions, and a generative adversarial network (GAN) to address data imbalance (if applicable) during training. Our models, tested on the standard Malimg (grayscale, imbalanced, 25 classes), Malevis (RGB, balanced, 26 classes), and the miniature Virus-MNIST (grayscale, imbalanced, 10 classes) datasets, outperform state-of-the-art (S-O-T-A) models, achieving $\approx 99.36\%$, $\approx 95.44\%$, and $\approx 92.12\%$ validation accuracy, respectively.
{"title":"Kol-4-Gen: Stacked Kolmogorov-Arnold and Generative Adversarial Networks for Malware Binary Classification Through Visual Analysis","authors":"Anurag Dutta;Satya Prakash Nayak;Ruchira Naskar;Rajat Subhra Chakraborty","doi":"10.1109/LES.2025.3529625","DOIUrl":"https://doi.org/10.1109/LES.2025.3529625","url":null,"abstract":"Malware identification and classification is an active field of research. A popular approach is to classify malware binaries using visual analysis, by converting malware binaries into images, which reveal different class-specific patterns. To develop a highly accurate multiclass malware classifier, in this letter, we propose Kol-4-Gen, a set of four novel deep learning models based on the Kolmogorov-Arnold Network (KAN) with trainable activation functions, and a generative adversarial network (GAN) to address data imbalance (if applicable) during training. Our models, tested on the standard <monospace>Malimg</monospace> (grayscale, imbalanced, 25 classes), <monospace>Malevis</monospace> (RGB, balanced, 26 classes), and the miniature <monospace>Virus-MNIST</monospace> (grayscale, imbalanced, 10 classes) datasets, outperform state-of-the-art (S-O-T-A) models, achieving <inline-formula> <tex-math>$approx 99.36%$ </tex-math></inline-formula>, <inline-formula> <tex-math>$approx 95.44%$ </tex-math></inline-formula>, and <inline-formula> <tex-math>$approx 92.12%$ </tex-math></inline-formula> validation accuracy, respectively.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 4","pages":"268-271"},"PeriodicalIF":2.0,"publicationDate":"2025-01-23","publicationTypes":"Journal 
Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10851320","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144843062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This letter proposes a RISC-V-based accelerator for inference of models that use efficient sparse Winograd convolutional neural networks. The accelerator consists of a RISC-V processor (Andes NX27V) and a coprocessor; the latter performs the Winograd-ReLU convolutions and fully connected layers of the network. The pooling and ReLU layers of the network are executed by the processor in parallel with the coprocessor to increase throughput. In addition, on-chip buffers are used for the input/output data and filter weights to ensure pipelined operation. Implemented on an AMD VCU118 FPGA platform operating at 250 MHz, the accelerator achieves an average throughput of 5104.6 GOP/s when inferring a VGG16-based model.
{"title":"A RISC-V-Based High-Throughput Accelerator for Sparse Winograd CNN Inference on FPGA","authors":"Shabirahmed Badashasab Jigalur;Chang-Ling Tsai;Yu-Chi Shih;Yen-Cheng Kuan","doi":"10.1109/LES.2025.3531251","DOIUrl":"https://doi.org/10.1109/LES.2025.3531251","url":null,"abstract":"This letter proposes a RISC-V-based accelerator for inferring a model that uses efficient sparse Winograd convolutional neural networks. This accelerator consists of a RISC-V processor (Andes NX27V) and a coprocessor; the latter performs the Winograd-ReLU convolutions and fully connected layers of the network. The pooling and ReLU layers of the network are executed by the processor in parallel with the coprocessor to increase throughput. In addition, on-chip buffers are used for the input/output data and filter weights to ensure pipelined operation. Implemented on an AMD VCU118 FPGA platform operating at 250 MHz, the accelerator achieves an average throughput of 5104.6 GOP/s when inferring a VGG16-based model.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 4","pages":"256-259"},"PeriodicalIF":2.0,"publicationDate":"2025-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144843061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-16, DOI: 10.1109/LES.2025.3528326
Brian Maximiliano Gluzman;Ramiro Avalos Ribas;Jorge Castiñeira Moreira;Alejandro José Uriz;Juan Alberto Etcheverry
This letter presents the design, construction, and measurement of a quadrature coupler at 2 GHz in microstrip technology. The device divides the input signal into two outputs, each ideally 3 dB lower in power, with a phase difference of 90° between them. The main novelty is the design of four coupler models with different proposals for the intersections between the feed lines and the coupler sections. Based on the simulation results, the best-performing alternative is selected, and its dimensions are adjusted to operate at the design frequency. After construction, the device was validated using a vector network analyzer (VNA).
{"title":"Design, Construction, and Measurement of Branchline Coupler","authors":"Brian Maximiliano Gluzman;Ramiro Avalos Ribas;Jorge Castiñeira Moreira;Alejandro José Uriz;Juan Alberto Etcheverry","doi":"10.1109/LES.2025.3528326","DOIUrl":"https://doi.org/10.1109/LES.2025.3528326","url":null,"abstract":"This letter proposes the design, construction, and measurement of a quadrature coupler at 2 GHz using microstrip technology. The device has the ability to divide the input signal into two, each one ideally 3 dB lower in power, and with a phase difference of 90° between them. The main novelty is the design of four coupler models with different proposals for intersections between the feed lines and the coupler sections. Based on the simulation results obtained, the best performing alternative is selected, and its dimensions are adjusted to operate at the design frequency. After construction the device was validated using a vector network analyzer (VNA).","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 6","pages":"370-373"},"PeriodicalIF":2.0,"publicationDate":"2025-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145778174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}