To address the computational power and energy efficiency challenges in Llama2 large-model inference, this letter proposes a hardware-software co-design method and implements QLlama, an energy-efficient FPGA-based accelerator. First, the work employs a novel quantization method based on a microscaling data format, in which each subtensor block shares one scaling factor in E8M0 format, so that quantization and dequantization can be completed using only shift operations. Second, on this basis, a mixed-precision configuration is applied to the different layers of Llama2 to balance accuracy loss against computational efficiency. Finally, a dedicated accelerator, QLlama, is designed; its core units include a quantization unit for dynamic quantization, a vector-matrix multiplication unit for high-density computation on quantized weights, a scaled dot-product unit, and a basic operator unit. Experimental results show that this scheme achieves energy efficiency improvements of $2.13\sim 10.66\times$ with negligible accuracy loss, i.e., <0.2. Code: https://github.com/wendadawen/QLlama.
{"title":"QLlama: An FPGA-Based Microscaling Quantization Accelerator for Energy-Efficient Llama2 Inference","authors":"Hongbing Wen;Zihao Wang;Jiale Dong;Wenqi Lou;Lei Gong;Chao Wang;Xuehai Zhou","doi":"10.1109/LES.2025.3600563","DOIUrl":"https://doi.org/10.1109/LES.2025.3600563","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 5","pages":"337-340"},"PeriodicalIF":2.0,"publicationDate":"2025-10-16","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"145352262","RegionNum":4,"RegionCategory":"Computer Science"}
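The shared-scale idea in the QLlama abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the 8-bit integer element type, the rounding rule, and all function names are assumptions made for illustration. What it shows is that when a block shares a power-of-two scale `2**shared_exp`, quantization and dequantization reduce to exponent adjustments (shifts in hardware). The E8M0 byte is modeled as the exponent plus a bias of 127, following the OCP Microscaling convention.

```python
import math


def mx_quantize_block(values, elem_bits=8):
    """Quantize one block to signed integers that share a power-of-two scale.

    Returns (shared_exp, ints) such that values[i] ~= ints[i] * 2**shared_exp.
    The element width is an assumption; the source does not state it.
    """
    qmax = (1 << (elem_bits - 1)) - 1            # 127 for 8-bit elements
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0] * len(values)
    # Smallest exponent e for which amax / 2**e still fits in [-qmax, qmax].
    shared_exp = math.ceil(math.log2(amax / qmax))
    scale = 2.0 ** shared_exp
    ints = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return shared_exp, ints


def mx_dequantize_block(shared_exp, ints):
    """Dequantize by adjusting the exponent only (a shift in hardware)."""
    return [math.ldexp(q, shared_exp) for q in ints]


def e8m0_encode(shared_exp):
    """Store the shared exponent as an E8M0 byte (exponent only, bias 127)."""
    code = shared_exp + 127
    if not 0 <= code <= 254:                     # 255 is reserved for NaN
        raise ValueError("exponent out of E8M0 range")
    return code
```

For example, `mx_quantize_block([0.5, -1.0, 0.25, 0.0])` picks a shared exponent of -6, so every element is stored as an integer multiple of 1/64 and the whole block carries a single scale byte.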
Key-value (KV) stores based on log-structured merge trees (LSM-trees) have become vital for managing large-scale unstructured data. Recent studies have proposed hybrid zoned storage architectures, which combine host-managed shingled magnetic recording (HM-SMR) HDDs with zoned namespace (ZNS) SSDs to balance performance and cost, making them well suited to LSM-tree-based KV stores. Although a number of novel schemes have been developed to optimize write performance, garbage collection, and compaction overhead, read performance remains a critical challenge. Specifically, we observe that read requests often concentrate on low-performance HM-SMR HDDs, resulting in severe read bottlenecks. To address this issue, we propose hybrid zoned cache improvement (HZCI) to enhance read efficiency in hybrid zoned KV stores. First, we construct a hybrid-granularity zoned cache that leverages file access patterns to exploit the high-speed characteristics of ZNS SSDs. Second, we introduce an access-aware cache management strategy to manage the KV cache within ZNS SSDs. Finally, we design a compaction mechanism that balances read performance with compaction overhead, thereby improving cache efficiency. Experimental results show that HZCI improves average read throughput by 32%, 40%, and 52% compared to GearDB, ZoneKV, and SpanDB, respectively.
{"title":"Accelerating LSM-Tree KV Stores via Caching Hot Keys on Hybrid Zoned Storage","authors":"Shiqiang Nie;Menghan Li;Chi Zhang;Di Zhang;Weiguo Wu","doi":"10.1109/LES.2025.3599998","DOIUrl":"https://doi.org/10.1109/LES.2025.3599998","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 5","pages":"321-324"},"PeriodicalIF":2.0,"publicationDate":"2025-10-16","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"145352237","RegionNum":4,"RegionCategory":"Computer Science"}
Pub Date: 2025-07-29 DOI: 10.1109/LES.2025.3589350
Sherline Y. Cruz-Nava;Mario J. Rosas-Fregoso;Francisco López-Huerta;Rosa M. Woo-García;Edith Osorio-de-la-Rosa
Advances in communications have enabled the development of various types of radars for applications such as cartography, military industry, materials testing, air traffic control, and autonomous vehicle guidance. We present the design and implementation of an embedded system for generating radio frequency (RF) signals for radar applications. Using a Cyclone V field programmable gate array (FPGA) and the BladeRF 2.0 platform, the system employs direct digital synthesis (DDS) and in-phase/quadrature (I/Q) modulation techniques for precise signal generation. The carrier signal embeds information from linear frequency-modulated (LFM) signals, which are shifted in frequency to the S-band. Additionally, a synchronization module has been implemented to ensure precise activation during transmission. Simulation and experimental results demonstrate significant improvements in signal stability, flexibility, and precision. This development advances high-frequency embedded technologies with applications in wireless communications and radar detection systems.
{"title":"FPGA-Based RF Signal Generator for Radar Applications","authors":"Sherline Y. Cruz-Nava;Mario J. Rosas-Fregoso;Francisco López-Huerta;Rosa M. Woo-García;Edith Osorio-de-la-Rosa","doi":"10.1109/LES.2025.3589350","DOIUrl":"https://doi.org/10.1109/LES.2025.3589350","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 6","pages":"365-369"},"PeriodicalIF":2.0,"publicationDate":"2025-07-29","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"145778302","RegionNum":4,"RegionCategory":"Computer Science"}
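The DDS and I/Q generation of an LFM signal described in the radar abstract above can be sketched in software. This is a behavioral model under stated assumptions, not the authors' FPGA design: the 32-bit accumulator width, the function names, and the use of `math.cos`/`math.sin` in place of a hardware sine lookup table are all illustrative choices. The up-conversion to the S-band, which the RF front end would perform, is not modeled.

```python
import math

ACC_BITS = 32                 # assumed phase-accumulator width
ACC_MOD = 1 << ACC_BITS


def tuning_word(freq_hz, fs_hz):
    """Frequency tuning word: phase increment per sample, in accumulator units."""
    return round(freq_hz * ACC_MOD / fs_hz) % ACC_MOD


def dds_lfm_iq(f_start_hz, f_stop_hz, n_samples, fs_hz):
    """Generate baseband (I, Q) samples of a linear FM sweep.

    A first accumulator integrates frequency into phase; a second one
    ramps the tuning word so the frequency changes linearly with time.
    """
    ftw = tuning_word(f_start_hz, fs_hz)
    ftw_step = round((f_stop_hz - f_start_hz) * ACC_MOD / (fs_hz * n_samples))
    phase = 0
    samples = []
    for _ in range(n_samples):
        angle = 2.0 * math.pi * phase / ACC_MOD
        samples.append((math.cos(angle), math.sin(angle)))   # I and Q outputs
        phase = (phase + ftw) % ACC_MOD                      # phase accumulator
        ftw = (ftw + ftw_step) % ACC_MOD                     # frequency ramp
    return samples
```

Calling `dds_lfm_iq(-1e6, 1e6, 1000, 10e6)` would produce a 2 MHz wide sweep centered on zero frequency; negative frequencies work because the tuning word wraps modulo the accumulator size.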
Pub Date: 2025-07-16 DOI: 10.1109/LES.2025.3589547
Santiago Germino;Martín N. Menéndez;Ariel Lutenberg
Essential infrastructure and services depend on critical systems. To ensure that critical systems function properly, regular testing and monitoring are necessary. Establishing direct, dedicated data connections for remote testing can be expensive, while using public cellular, satellite, or fiber Internet connections can introduce privacy and security risks. Securing the medium often requires placing trust in third parties. This work proposes using zero-knowledge proofs, a modern cryptographic technique, to conduct secure remote testing and monitoring of critical systems over affordable public networks, which can include email or instant messaging apps. This approach guarantees both the integrity and confidentiality of the transmitted data, as well as the integrity of the processes involved in preparing the data for transmission. We present this approach and demonstrate its implementation through a real-world use case: the remote testing of an electronic railway interlocking system.
{"title":"Secure Protocol for Remote Testing Critical Systems Over Public Networks Using Zero-Knowledge Proofs","authors":"Santiago Germino;Martín N. Menéndez;Ariel Lutenberg","doi":"10.1109/LES.2025.3589547","DOIUrl":"https://doi.org/10.1109/LES.2025.3589547","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 6","pages":"427-430"},"PeriodicalIF":2.0,"publicationDate":"2025-07-16","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"145778164","RegionNum":4,"RegionCategory":"Computer Science"}
Pub Date: 2025-06-02 DOI: 10.1109/LES.2025.3575321
Micaela Benavides;Nicolas Nunovero;Jimmy Tarrillo
Temperature and humidity chambers are often employed for stress testing materials. In the case of cultural heritage materials, it is also crucial to incorporate light conditions in these evaluations. Consequently, controlling these parameters is essential to effectively simulate accelerated environmental conditions. This letter outlines the design and implementation of an embedded system that manages temperature, humidity, and light levels. Built on an ARM Cortex-M3 System on Chip, the system integrates various temperature and humidity sensors and actuators, along with a light controller. It also features an embedded user interface and facilitates communication with an external PC. The validity of our proposal is demonstrated through the implementation of a proportional-integral control mechanism for the regulation of temperature and relative humidity.
{"title":"Embedded System for Controlling Temperature, Relative Humidity, and Lighting for a Test Chamber","authors":"Micaela Benavides;Nicolas Nunovero;Jimmy Tarrillo","doi":"10.1109/LES.2025.3575321","DOIUrl":"https://doi.org/10.1109/LES.2025.3575321","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 6","pages":"415-418"},"PeriodicalIF":2.0,"publicationDate":"2025-06-02","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"145778247","RegionNum":4,"RegionCategory":"Computer Science"}
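Reading the control mechanism in the test-chamber abstract above as a proportional-integral (PI) loop, the regulation step can be sketched as follows. The gains, the output limits, the anti-windup rule, and the first-order thermal model are all hypothetical values chosen for illustration; none of them come from the letter.

```python
class PIController:
    """Discrete PI controller with output clamping and simple anti-windup."""

    def __init__(self, kp, ki, dt, out_min=0.0, out_max=1.0):
        self.kp, self.ki, self.dt = kp, ki, dt
        self.out_min, self.out_max = out_min, out_max
        self.integral = 0.0

    def update(self, setpoint, measurement):
        error = setpoint - measurement
        candidate = self.integral + error * self.dt
        output = self.kp * error + self.ki * candidate
        # Anti-windup: accumulate the integral only while the actuator
        # command is inside its limits.
        if self.out_min <= output <= self.out_max:
            self.integral = candidate
        return min(self.out_max, max(self.out_min, output))


def simulate_chamber(controller, setpoint, steps,
                     ambient=20.0, tau=50.0, heater_gain=1.0):
    """Toy first-order thermal model (assumed); returns the temperature trace.

    tau is the thermal time constant in seconds and heater_gain is the
    heating rate in degrees per second at full duty.
    """
    temp = ambient
    trace = []
    for _ in range(steps):
        duty = controller.update(setpoint, temp)
        temp += controller.dt * (heater_gain * duty - (temp - ambient) / tau)
        trace.append(temp)
    return trace
```

With `PIController(kp=0.2, ki=0.02, dt=1.0)` and a 40 degree setpoint, the heater saturates at full duty during the initial ramp and the integral term then removes the steady-state error. A relative-humidity loop could reuse the same class with its own gains and actuator limits.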
Pub Date: 2025-06-02 DOI: 10.1109/LES.2025.3575372
M. Celeste Cebedio;Lucas A. Rabioglio;Luciana De Micco
This study analyzes techniques for compressing generative autoencoders (AEs) to enable their deployment on resource-constrained devices, addressing the challenges and optimizations required for such environments. As a case study, we present a quantized generative AE optimized for efficiently generating underwater sound spectrograms. The model is evaluated across diverse scenarios, demonstrating its ability to produce low-dimensional spectrograms while adapting to various acoustic conditions. The hardware optimization process focuses on balancing computational efficiency and model accuracy, ensuring performance comparable to its nonquantized counterpart.
{"title":"Quantized Generative Autoencoder for Audio Spectrograms","authors":"M. Celeste Cebedio;Lucas A. Rabioglio;Luciana De Micco","doi":"10.1109/LES.2025.3575372","DOIUrl":"https://doi.org/10.1109/LES.2025.3575372","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 6","pages":"419-422"},"PeriodicalIF":2.0,"publicationDate":"2025-06-02","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"145778301","RegionNum":4,"RegionCategory":"Computer Science"}
Pub Date: 2025-04-29 DOI: 10.1109/LES.2025.3565235
Abraham Josue Delgado-Nava;Jorge Rivera;Susana Ortega-Cisneros
The application of stochastic computing (SC) to computing transcendental functions such as $\tanh(x)$, widely used as an activation function in convolutional neural networks, is an active research area. Currently, most works on computing functions via SC are based on the unipolar encoding format $(x\in [0,1])$. For this reason, a method based on the bipolar encoding format $(x\in [-1,1])$ is proposed here, with the goal of reducing both the implementation complexity and the correlation between stochastic bitstreams. To that end, a collection of existing methods is adapted. Moreover, to reduce the correlation between bitstreams, an algorithm is proposed for selecting different seeds for distinct linear feedback shift registers, which yields a low MSE. The seed selection, together with the adaptation of methods for implementing polynomials with SC digital circuits based on a bipolar encoding format, yields more accurate results. Simulations were carried out for the polynomial approximation of several functions. The function $\tanh(x)$ was compared with an existing solution, verifying the superior performance of the proposed approach.
{"title":"Seeding Algorithm for Bipolar Stochastic Computing for Polynomial Approximations","authors":"Abraham Josue Delgado-Nava;Jorge Rivera;Susana Ortega-Cisneros","doi":"10.1109/LES.2025.3565235","DOIUrl":"https://doi.org/10.1109/LES.2025.3565235","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"17 6","pages":"406-410"},"PeriodicalIF":2.0,"publicationDate":"2025-04-29","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"145778305","RegionNum":4,"RegionCategory":"Computer Science"}
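The role of seed choice in the bipolar stochastic computing abstract above can be illustrated with a small simulation. In the bipolar format a value $x\in[-1,1]$ is carried by a bitstream whose probability of a 1 is $(x+1)/2$, and multiplying two values reduces to an XNOR of two streams, which is only accurate when the streams are weakly correlated. The sketch below is not the authors' seeding algorithm: the 8-bit LFSR with taps (8, 6, 5, 4), the comparator-based stream generator, and the brute-force seed search are assumptions chosen to keep the example short.

```python
def lfsr8_states(seed):
    """Yield one full period (255 states) of an 8-bit Fibonacci LFSR."""
    state = seed & 0xFF
    if state == 0:
        raise ValueError("seed must be non-zero")
    for _ in range(255):
        yield state
        # Feedback taps for x^8 + x^6 + x^5 + x^4 + 1 (maximal length).
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 4)) & 1
        state = (state >> 1) | (bit << 7)


def bipolar_stream(x, seed):
    """Encode x in [-1, 1] as bits with P(1) close to (x + 1) / 2."""
    threshold = round((x + 1.0) / 2.0 * 256)
    return [1 if s < threshold else 0 for s in lfsr8_states(seed)]


def decode_bipolar(bits):
    """Recover the bipolar value from the fraction of ones."""
    return 2.0 * sum(bits) / len(bits) - 1.0


def sc_multiply(x, y, seed_x, seed_y):
    """Bipolar SC product: bitwise XNOR of the two encoded streams."""
    a = bipolar_stream(x, seed_x)
    b = bipolar_stream(y, seed_y)
    return decode_bipolar([1 - (p ^ q) for p, q in zip(a, b)])


def best_seed(x, y, seed_x=1):
    """Brute-force search for the second seed with the smallest product error."""
    return min(range(1, 256),
               key=lambda s: abs(sc_multiply(x, y, seed_x, s) - x * y))
```

Using the same seed for both operands makes the streams fully correlated, so `sc_multiply(0.5, -0.5, 1, 1)` lands near 0 instead of -0.25, while a seed returned by `best_seed(0.5, -0.5)` brings the estimate close to the true product. A practical selection would minimize the error over many operand pairs (for example the MSE) rather than a single one.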