Empowered by Large Language Models (LLMs), substantial progress has been made in enhancing the EDA design flow at the high-level synthesis stage, such as direct translation from a high-level language into an RTL description. In contrast, little research has addressed netlist generation in logic synthesis. Directly applying LLMs to netlist generation presents additional challenges due to the scarcity of netlist-specific data and the need for tailored fine-tuning and effective generation methods. This work first presents a novel training set and two evaluation sets catering to direct netlist generation with LLMs, together with an effective dataset construction pipeline used to build them. It then proposes LLM4Netlist, a novel step-based netlist generation framework built on a fine-tuned LLM. The framework consists of a step-based prompt construction module, a fine-tuned LLM, a code confidence estimator, and a feedback loop module, and is able to generate netlist code directly from natural language functional descriptions. We evaluate the efficacy of our approach on our novel evaluation datasets. The experimental results demonstrate that, compared to the average score of the 10 commercial LLMs listed in our experiments, our method achieves a 183.41% increase in functional correctness on the NetlistEval dataset and a 91.07% increase on NGen. The training and testing data, along with the processing code, can be found at https://github.com/klyebit/LLM4Netlist.git
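As a rough illustration of the control flow such a framework could follow, the sketch below chains a step-based prompt builder, an LLM call, a confidence estimate, and a feedback retry. All names (build_step_prompt, generate_netlist, the llm and estimator objects and their methods) are hypothetical placeholders for illustration, not the LLM4Netlist API.

```python
# Minimal sketch of a step-based generate/estimate/feedback loop (placeholder names).
from dataclasses import dataclass

@dataclass
class GenerationResult:
    netlist: str
    confidence: float

def build_step_prompt(spec: str, steps: list[str], feedback: str = "") -> str:
    """Assemble a prompt that walks the LLM through the design step by step."""
    numbered = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps))
    prompt = f"Functional description:\n{spec}\n\nFollow these steps:\n{numbered}\n"
    if feedback:
        prompt += f"\nPrevious attempt was rejected because:\n{feedback}\nPlease fix the netlist.\n"
    return prompt

def generate_netlist(llm, estimator, spec, steps, max_rounds=3, threshold=0.8):
    """Generate a gate-level netlist; re-prompt with feedback while confidence is low."""
    feedback = ""
    best = GenerationResult(netlist="", confidence=0.0)
    for _ in range(max_rounds):
        prompt = build_step_prompt(spec, steps, feedback)
        netlist = llm.generate(prompt)          # fine-tuned LLM inference
        conf = estimator.score(spec, netlist)   # code confidence estimator
        if conf > best.confidence:
            best = GenerationResult(netlist, conf)
        if conf >= threshold:
            break
        feedback = f"confidence {conf:.2f} is below the acceptance threshold {threshold}"
    return best
```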
{"title":"LLM4Netlist: LLM-Enabled Step-Based Netlist Generation From Natural Language Description","authors":"Kailiang Ye;Qingyu Yang;Zheng Lu;Heng Yu;Tianxiang Cui;Ruibin Bai;Linlin Shen","doi":"10.1109/JETCAS.2025.3568548","DOIUrl":"https://doi.org/10.1109/JETCAS.2025.3568548","url":null,"abstract":"Empowered by Large Language Models (LLMs), substantial progress has been made in enhancing the EDA design flow in terms of high-level synthesis, such as direct translation from high-level language into RTL description. On the other hand, little research has been done for logic synthesis on the netlist generation. A direct application of LLMs for netlist generation presents additional challenges due to the scarcity of netlist-specific data, the need for tailored fine-tuning, and effective generation methods. This work first presents a novel training set and two evaluation sets catered for direct netlist generation LLMs, and an effective dataset construction pipeline to construct these datasets. Then this work proposes <sc>LLM4Netlist</small>, a novel step-based netlist generation framework via fine-tuned LLM. The framework consists of a step-based prompt construction module, a fine-tuned LLM, a code confidence estimator, and a feedback loop module, and is able to generate netlist codes directly from natural language functional descriptions. We evaluate the efficacy of our approach with our novel evaluation datasets. The experimental results demonstrate that, compared to the average score of the 10 commercial LLMs listed in our experiments, our method shows a functional correctness increase of 183.41% on the NetlistEval dataset and a 91.07% increase on NGen. The training and testing data, along with the processing code, can be found at <uri>https://github.com/klyebit/LLM4Netlist.git</uri>","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"15 2","pages":"337-348"},"PeriodicalIF":3.7,"publicationDate":"2025-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144481942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatically designing fast and low-cost digital circuits is challenging because of the discrete nature of circuits and the enormous design space, particularly when exploring approximate circuits. However, recent advances in generative artificial intelligence (GAI) shed light on how to address these challenges. In this work, we present GPTAC, a domain-specific generative pre-trained (GPT) model customized for designing approximate circuits. By specifying the desired circuit accuracy or area, GPTAC can automatically generate an approximate circuit using its generative capabilities. We represent circuits using domain-specific language tokens, refined through a hardware description language keyword filter applied to gate-level code. This representation enables GPTAC to effectively learn approximate circuits from existing datasets by leveraging the GPT language model, as the training data can be directly derived from gate-level code. Additionally, by focusing on a domain-specific language, only a limited set of keywords is maintained, facilitating faster model convergence. To improve the success rate of the generated circuits, we introduce a circuit check rule that masks the GPTAC inference results when necessary. Experiments indicate that GPTAC is capable of producing approximate multipliers in under 15 seconds while utilizing merely 4GB of GPU memory, achieving a 10-40% reduction in area relative to the exact multiplier, depending on the accuracy requirements.
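To make the domain-specific token idea concrete, here is a small, hypothetical sketch of an HDL keyword filter over gate-level Verilog: structural keywords are kept verbatim, and every other identifier is collapsed onto a tiny shared net vocabulary, so the language model only ever sees a limited set of tokens. The keyword list and normalization rules are assumptions, not GPTAC's actual tokenizer.

```python
# Toy HDL keyword filter for gate-level code (assumed keyword set).
import re

GATE_KEYWORDS = {"module", "endmodule", "input", "output", "wire",
                 "and", "or", "nand", "nor", "xor", "xnor", "not", "buf"}

def tokenize_gate_level(verilog_src: str) -> list[str]:
    """Split gate-level code into tokens, mapping identifiers to generic net names."""
    raw = re.findall(r"[A-Za-z_][A-Za-z0-9_]*|[();,]", verilog_src)
    net_ids: dict[str, str] = {}
    tokens: list[str] = []
    for tok in raw:
        if tok in GATE_KEYWORDS or tok in "();,":
            tokens.append(tok)
        else:
            net_ids.setdefault(tok, f"n{len(net_ids)}")   # collapse unknown identifiers
            tokens.append(net_ids[tok])
    return tokens

print(tokenize_gate_level("module m(a, b, y); and g1(y, a, b); endmodule"))
# ['module', 'n0', '(', 'n1', ',', 'n2', ',', 'n3', ')', ';', 'and', 'n4', '(', ...]
```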
{"title":"GPTAC: Domain-Specific Generative Pre-Trained Model for Approximate Circuit Design Exploration","authors":"Sipei Yi;Weichuan Zuo;Hongyi Wu;Ruicheng Dai;Weikang Qian;Jienan Chen","doi":"10.1109/JETCAS.2025.3568606","DOIUrl":"https://doi.org/10.1109/JETCAS.2025.3568606","url":null,"abstract":"Automatically designing fast and low-cost digital circuits is challenging because of the discrete nature of circuits and the enormous design space, particularly in the exploration of approximate circuits. However, recent advances in generative artificial intelligence (GAI) have shed light to address these challenges. In this work, we present GPTAC, a domain-specific generative pre-trained (GPT) model customized for designing approximate circuits. By specifying the desired circuit accuracy or area, GPTAC can automatically generate an approximate circuit using its generative capabilities. We represent circuits using domain-specific language tokens, refined through a hardware description language keyword filter applied to gate-level code. This representation enables GPTAC to effectively learn approximate circuits from existing datasets by leveraging the GPT language model, as the training data can be directly derived from gate-level code. Additionally, by focusing on a domain-specific language, only a limited set of keywords is maintained, facilitating faster model convergence. To improve the success rate of the generated circuits, we introduce a circuit check rule that masks the GPTAC inference results when necessary. The experiment indicated that GPTAC is capable of producing approximate multipliers in under 15 seconds while utilizing merely 4GB of GPU memory, achieving a 10-40% reduction in area relative to the accuracy multiplier depending on various accuracy needs.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"15 2","pages":"349-360"},"PeriodicalIF":3.7,"publicationDate":"2025-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144481850","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-03-09. DOI: 10.1109/JETCAS.2025.3568716
Ashkan Moradifirouzabadi;Mingu Kang
Despite their remarkable success in achieving high performance, Transformer-based models impose substantial computational and memory bandwidth requirements, posing significant challenges for hardware deployment. A key contributor to these challenges is the large KV cache, which increases data movement costs in addition to the model parameters. While various token pruning techniques have been proposed to reduce the computational complexity and storage requirements of the attention mechanism by eliminating redundant tokens, these methods often introduce irregularities in the sparsity patterns that complicate hardware implementation. To address these challenges, we propose a hardware and algorithm co-design approach. Our solution features a Runtime Cache Eviction (RCE) algorithm that removes the least relevant tokens and replaces them with newly generated ones, maintaining a constant KV cache size across blocks and inputs. To support this algorithm, we design an accelerator equipped with a KV Memory Management Unit (KV-MMU), which efficiently manages active tokens through eviction and replacement, thereby optimizing DRAM storage and access. Additionally, our design integrates batch processing and an optimized processing pipeline to improve end-to-end throughput, effectively meeting the requirements of both pre-filling and generation stages. The proposed system achieves up to $8\times$ KV cache size reduction with minimal accuracy degradation. In a 65 nm process, the proposed accelerator demonstrates $1.52\times$ energy savings and $3.62\times$ delay reductions when processing a batch size of 16, with only a 1.11% energy overhead attributed to the specialized KV-MMU.
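The fixed-size-cache-with-eviction idea can be sketched behaviorally as below: when the cache is full, the token that has accumulated the least attention mass is dropped and the newly generated token takes its place. The relevance metric and the class interface are assumptions for illustration, not the paper's RCE algorithm or the KV-MMU hardware.

```python
# Behavioral sketch of a fixed-capacity KV cache with least-relevant eviction.
import numpy as np

class FixedSizeKVCache:
    def __init__(self, capacity: int, dim: int):
        self.capacity = capacity
        self.keys = np.empty((0, dim))
        self.values = np.empty((0, dim))
        self.relevance = np.empty(0)          # running attention mass per cached token

    def update_relevance(self, attn_weights: np.ndarray) -> None:
        """Accumulate the attention each cached token received from the new query."""
        self.relevance += attn_weights

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        """Add the newly generated token, evicting the least relevant one if full."""
        if len(self.keys) >= self.capacity:
            victim = int(np.argmin(self.relevance))
            self.keys = np.delete(self.keys, victim, axis=0)
            self.values = np.delete(self.values, victim, axis=0)
            self.relevance = np.delete(self.relevance, victim)
        self.keys = np.vstack([self.keys, k[None, :]])
        self.values = np.vstack([self.values, v[None, :]])
        self.relevance = np.append(self.relevance, 0.0)
```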
{"title":"End-to-End Acceleration of Generative Models With Runtime Regularized KV Cache Management","authors":"Ashkan Moradifirouzabadi;Mingu Kang","doi":"10.1109/JETCAS.2025.3568716","DOIUrl":"https://doi.org/10.1109/JETCAS.2025.3568716","url":null,"abstract":"Despite their remarkable success in achieving high performance, Transformer-based models impose substantial computational and memory bandwidth requirements, posing significant challenges for hardware deployment. A key contributor to these challenges is the large KV cache, which increases data movement costs in addition to the model parameters. While various token pruning techniques have been proposed to reduce the computational complexity and storage requirements of the attention mechanism by eliminating redundant tokens, these methods often introduce irregularities in the sparsity patterns that complicate hardware implementation. To address these challenges, we propose a hardware and algorithm co-design approach. Our solution features a Runtime Cache Eviction (RCE) algorithm that removes the least relevant tokens and replaces them with newly generated ones, maintaining a constant KV cache size across blocks and inputs. To support this algorithm, we design an accelerator equipped with a KV Memory Management Unit (KV-MMU), which efficiently manages active tokens through eviction and replacement, thereby optimizing DRAM storage and access. Additionally, our design integrates batch processing and an optimized processing pipeline to improve end-to-end throughput, effectively meeting the requirements of both pre-filling and generation stages. The proposed system achieves up to <inline-formula> <tex-math>$8times $ </tex-math></inline-formula> KV cache size reduction with minimal accuracy degradation. In a 65 nm process, the proposed accelerator demonstrates <inline-formula> <tex-math>$1.52times $ </tex-math></inline-formula> energy savings and <inline-formula> <tex-math>$3.62times $ </tex-math></inline-formula> delay reductions when processing a batch size of 16, with only a 1.11% energy overhead attributed to the specialized KV-MMU.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"15 2","pages":"217-230"},"PeriodicalIF":3.7,"publicationDate":"2025-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144481825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-03-09. DOI: 10.1109/JETCAS.2025.3568712
Gaoche Zhang;Dingyang Zou;Kairui Sun;Zhihuan Chen;Meiqi Wang;Zhongfeng Wang
Recent advancements in artificial intelligence (AI) models have intensified the need for specialized AI accelerators. The design of optimized general matrix multiplication (GEMM) modules tailored for these accelerators is crucial but time-consuming and expertise-demanding, creating a demand for automated design processes. Large language models (LLMs), capable of generating high-quality designs from human instructions, show great promise in automating GEMM module creation. However, the GEMM module’s vast design space and stringent performance requirements, along with the limitations of datasets and the lack of hardware performance awareness of LLMs, have made previous LLM-based register transfer level (RTL) code generation efforts unsuitable for GEMM design. To tackle these challenges, this paper proposes an automated performance-aware LLM-based framework, GEMMV, for generating high-correctness and high-performance Verilog code for GEMM. The framework utilizes in-context learning based on GPT-4 to automatically generate high-quality and well-annotated Verilog code for different variants of GEMM. Additionally, it leverages in-context learning to obtain performance awareness by integrating a multi-level performance model (MLPM) with fine-tuned LLMs. The Verilog code generated by this framework reduces latency by 3.1x and improves syntax correctness by 65% and functionality correctness by 70% compared to earlier efforts.
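As a loose illustration of an in-context-learning flow for this kind of task, the hypothetical helper below assembles a generation prompt from a GEMM specification, a performance target, and annotated example modules. The prompt wording, target fields, and function name are assumptions, not GEMMV's actual templates or its multi-level performance model.

```python
# Toy prompt builder for GEMM Verilog generation with a performance target.
from textwrap import dedent

def build_gemm_prompt(m: int, n: int, k: int, data_width: int,
                      target_latency_cycles: int, examples: list[str]) -> str:
    """Compose a generation prompt from a spec, a performance target, and example modules."""
    spec = dedent(f"""\
        Generate a synthesizable Verilog module computing C = A * B, where
        A is {m}x{k}, B is {k}x{n}, and elements are {data_width}-bit signed integers.
        Target latency: at most {target_latency_cycles} cycles at the chosen parallelism.
        Annotate each pipeline stage with a comment.""")
    shots = "\n\n".join(f"// Example {i + 1}\n{e}" for i, e in enumerate(examples))
    return f"{shots}\n\n{spec}"
```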
{"title":"GEMMV: An LLM-Based Automated Performance-Aware Framework for GEMM Verilog Generation","authors":"Gaoche Zhang;Dingyang Zou;Kairui Sun;Zhihuan Chen;Meiqi Wang;Zhongfeng Wang","doi":"10.1109/JETCAS.2025.3568712","DOIUrl":"https://doi.org/10.1109/JETCAS.2025.3568712","url":null,"abstract":"Recent advancements in artificial intelligence (AI) models have intensified the need for specialized AI accelerators. The design of optimized general matrix multiplication (GEMM) module tailored for these accelerators is crucial but time-consuming and expertise-demanding, creating a demand for automating design processes. Large language models (LLMs), capable of generating high-quality designs from human instructions, show great promise in automating GEMM module creation. However, the GEMM module’s vast design space and stringent performance requirements, along with the limitations of datasets and the lack of hardware performance awareness of LLMs, have made previous LLM-based register transfer level (RTL) code generation efforts unsuitable for GEMM design. To tackle these challenges, this paper proposes an automated performance-aware LLM-based framework, GEMMV, for generating high-correctness and high-performance Verilog code for GEMM. This framework utilizes in-context learning based on GPT-4 to automatically generate high-quality and well-annotated Verilog code for different variants of the GEMM. Additionally, it leverages in-context learning to obtain performance awareness by integrating a multi-level performance model (MLPM) with fine-tuned LLMs. The Verilog code generated by this framework reduces latency by 3.1x and improves syntax correctness by 65% and functionality correctness by 70% compared to earlier efforts.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"15 2","pages":"325-336"},"PeriodicalIF":3.7,"publicationDate":"2025-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144481851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-03-05. DOI: 10.1109/JETCAS.2025.3566929
Gian Singh;Sarma Vrudhula
Large language models (LLMs) have achieved high accuracy in diverse NLP and computer vision tasks due to self-attention mechanisms relying on GEMM and GEMV operations. However, scaling LLMs poses significant computational and energy challenges, particularly for traditional Von-Neumann architectures (CPUs/GPUs), which incur high latency and energy consumption from frequent data movement. These issues are even more pronounced in energy-constrained edge environments. While DRAM-based near-memory architectures offer improved energy efficiency and throughput, their processing elements are limited by strict area, power, and timing constraints. This work introduces CIDAN-3D, a novel Processing-in-Memory (PIM) architecture tailored for LLMs. It features an ultra-low-power Neuron Processing Element (NPE) with high compute density (#Operations/Area), enabling efficient in-situ execution of LLM operations by leveraging high parallelism within DRAM. CIDAN-3D reduces data movement, improves locality, and achieves substantial gains in performance and energy efficiency, showing up to $1.3\times$ higher throughput and $21.9\times$ better energy efficiency for smaller models, and $3\times$ throughput and $71\times$ energy improvement for large decoder-only models compared to prior near-memory designs. As a result, CIDAN-3D offers a scalable, energy-efficient platform for LLM-driven Gen-AI applications.
{"title":"A Scalable and Energy-Efficient Processing-in-Memory Architecture for Gen-AI","authors":"Gian Singh;Sarma Vrudhula","doi":"10.1109/JETCAS.2025.3566929","DOIUrl":"https://doi.org/10.1109/JETCAS.2025.3566929","url":null,"abstract":"Large language models (LLMs) have achieved high accuracy in diverse NLP and computer vision tasks due to self-attention mechanisms relying on GEMM and GEMV operations. However, scaling LLMs poses significant computational and energy challenges, particularly for traditional Von-Neumann architectures (CPUs/GPUs), which incur high latency and energy consumption from frequent data movement. These issues are even more pronounced in energy-constrained edge environments. While DRAM-based near-memory architectures offer improved energy efficiency and throughput, their processing elements are limited by strict area, power, and timing constraints. This work introduces CIDAN-3D, a novel Processing-in-Memory (PIM) architecture tailored for LLMs. It features an ultra-low-power Neuron Processing Element (NPE) with high compute density (#Operations/Area), enabling efficient in-situ execution of LLM operations by leveraging high parallelism within DRAM. CIDAN-3D reduces data movement, improves locality, and achieves substantial gains in performance and energy efficiency—showing up to <inline-formula> <tex-math>$1.3times $ </tex-math></inline-formula> higher throughput and <inline-formula> <tex-math>$21.9times $ </tex-math></inline-formula> better energy efficiency for smaller models, and <inline-formula> <tex-math>$3times $ </tex-math></inline-formula> throughput and <inline-formula> <tex-math>$71times $ </tex-math></inline-formula> energy improvement for large decoder-only models compared to prior near-memory designs. As a result, CIDAN-3D offers a scalable, energy-efficient platform for LLM-driven Gen-AI applications.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"15 2","pages":"285-298"},"PeriodicalIF":3.7,"publicationDate":"2025-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144481870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-02-10. DOI: 10.1109/JETCAS.2025.3540360
Christian Herglotz;Daniel Palomino;Olivier Le Meur;C.-C. Jay Kuo
The past years have shown that, due to the global success of video communication technology, the corresponding hardware systems now contribute significantly to pollution and resource consumption on a global scale, accounting for 1% of global greenhouse gas emissions in 2018. This aspect of sustainability has thus received increasing attention in academia and industry. In this paper, we present different aspects of sustainability, including resource consumption and greenhouse gas emissions, with a major focus on the energy consumed during the use of video systems. Finally, we provide an overview of recent research in the domain of green video communications, showing promising results and highlighting areas where more research should be performed.
{"title":"Circuits and Systems for Green Video Communications: Fundamentals and Recent Trends","authors":"Christian Herglotz;Daniel Palomino;Olivier Le Meur;C.-C. Jay Kuo","doi":"10.1109/JETCAS.2025.3540360","DOIUrl":"https://doi.org/10.1109/JETCAS.2025.3540360","url":null,"abstract":"The past years have shown that due to the global success of video communication technology, the corresponding hardware systems nowadays contribute significantly to pollution and resource consumption on a global scale, accounting for 1% of global green house gas emissions in 2018. This aspect of sustainability has thus reached increasing attention in academia and industry. In this paper, we present different aspects of sustainability including resource consumption and greenhouse gas emissions, while putting a major focus on the energy consumption during the use of video systems. Finally, we provide an overview on recent research in the domain of green video communications showing promising results and highlighting areas where more research should be performed.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"15 1","pages":"4-15"},"PeriodicalIF":3.7,"publicationDate":"2025-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143602009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-02-07. DOI: 10.1109/JETCAS.2025.3539948
Mohammad Ghasempour;Hadi Amirpour;Christian Timmerer
Live video streaming’s growing demand for high-quality content has resulted in significant energy consumption, creating challenges for sustainable media delivery. Traditional adaptive video streaming approaches rely on the over-provisioning of resources, leading to a fixed bitrate ladder, which is often inefficient for the heterogeneous set of use cases and video content. Although dynamic approaches like per-title encoding optimize the bitrate ladder for each video, they mainly target video-on-demand to avoid latency and fail to address energy consumption. In this paper, we present LiveESTR, a method for building a quality- and energy-aware bitrate ladder for live video streaming. LiveESTR eliminates the need for exhaustive video encoding processes on the server side, ensuring that the bitrate ladder construction process is fast and energy efficient. A lightweight model for multi-label classification, along with a lookup table, is utilized to estimate the optimized resolution-bitrate pair in the bitrate ladder. Furthermore, both spatial and temporal resolutions are supported to achieve high energy savings while preserving compression efficiency. Therefore, a tunable parameter $\lambda$ and a threshold $\tau$ are introduced to balance the trade-off between compression/quality and energy efficiency. Experimental results show that LiveESTR reduces the encoder and decoder energy consumption by 74.6% and 29.7%, respectively, with only a 2.1% increase in Bjøntegaard Delta Rate (BD-Rate) compared to traditional per-title encoding. Furthermore, it is shown that by increasing $\lambda$ to prioritize video quality, LiveESTR achieves 2.2% better compression efficiency in terms of BD-Rate while still reducing decoder energy consumption by 7.5%.
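One way to picture a tunable quality/energy trade-off with a weight and a threshold is the toy selection rule below: candidates within a quality-loss budget (tau) are ranked by quality minus lam times energy. The scoring formula, the candidate fields, and the example numbers are assumptions for illustration only, not the LiveESTR model or its lookup table.

```python
# Toy quality/energy-weighted selection of a bitrate-ladder rung.
from dataclasses import dataclass

@dataclass
class LadderCandidate:
    resolution: tuple[int, int]   # (width, height); temporal downscaling would add a frame rate
    bitrate_kbps: int
    est_quality: float            # e.g. predicted VMAF
    est_energy: float             # normalized encode+decode energy

def pick_rung(candidates: list[LadderCandidate], lam: float, tau: float) -> LadderCandidate:
    """Choose the best quality/energy score among candidates within a quality-loss budget."""
    best_quality = max(c.est_quality for c in candidates)
    admissible = [c for c in candidates if best_quality - c.est_quality <= tau]
    return max(admissible, key=lambda c: c.est_quality - lam * c.est_energy)

ladder = [
    LadderCandidate((1920, 1080), 4500, est_quality=93.0, est_energy=1.00),
    LadderCandidate((1280, 720), 3000, est_quality=90.5, est_energy=0.62),
    LadderCandidate((960, 540), 2000, est_quality=86.0, est_energy=0.41),
]
print(pick_rung(ladder, lam=10.0, tau=4.0).resolution)   # (1280, 720) with these weights
```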
{"title":"Real-Time Quality- and Energy-Aware Bitrate Ladder Construction for Live Video Streaming","authors":"Mohammad Ghasempour;Hadi Amirpour;Christian Timmerer","doi":"10.1109/JETCAS.2025.3539948","DOIUrl":"https://doi.org/10.1109/JETCAS.2025.3539948","url":null,"abstract":"Live video streaming’s growing demand for high-quality content has resulted in significant energy consumption, creating challenges for sustainable media delivery. Traditional adaptive video streaming approaches rely on the over-provisioning of resources leading to a fixed bitrate ladder, which is often inefficient for the heterogeneous set of use cases and video content. Although dynamic approaches like per-title encoding optimize the bitrate ladder for each video, they mainly target video-on-demand to avoid latency and fail to address energy consumption. In this paper, we present LiveESTR, a method for building a quality- and energy-aware bitrate ladder for live video streaming. LiveESTR eliminates the need for exhaustive video encoding processes on the server side, ensuring that the bitrate ladder construction process is fast and energy efficient. A lightweight model for multi-label classification, along with a lookup table, is utilized to estimate the optimized resolution-bitrate pair in the bitrate ladder. Furthermore, both spatial and temporal resolutions are supported to achieve high energy savings while preserving compression efficiency. Therefore, a tunable parameter <inline-formula> <tex-math>$lambda $ </tex-math></inline-formula> and a threshold <inline-formula> <tex-math>$tau $ </tex-math></inline-formula> are introduced to balance the trade-off between compression/quality and energy efficiency. Experimental results show that LiveESTR reduces the encoder and decoder energy consumption by 74.6 % and 29.7 %, with only a 2.1 % increase in Bjøntegaard Delta Rate (BD-Rate) compared to traditional per-title encoding. Furthermore, it is shown that by increasing <inline-formula> <tex-math>$lambda $ </tex-math></inline-formula> to prioritize video quality, LiveESTR achieves 2.2 % better compression efficiency in terms of BD-Rate while still reducing decoder energy consumption by 7.5 %.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"15 1","pages":"83-93"},"PeriodicalIF":3.7,"publicationDate":"2025-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10877851","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143602039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-02-04. DOI: 10.1109/JETCAS.2025.3538652
Runyu Yang;Dong Liu;Feng Wu;Wen Gao
Learned image compression has shown remarkable compression efficiency gains over traditional image compression solutions, which is partially attributed to the learned entropy models and the adopted entropy coding engine. However, the inference of the entropy models and the sequential nature of the entropy coding both incur high time complexity. Meanwhile, the neural network-based entropy models usually involve floating-point computations, which cause inconsistent probability estimation and decoding failures across platforms. We address these limitations by introducing an efficient and cross-platform entropy coding method, chain coding-based latent compression (CC-LC), into learned image compression. First, we leverage classic chain coding and carefully design a block-based entropy coding procedure, significantly reducing the number of coding symbols and thus the coding time. Second, since CC-LC is not based on neural networks, we propose a rate estimation network as a surrogate of CC-LC during the end-to-end training. Third, we alternately train the analysis/synthesis networks and the rate estimation network for rate-distortion optimization, making the learned latents fit CC-LC. Experimental results show that our method achieves much lower time complexity than other learned image compression methods, ensures cross-platform consistency, and has comparable compression efficiency with BPG. Our code and models are publicly available at https://github.com/Yang-Runyu/CC-LC.
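For readers unfamiliar with the classic chain coding that CC-LC builds on, the sketch below shows plain 8-direction Freeman chain coding of a connected pixel path: the path is stored as a start point plus a sequence of direction symbols, which is the kind of compact symbol stream an entropy coder can then process. The block-based CC-LC procedure itself is not reproduced here.

```python
# Classic 8-direction Freeman chain coding of a connected pixel path.

# Direction index -> (dx, dy), counter-clockwise starting from East.
DIRS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]
DIR_INDEX = {d: i for i, d in enumerate(DIRS)}

def chain_encode(path: list[tuple[int, int]]) -> tuple[tuple[int, int], list[int]]:
    """Encode an 8-connected pixel path as (start point, list of direction codes)."""
    codes = []
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        codes.append(DIR_INDEX[(x1 - x0, y1 - y0)])
    return path[0], codes

def chain_decode(start: tuple[int, int], codes: list[int]) -> list[tuple[int, int]]:
    """Rebuild the pixel path from the start point and direction codes."""
    x, y = start
    path = [(x, y)]
    for c in codes:
        dx, dy = DIRS[c]
        x, y = x + dx, y + dy
        path.append((x, y))
    return path

p = [(0, 0), (1, 0), (2, 1), (2, 2), (1, 2)]
start, codes = chain_encode(p)
assert chain_decode(start, codes) == p
print(codes)   # [0, 1, 2, 4]
```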
{"title":"Learned Image Compression With Efficient Cross-Platform Entropy Coding","authors":"Runyu Yang;Dong Liu;Feng Wu;Wen Gao","doi":"10.1109/JETCAS.2025.3538652","DOIUrl":"https://doi.org/10.1109/JETCAS.2025.3538652","url":null,"abstract":"Learned image compression has shown remarkable compression efficiency gain over the traditional image compression solutions, which is partially attributed to the learned entropy models and the adopted entropy coding engine. However, the inference of the entropy models and the sequential nature of the entropy coding both incur high time complexity. Meanwhile, the neural network-based entropy models usually involve floating-point computations, which incur inconsistent probability estimation and decoding failure in different platforms. We address these limitations by introducing an efficient and cross-platform entropy coding method, chain coding-based latent compression (CC-LC), into learned image compression. First, we leverage the classic chain coding and carefully design a block-based entropy coding procedure, significantly reducing the number of coding symbols and thus the coding time. Second, since CC-LC is not based on neural networks, we propose a rate estimation network as a surrogate of CC-LC during the end-to-end training. Third, we alternately train the analysis/synthesis networks and the rate estimation network for the rate-distortion optimization, making the learned latent fit CC-LC. Experimental results show that our method achieves much lower time complexity than the other learned image compression methods, ensures cross-platform consistency, and has comparable compression efficiency with BPG. Our code and models are publicly available at <uri>https://github.com/Yang-Runyu/CC-LC</uri>.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"15 1","pages":"72-82"},"PeriodicalIF":3.7,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143602040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-02-03. DOI: 10.1109/JETCAS.2025.3538016
Rashed Al Amin;Roman Obermaisser
Super-resolution (SR) systems represent a rapidly advancing area within Information and Communication Technology (ICT) due to their significant applications in computer vision and visual communication. Integrating SR systems with Deep Neural Networks (DNNs) is a widely adopted method for achieving faster and better image reconstruction. However, the real-time computational demands, extensive energy overhead, and huge memory footprints associated with DNN-based SR systems limit their throughput and scalability. Field-programmable gate arrays (FPGAs) present a viable and promising solution for exploring the structure and architecture of SR systems due to their reconfigurable nature and parallel computing capabilities. While existing FPGA-based solutions can effectively reduce the computational latency of SR systems, they often result in higher resource and energy consumption. Besides, traditional SR techniques generally focus on either upscaling or downscaling images or videos without offering any scaling reconfigurability. To address these limitations, this paper introduces BiDSRS+, a novel FPGA-based, resource-efficient, and reconfigurable real-time SR system using a modified bicubic interpolation method. In addition, BiDSRS+ supports both upscaling and downscaling of images and videos, enhancing its versatility. Evaluations conducted on the Xilinx ZCU 102 FPGA board reveal substantial resource savings, with reductions of 44x in LUT, 31x in BRAM, and 35x in DSP utilization compared to state-of-the-art DNN-based SR systems, albeit with a 0.5x trade-off in throughput. Furthermore, when compared to leading algorithm-based SR systems, BiDSRS+ achieves reductions of 5.8x in LUT, 1.75x in BRAM, and 2.3x in power consumption, without compromising throughput. Due to its high resource efficiency and reconfigurability at a throughput of 4K@60 FPS, BiDSRS+ offers significant advantages in promoting sustainable and energy-efficient green video communication.
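For reference, the standard bicubic (Keys) interpolation kernel that such a scaler builds on can be written in a few lines; the specific modifications BiDSRS+ applies are not described in the abstract, so the sketch below is only the textbook baseline, shown in 1D for clarity.

```python
# Textbook Keys cubic convolution kernel and a 1D resampling example.

def cubic_weight(x: float, a: float = -0.5) -> float:
    """Keys cubic convolution kernel with the conventional a = -0.5."""
    x = abs(x)
    if x <= 1.0:
        return (a + 2) * x**3 - (a + 3) * x**2 + 1
    if x < 2.0:
        return a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a
    return 0.0

def resample_1d(samples: list[float], t: float) -> float:
    """Interpolate a value at fractional position t using the 4 nearest samples."""
    i = int(t)
    frac = t - i
    total = 0.0
    for m in range(-1, 3):                             # taps at i-1, i, i+1, i+2
        idx = min(max(i + m, 0), len(samples) - 1)     # clamp at the borders
        total += samples[idx] * cubic_weight(m - frac)
    return total

print(resample_1d([0.0, 1.0, 2.0, 3.0], 1.5))   # 1.5 on a linear ramp
```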
{"title":"BiDSRS+: Resource Efficient Reconfigurable Real Time Bidirectional Super Resolution System for FPGAs","authors":"Rashed Al Amin;Roman Obermaisser","doi":"10.1109/JETCAS.2025.3538016","DOIUrl":"https://doi.org/10.1109/JETCAS.2025.3538016","url":null,"abstract":"Super-resolution (SR) systems represent a rapidly advancing area within Information and Communication Technology (ICT) due to their significant applications in computer vision and visual communication. Integrating SR systems with Deep Neural Networks (DNNs) is a widely adopted method for leveraging faster and improved image reconstruction. However, the real-time computational demands, extensive energy overhead and the huge memory footprints associated with DNN-based SR systems limit their throughput and scalability. Field-programmable gate arrays (FPGAs) present a viable and promising solution for exploring the structure and architecture of SR systems due to their reconfigurable nature and parallel computing capabilities. The existing FPGA-based solutions can effectively reduce the computational latency in SR systems, they often result in higher resource and energy consumption. Besides, the traditional SR techniques generally focus on either upscaling or downscaling images or videos without offering any scaling reconfigurability. To address these limitations, this paper introduces <italic>BiDSRS+</i>, a novel FPGA based resource-efficient and reconfigurable real-time SR system using modified bicubic interpolation method. In addition, <italic>BiDSRS+</i> supports both upscaling and downscaling of images and videos, enhancing its versatility. Evaluations conducted on the Xilinx ZCU 102 FPGA board reveal substantial resource savings, with reductions of 44x LUT, 31x BRAM, and 35x DSP utilization compared to state-of-the-art DNN-based SR systems, albeit with a trade-off in throughput of 0.5x. Furthermore, when compared to leading algorithm-based SR systems, <italic>BiDSRS+</i> achieves reductions of 5.8x LUT, 1.75x BRAM, and 2.3x Power consumption, without compromising the throughput. Due to its high resource efficiency and reconfigurability with a throughput of 4K@60 FPS, <italic>BiDSRS+</i> offers significant advantages in promoting sustainable and energy-efficient green video communication.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"15 1","pages":"120-132"},"PeriodicalIF":3.7,"publicationDate":"2025-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143601976","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As silicon scaling nears its limits and the Big Data era unfolds, in-memory computing is increasingly important for overcoming the Von Neumann bottleneck and thus enhancing modern computing performance. One of the rising in-memory technologies is the memristor, a resistor capable of memorizing its state based on an applied voltage, making it useful for both storage and computation. Another emerging computing paradigm is Approximate Computing, which allows for errors in calculations to in turn reduce die area, processing time, and energy consumption. In an attempt to combine both concepts and leverage their benefits, we propose the memristor-based adaptive approximate adder ApprOchs, which is able to selectively compute segments of an addition either approximately or exactly. ApprOchs is designed to adapt to the input data given and thus only compute as much as is needed, a quality that current state-of-the-art (SoA) in-memory adders lack. Despite also using OR-based approximation in the lower k bits, ApprOchs has the edge over S-SINC because ApprOchs can skip the computation of the upper n-k bits for a small number of possible input combinations ($2^{2k}$ of $2^{2n}$ possible combinations skip the upper bits). Compared to SoA in-memory approximate adders, ApprOchs outperforms them in terms of energy consumption while being highly competitive in terms of error behavior, with moderate speed and area efficiency. In application use cases, ApprOchs demonstrates its energy efficiency, particularly in machine learning applications. In MNIST classification using Deep Convolutional Neural Networks, we achieve 78.4% energy savings compared to SoA approximate adders with the same accuracy as exact adders at 98.9%, while for k-means clustering, we observed a 69% reduction in energy consumption with no quality drop in clustering results compared to the exact computation. For image blurring, we achieve up to 32.7% energy reduction over the exact computation, and in its most promising configuration ($k=3$), the ApprOchs adder consumes 13.4% less energy than the most energy-efficient competing SoA design (S-SINC+), while achieving a similarly excellent median image quality at 43.74 dB PSNR and 0.995 SSIM.
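A behavioral model of the scheme described above (OR-approximated lower k bits, exact upper bits, and an upper-part skip, presumably taken when the upper bits of both operands are zero) might look like the sketch below. How the real design handles carries into the upper part and detects the skip in hardware is not specified here, so treat this purely as a functional illustration, not the ApprOchs circuit.

```python
# Behavioral sketch of an OR-based lower-k-bit approximate adder with an upper-part skip.

def approchs_like_add(a: int, b: int, n: int = 8, k: int = 3) -> int:
    """Approximate n-bit add: lower k bits are OR-ed, upper n-k bits added exactly."""
    low_mask = (1 << k) - 1
    low = (a | b) & low_mask                 # OR-based approximation, no carry generated
    a_hi, b_hi = a >> k, b >> k
    if a_hi == 0 and b_hi == 0:              # only 2^(2k) of 2^(2n) inputs take this path
        return low                           # upper-part computation skipped entirely
    return ((a_hi + b_hi) << k) | low        # exact addition of the upper n-k bits

exact = 100 + 29
approx = approchs_like_add(100, 29)
print(exact, approx)                         # 129 versus the approximate result 125
```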
{"title":"ApprOchs: A Memristor-Based In-Memory Adaptive Approximate Adder","authors":"Dominik Ochs;Lukas Rapp;Leandro Borzyk;Nima Amirafshar;Nima TaheriNejad","doi":"10.1109/JETCAS.2025.3537328","DOIUrl":"https://doi.org/10.1109/JETCAS.2025.3537328","url":null,"abstract":"As silicon scaling nears its limits and the <italic>Big Data</i> era unfolds, in-memory computing is increasingly important for overcoming the <italic>Von Neumann</i> bottleneck and thus enhancing modern computing performance. One of the rising in-memory technologies are <italic>Memristors</i>, which are resistors capable of memorizing state based on an applied voltage, making them useful for storage and computation. Another emerging computing paradigm is <italic>Approximate Computing</i>, which allows for errors in calculations to in turn reduce die area, processing time and energy consumption. In an attempt to combine both concepts and leverage their benefits, we propose the memristor-based adaptive approximate adder <italic>ApprOchs</i> - which is able to selectively compute segments of an addition either approximately or exactly. ApprOchs is designed to adapt to the input data given and thus only compute as much as is needed, a quality current State-of-the-Art (SoA) in-memory adders lack. Despite also using OR-based approximation in the lower k bit, ApprOchs has the edge over S-SINC because ApprOchs can skip the computation of the upper n-k bit for a small number of possible input combinations (22k of 22n possible combinations skip the upper bits). Compared to SoA in-memory approximate adders, ApprOchs outperforms them in terms of energy consumption while being highly competitive in terms of error behavior, with moderate speed and area efficiency. In application use cases, ApprOchs demonstrates its energy efficiency, particularly in machine learning applications. In MNIST classification using Deep Convolutional Neural Networks, we achieve 78.4% energy savings compared to SoA approximate adders with the same accuracy as exact adders at 98.9%, while for k-means clustering, we observed a 69% reduction in energy consumption with no quality drop in clustering results compared to the exact computation. For image blurring, we achieve up to 32.7% energy reduction over the exact computation and in its most promising configuration (<inline-formula> <tex-math>$k=3$ </tex-math></inline-formula>), the ApprOchs adder consumes 13.4% less energy than the most energy-efficient competing SoA design (S-SINC+), while achieving a similarly excellent median image quality at 43.74dB PSNR and 0.995 SSIM.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"15 1","pages":"105-119"},"PeriodicalIF":3.7,"publicationDate":"2025-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143601977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}