
Latest Publications: IEEE Journal on Emerging and Selected Topics in Circuits and Systems

LLM4Netlist: LLM-Enabled Step-Based Netlist Generation From Natural Language Description
IF 3.7 | CAS Zone 2 (Engineering & Technology) | JCR Q2 (ENGINEERING, ELECTRICAL & ELECTRONIC) | Pub Date: 2025-03-09 | DOI: 10.1109/JETCAS.2025.3568548
Kailiang Ye;Qingyu Yang;Zheng Lu;Heng Yu;Tianxiang Cui;Ruibin Bai;Linlin Shen
Empowered by Large Language Models (LLMs), substantial progress has been made in enhancing the EDA design flow at the high-level-synthesis stage, such as direct translation from a high-level language into an RTL description. By contrast, little research has addressed netlist generation for logic synthesis. Directly applying LLMs to netlist generation presents additional challenges due to the scarcity of netlist-specific data, the need for tailored fine-tuning, and the lack of effective generation methods. This work first presents a novel training set and two evaluation sets tailored for direct netlist-generation LLMs, together with an effective pipeline for constructing these datasets. It then proposes LLM4Netlist, a novel step-based netlist generation framework built on a fine-tuned LLM. The framework consists of a step-based prompt construction module, a fine-tuned LLM, a code confidence estimator, and a feedback loop module, and can generate netlist code directly from natural-language functional descriptions. We evaluate the efficacy of our approach on our novel evaluation datasets. The experimental results demonstrate that, compared to the average score of the 10 commercial LLMs listed in our experiments, our method improves functional correctness by 183.41% on the NetlistEval dataset and by 91.07% on NGen. The training and testing data, along with the processing code, can be found at https://github.com/klyebit/LLM4Netlist.git
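To make the pipeline concrete, here is a minimal sketch of how the four components named above (step-based prompt construction, fine-tuned LLM, confidence estimator, feedback loop) could be wired together. The `llm.generate` and `estimator.score` interfaces, the acceptance threshold, and the prompt wording are our assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a step-based generation loop with a confidence
# gate and feedback, loosely following the components the abstract names.
# All interfaces (llm.generate, estimator.score) are illustrative.

def generate_netlist(description: str, llm, estimator, max_rounds: int = 3) -> str:
    """Generate a gate-level netlist from a natural-language description."""
    feedback = ""
    netlist = ""
    for _ in range(max_rounds):
        # Step-based prompt: ask the model to decompose the spec into
        # ordered design steps before emitting code, rather than making
        # one monolithic request.
        prompt = (
            "Decompose the following functional description into ordered "
            "design steps, then emit a gate-level Verilog netlist.\n"
            f"Description: {description}\n"
            f"Feedback from previous round: {feedback or 'none'}"
        )
        netlist = llm.generate(prompt)

        # Confidence gate: accept the code only if the estimator scores
        # it above a threshold; otherwise loop the diagnosis back in.
        score, diagnosis = estimator.score(netlist, description)
        if score >= 0.8:  # illustrative threshold
            return netlist
        feedback = diagnosis
    return netlist  # best effort after max_rounds
```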
Citations: 0
GPTAC: Domain-Specific Generative Pre-Trained Model for Approximate Circuit Design Exploration
IF 3.7 | CAS Zone 2 (Engineering & Technology) | JCR Q2 (ENGINEERING, ELECTRICAL & ELECTRONIC) | Pub Date: 2025-03-09 | DOI: 10.1109/JETCAS.2025.3568606
Sipei Yi;Weichuan Zuo;Hongyi Wu;Ruicheng Dai;Weikang Qian;Jienan Chen
Automatically designing fast and low-cost digital circuits is challenging because of the discrete nature of circuits and the enormous design space, particularly when exploring approximate circuits. Recent advances in generative artificial intelligence (GAI), however, offer a promising way to address these challenges. In this work, we present GPTAC, a domain-specific generative pre-trained (GPT) model customized for designing approximate circuits. Given a desired circuit accuracy or area, GPTAC automatically generates an approximate circuit using its generative capabilities. We represent circuits using domain-specific language tokens, refined through a hardware description language keyword filter applied to gate-level code. This representation enables GPTAC to effectively learn approximate circuits from existing datasets by leveraging the GPT language model, as the training data can be derived directly from gate-level code. Additionally, by focusing on a domain-specific language, only a limited set of keywords is maintained, which speeds up model convergence. To improve the success rate of the generated circuits, we introduce a circuit check rule that masks GPTAC inference results when necessary. Experiments indicate that GPTAC can produce approximate multipliers in under 15 seconds while using merely 4 GB of GPU memory, achieving a 10-40% area reduction relative to the exact multiplier, depending on the accuracy requirement.
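As an illustration of the token-vocabulary idea, the sketch below shows one plausible form of a hardware-description-language keyword filter over gate-level code; the keyword set, the `<ID>` collapsing rule, and the regex are assumptions for this example, not GPTAC's actual tokenizer.

```python
# Illustrative sketch (not the authors' code) of an HDL keyword filter:
# gate-level code is tokenized against a small fixed vocabulary, so a
# domain-specific GPT sees few distinct tokens and converges faster.
import re

HDL_KEYWORDS = {
    "module", "endmodule", "input", "output", "wire", "assign",
    "and", "or", "nand", "nor", "xor", "xnor", "not",
}

def tokenize_gate_level(code: str) -> list[str]:
    """Map gate-level Verilog to domain tokens; identifiers collapse to <ID>."""
    tokens = []
    for word in re.findall(r"[A-Za-z_]\w*|[();,=\[\]:]", code):
        if word in HDL_KEYWORDS:
            tokens.append(word)
        elif word.isidentifier():
            tokens.append("<ID>")   # names carry no structure the model needs
        else:
            tokens.append(word)     # punctuation is kept verbatim
    return tokens

print(tokenize_gate_level("and g1 (y, a, b);"))
# ['and', '<ID>', '(', '<ID>', ',', '<ID>', ',', '<ID>', ')', ';']
```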
Citations: 0
End-to-End Acceleration of Generative Models With Runtime Regularized KV Cache Management
IF 3.7 | CAS Zone 2 (Engineering & Technology) | JCR Q2 (ENGINEERING, ELECTRICAL & ELECTRONIC) | Pub Date: 2025-03-09 | DOI: 10.1109/JETCAS.2025.3568716
Ashkan Moradifirouzabadi;Mingu Kang
Despite their remarkable success in achieving high performance, Transformer-based models impose substantial computational and memory-bandwidth requirements, posing significant challenges for hardware deployment. A key contributor to these challenges is the large KV cache, which adds data-movement costs on top of the model parameters. While various token-pruning techniques have been proposed to reduce the computational complexity and storage requirements of the attention mechanism by eliminating redundant tokens, these methods often introduce irregular sparsity patterns that complicate hardware implementation. To address these challenges, we propose a hardware and algorithm co-design approach. Our solution features a Runtime Cache Eviction (RCE) algorithm that removes the least relevant tokens and replaces them with newly generated ones, maintaining a constant KV cache size across blocks and inputs. To support this algorithm, we design an accelerator equipped with a KV Memory Management Unit (KV-MMU), which efficiently manages active tokens through eviction and replacement, thereby optimizing DRAM storage and access. Additionally, our design integrates batch processing and an optimized processing pipeline to improve end-to-end throughput, effectively meeting the requirements of both the pre-filling and generation stages. The proposed system achieves up to an 8× KV cache size reduction with minimal accuracy degradation. In a 65 nm process, the proposed accelerator demonstrates 1.52× energy savings and 3.62× delay reduction when processing a batch size of 16, with only a 1.11% energy overhead attributed to the specialized KV-MMU.
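A minimal PyTorch sketch of the constant-size eviction policy described above follows. Scoring relevance by accumulated attention mass is our assumption; the paper's exact relevance metric and datapath are not reproduced here.

```python
# Sketch of runtime KV-cache eviction: when the cache is full, the token
# with the lowest accumulated attention relevance is evicted and the new
# token takes its slot, so the cache size stays constant. The relevance
# metric (summed attention weights) is an assumption for illustration.
import torch

class FixedSizeKVCache:
    def __init__(self, capacity: int, dim: int):
        self.k = torch.zeros(capacity, dim)
        self.v = torch.zeros(capacity, dim)
        self.relevance = torch.zeros(capacity)  # running attention mass
        self.size = 0

    def update_relevance(self, attn_weights: torch.Tensor) -> None:
        # attn_weights: (size,) attention paid to each cached token by
        # the newest query; accumulate it as a relevance score.
        self.relevance[: self.size] += attn_weights

    def insert(self, k_new: torch.Tensor, v_new: torch.Tensor) -> None:
        if self.size < self.k.shape[0]:
            slot = self.size          # cache not yet full: append
            self.size += 1
        else:
            # Evict the least relevant token; capacity never grows.
            slot = int(torch.argmin(self.relevance[: self.size]))
        self.k[slot], self.v[slot] = k_new, v_new
        self.relevance[slot] = 0.0    # fresh token starts from zero
```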
Citations: 0
GEMMV: An LLM-Based Automated Performance-Aware Framework for GEMM Verilog Generation
IF 3.7 | CAS Zone 2 (Engineering & Technology) | JCR Q2 (ENGINEERING, ELECTRICAL & ELECTRONIC) | Pub Date: 2025-03-09 | DOI: 10.1109/JETCAS.2025.3568712
Gaoche Zhang;Dingyang Zou;Kairui Sun;Zhihuan Chen;Meiqi Wang;Zhongfeng Wang
Recent advancements in artificial intelligence (AI) models have intensified the need for specialized AI accelerators. Designing optimized general matrix multiplication (GEMM) modules tailored for these accelerators is crucial but time-consuming and expertise-demanding, creating demand for automated design processes. Large language models (LLMs), capable of generating high-quality designs from human instructions, show great promise for automating GEMM module creation. However, the GEMM module's vast design space and stringent performance requirements, along with the limitations of existing datasets and LLMs' lack of hardware-performance awareness, have made previous LLM-based register transfer level (RTL) code generation efforts unsuitable for GEMM design. To tackle these challenges, this paper proposes GEMMV, an automated performance-aware LLM-based framework for generating high-correctness, high-performance Verilog code for GEMM. The framework uses in-context learning based on GPT-4 to automatically generate high-quality, well-annotated Verilog code for different GEMM variants. Additionally, it obtains performance awareness by integrating a multi-level performance model (MLPM) with fine-tuned LLMs through in-context learning. The Verilog code generated by this framework reduces latency by 3.1× and improves syntax correctness by 65% and functional correctness by 70% compared to earlier efforts.
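The sketch below illustrates, under stated assumptions, how a performance model can gate in-context generation: few-shot exemplars seed the prompt, and a latency surrogate (standing in for the MLPM) filters candidates. All names here (`build_prompt`, `predict_latency`, the spec fields) are hypothetical, not GEMMV's API.

```python
# Hedged sketch of performance-aware generation: few-shot (in-context)
# exemplars plus a performance-model filter over candidate modules.
# predict_latency stands in for the multi-level performance model.

def build_prompt(spec: dict, exemplars: list[str]) -> str:
    shots = "\n\n".join(exemplars)  # annotated GEMM Verilog examples
    return (
        f"{shots}\n\n"
        f"Write a GEMM Verilog module: {spec['M']}x{spec['K']} by "
        f"{spec['K']}x{spec['N']}, target latency {spec['max_cycles']} cycles."
    )

def pick_best(candidates: list[str], spec: dict, predict_latency) -> str:
    # Keep only candidates the performance model predicts will meet the
    # latency budget, then return the fastest of those.
    scored = [(predict_latency(c), c) for c in candidates]
    feasible = [(lat, c) for lat, c in scored if lat <= spec["max_cycles"]]
    if not feasible:
        raise ValueError("no candidate meets the latency target")
    return min(feasible)[1]
```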
Citations: 0
A Scalable and Energy-Efficient Processing-in-Memory Architecture for Gen-AI
IF 3.7 | CAS Zone 2 (Engineering & Technology) | JCR Q2 (ENGINEERING, ELECTRICAL & ELECTRONIC) | Pub Date: 2025-03-05 | DOI: 10.1109/JETCAS.2025.3566929
Gian Singh;Sarma Vrudhula
Large language models (LLMs) have achieved high accuracy in diverse NLP and computer vision tasks thanks to self-attention mechanisms that rely on GEMM and GEMV operations. However, scaling LLMs poses significant computational and energy challenges, particularly for traditional Von Neumann architectures (CPUs/GPUs), which incur high latency and energy consumption from frequent data movement. These issues are even more pronounced in energy-constrained edge environments. While DRAM-based near-memory architectures offer improved energy efficiency and throughput, their processing elements are limited by strict area, power, and timing constraints. This work introduces CIDAN-3D, a novel Processing-in-Memory (PIM) architecture tailored for LLMs. It features an ultra-low-power Neuron Processing Element (NPE) with high compute density (#Operations/Area), enabling efficient in-situ execution of LLM operations by exploiting the high parallelism available within DRAM. CIDAN-3D reduces data movement, improves locality, and achieves substantial gains in performance and energy efficiency: compared to prior near-memory designs, it shows up to 1.3× higher throughput and 21.9× better energy efficiency for smaller models, and 3× higher throughput and 71× better energy efficiency for large decoder-only models. As a result, CIDAN-3D offers a scalable, energy-efficient platform for LLM-driven Gen-AI applications.
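A back-of-the-envelope sketch of why decoder-side GEMV is memory-bound, which is the bottleneck PIM targets: each weight is fetched once and used in exactly one multiply-accumulate, so arithmetic intensity is fixed and low. The numbers below are illustrative, not taken from the paper.

```python
# For y = W @ x, every weight crosses the memory bus once and feeds one
# MAC, so arithmetic intensity is ~1 MAC per weight fetched regardless
# of matrix size. Illustrative arithmetic only.

def gemv_intensity(rows: int, cols: int, bytes_per_weight: int = 2) -> float:
    macs = rows * cols                            # one MAC per weight
    bytes_moved = rows * cols * bytes_per_weight  # every weight is fetched
    return macs / bytes_moved                     # MACs per byte of traffic

# A 4096x4096 fp16 layer yields 0.5 MAC/byte, far below what compute
# units can sustain, which is why moving compute into DRAM pays off.
print(gemv_intensity(4096, 4096))  # 0.5
```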
Citations: 0
Circuits and Systems for Green Video Communications: Fundamentals and Recent Trends
IF 3.7 | CAS Zone 2 (Engineering & Technology) | JCR Q2 (ENGINEERING, ELECTRICAL & ELECTRONIC) | Pub Date: 2025-02-10 | DOI: 10.1109/JETCAS.2025.3540360
Christian Herglotz;Daniel Palomino;Olivier Le Meur;C.-C. Jay Kuo
The past years have shown that, owing to the global success of video communication technology, the corresponding hardware systems now contribute significantly to pollution and resource consumption on a global scale, accounting for 1% of global greenhouse gas emissions in 2018. This aspect of sustainability has thus received increasing attention in academia and industry. In this paper, we present different aspects of sustainability, including resource consumption and greenhouse gas emissions, with a major focus on the energy consumed while video systems are in use. Finally, we provide an overview of recent research in the domain of green video communications, showing promising results and highlighting areas where more research is needed.
Citations: 0
Real-Time Quality- and Energy-Aware Bitrate Ladder Construction for Live Video Streaming
IF 3.7 | CAS Zone 2 (Engineering & Technology) | JCR Q2 (ENGINEERING, ELECTRICAL & ELECTRONIC) | Pub Date: 2025-02-07 | DOI: 10.1109/JETCAS.2025.3539948
Mohammad Ghasempour;Hadi Amirpour;Christian Timmerer
Live video streaming’s growing demand for high-quality content has resulted in significant energy consumption, creating challenges for sustainable media delivery. Traditional adaptive video streaming approaches rely on over-provisioning of resources, leading to a fixed bitrate ladder that is often inefficient for heterogeneous use cases and video content. Although dynamic approaches like per-title encoding optimize the bitrate ladder for each video, they mainly target video-on-demand to avoid latency and fail to address energy consumption. In this paper, we present LiveESTR, a method for building a quality- and energy-aware bitrate ladder for live video streaming. LiveESTR eliminates the need for exhaustive video encoding on the server side, ensuring that the bitrate-ladder construction process is fast and energy efficient. A lightweight multi-label classification model, along with a lookup table, is utilized to estimate the optimized resolution-bitrate pair in the ladder. Furthermore, both spatial and temporal resolutions are supported to achieve high energy savings while preserving compression efficiency. A tunable parameter λ and a threshold τ are therefore introduced to balance the trade-off between compression/quality and energy efficiency. Experimental results show that LiveESTR reduces encoder and decoder energy consumption by 74.6% and 29.7%, respectively, with only a 2.1% increase in Bjøntegaard Delta Rate (BD-Rate) compared to traditional per-title encoding. Furthermore, by increasing λ to prioritize video quality, LiveESTR achieves 2.2% better compression efficiency in terms of BD-Rate while still reducing decoder energy consumption by 7.5%.
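To make the λ/τ trade-off concrete, the sketch below scores candidate resolution-bitrate rungs with a weighted quality-energy objective and thresholds them; the linear scoring form and the numbers are our assumptions, not LiveESTR's trained classifier.

```python
# Illustrative ladder construction: lam weights quality against energy,
# and tau gates out rungs whose score is too low. Both the scoring rule
# and the sample values are assumptions for this sketch.

def build_ladder(candidates, quality, energy, lam=0.5, tau=0.4):
    """candidates: list of (resolution, bitrate) rungs; quality/energy map
    each rung to a normalized [0, 1] score."""
    ladder = []
    for rung in candidates:
        score = lam * quality[rung] - (1.0 - lam) * energy[rung]
        if score >= tau:        # threshold drops energy-hungry rungs
            ladder.append(rung)
    return sorted(ladder, key=lambda r: r[1])  # order rungs by bitrate

rungs = [((1920, 1080), 4500), ((1280, 720), 2500), ((854, 480), 1000)]
q = {rungs[0]: 0.95, rungs[1]: 0.85, rungs[2]: 0.70}
e = {rungs[0]: 0.90, rungs[1]: 0.50, rungs[2]: 0.20}
print(build_ladder(rungs, q, e, lam=0.6, tau=0.3))
# [((854, 480), 1000), ((1280, 720), 2500)] -- 1080p is gated out
```

Raising lam toward 1 re-admits the high-quality rung at the cost of energy, mirroring the λ-driven quality/energy trade-off described above.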
Citations: 0
Learned Image Compression With Efficient Cross-Platform Entropy Coding
IF 3.7 | CAS Zone 2 (Engineering & Technology) | JCR Q2 (ENGINEERING, ELECTRICAL & ELECTRONIC) | Pub Date: 2025-02-04 | DOI: 10.1109/JETCAS.2025.3538652
Runyu Yang;Dong Liu;Feng Wu;Wen Gao
Learned image compression has shown remarkable compression-efficiency gains over traditional image compression solutions, partially attributable to the learned entropy models and the adopted entropy coding engine. However, the inference of the entropy models and the sequential nature of entropy coding both incur high time complexity. Meanwhile, neural-network-based entropy models usually involve floating-point computations, which cause inconsistent probability estimation and decoding failures across platforms. We address these limitations by introducing an efficient, cross-platform entropy coding method, chain coding-based latent compression (CC-LC), into learned image compression. First, we leverage classic chain coding and carefully design a block-based entropy coding procedure, significantly reducing the number of coding symbols and thus the coding time. Second, since CC-LC is not based on neural networks, we propose a rate estimation network as a surrogate for CC-LC during end-to-end training. Third, we alternately train the analysis/synthesis networks and the rate estimation network for rate-distortion optimization, making the learned latents fit CC-LC. Experimental results show that our method achieves much lower time complexity than other learned image compression methods, ensures cross-platform consistency, and has compression efficiency comparable to BPG. Our code and models are publicly available at https://github.com/Yang-Runyu/CC-LC.
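For readers unfamiliar with the classic primitive CC-LC builds on, the sketch below shows Freeman 8-direction chain coding of a pixel path: a contour is stored as a start point plus a sequence of direction codes, so few symbols cover many pixels. The paper's block-based procedure adds machinery not shown here.

```python
# Classic (Freeman) chain coding: a path over the pixel grid is encoded
# as a start point plus one 3-bit direction code per step. This shows
# the primitive only, not the paper's block-based entropy coder.

# 8 neighbor offsets as (row, col), indexed 0..7 counter-clockwise from east.
DIRS = [(0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1), (1, 0), (1, 1)]

def chain_encode(path):
    """Encode a pixel path [(r, c), ...] as (start point, direction codes)."""
    codes = []
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        codes.append(DIRS.index((r1 - r0, c1 - c0)))
    return path[0], codes

def chain_decode(start, codes):
    path = [start]
    for d in codes:
        dr, dc = DIRS[d]
        r, c = path[-1]
        path.append((r + dr, c + dc))
    return path

start, codes = chain_encode([(0, 0), (0, 1), (1, 1), (1, 0)])
assert chain_decode(start, codes) == [(0, 0), (0, 1), (1, 1), (1, 0)]
print(codes)  # [0, 6, 4]: east, south, west
```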
Citations: 0
BiDSRS+: Resource Efficient Reconfigurable Real Time Bidirectional Super Resolution System for FPGAs
IF 3.7 | CAS Zone 2 (Engineering & Technology) | JCR Q2 (ENGINEERING, ELECTRICAL & ELECTRONIC) | Pub Date: 2025-02-03 | DOI: 10.1109/JETCAS.2025.3538016
Rashed Al Amin;Roman Obermaisser
Super-resolution (SR) systems represent a rapidly advancing area within Information and Communication Technology (ICT) due to their significant applications in computer vision and visual communication. Integrating SR systems with Deep Neural Networks (DNNs) is a widely adopted way to achieve faster and higher-quality image reconstruction. However, the real-time computational demands, extensive energy overhead, and large memory footprints of DNN-based SR systems limit their throughput and scalability. Field-programmable gate arrays (FPGAs) present a viable and promising platform for exploring the structure and architecture of SR systems thanks to their reconfigurable nature and parallel computing capabilities. While existing FPGA-based solutions can effectively reduce the computational latency of SR systems, they often incur higher resource and energy consumption. Moreover, traditional SR techniques generally focus on either upscaling or downscaling images or videos without offering any scaling reconfigurability. To address these limitations, this paper introduces BiDSRS+, a novel FPGA-based, resource-efficient, and reconfigurable real-time SR system using a modified bicubic interpolation method. In addition, BiDSRS+ supports both upscaling and downscaling of images and videos, enhancing its versatility. Evaluations conducted on the Xilinx ZCU 102 FPGA board reveal substantial resource savings, with reductions of 44× in LUT, 31× in BRAM, and 35× in DSP utilization compared to state-of-the-art DNN-based SR systems, albeit with a 0.5× trade-off in throughput. Furthermore, compared to leading algorithm-based SR systems, BiDSRS+ achieves reductions of 5.8× in LUTs, 1.75× in BRAM, and 2.3× in power consumption without compromising throughput. Owing to its high resource efficiency and reconfigurability at a throughput of 4K@60 FPS, BiDSRS+ offers significant advantages in promoting sustainable and energy-efficient green video communication.
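For reference, a plain NumPy sketch of textbook bicubic resampling (the Catmull-Rom-style kernel with a = -0.5) is shown below; it handles both upscaling and downscaling along one axis, but it is the classical kernel, not the authors' modified FPGA variant.

```python
# Textbook bicubic resampling along one axis. scale < 1 downscales and
# scale > 1 upscales, mirroring the bidirectional scaling the paper
# supports; the modified kernel used on the FPGA is not reproduced here.
import numpy as np

def cubic_kernel(x: np.ndarray, a: float = -0.5) -> np.ndarray:
    x = np.abs(x)
    return np.where(
        x <= 1, (a + 2) * x**3 - (a + 3) * x**2 + 1,
        np.where(x < 2, a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a, 0.0),
    )

def resize_1d(row: np.ndarray, out_len: int) -> np.ndarray:
    """Bicubic resampling of a 1-D signal to out_len samples."""
    scale = out_len / len(row)
    out = np.empty(out_len)
    for i in range(out_len):
        src = (i + 0.5) / scale - 0.5               # source coordinate
        base = int(np.floor(src))
        taps = np.arange(base - 1, base + 3)        # 4-tap neighborhood
        idx = np.clip(taps, 0, len(row) - 1)        # clamp at the borders
        w = cubic_kernel(src - taps)
        out[i] = np.dot(w, row[idx]) / w.sum()      # normalize edge weights
    return out

print(resize_1d(np.array([0.0, 1.0, 2.0, 3.0]), 8))  # 2x upscale
```

A 2-D resize applies the same routine along rows and then along columns, which is also how hardware datapaths typically separate the filter.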
Citations: 0
ApprOchs: A Memristor-Based In-Memory Adaptive Approximate Adder
IF 3.7 | CAS Zone 2 (Engineering & Technology) | JCR Q2 (ENGINEERING, ELECTRICAL & ELECTRONIC) | Pub Date: 2025-01-31 | DOI: 10.1109/JETCAS.2025.3537328
Dominik Ochs;Lukas Rapp;Leandro Borzyk;Nima Amirafshar;Nima TaheriNejad
As silicon scaling nears its limits and the Big Data era unfolds, in-memory computing is increasingly important for overcoming the Von Neumann bottleneck and thus enhancing modern computing performance. One of the rising in-memory technologies is the memristor, a resistor capable of memorizing its state based on an applied voltage, which makes it useful for both storage and computation. Another emerging computing paradigm is Approximate Computing, which tolerates errors in calculations in exchange for reduced die area, processing time, and energy consumption. To combine both concepts and leverage their benefits, we propose the memristor-based adaptive approximate adder ApprOchs, which can selectively compute segments of an addition either approximately or exactly. ApprOchs is designed to adapt to the given input data and thus compute only as much as is needed, a quality that current state-of-the-art (SoA) in-memory adders lack. Although ApprOchs also uses OR-based approximation in the lower k bits, it has an edge over S-SINC because it can skip the computation of the upper n-k bits for a subset of input combinations (2^(2k) of the 2^(2n) possible combinations skip the upper bits). Compared to SoA in-memory approximate adders, ApprOchs outperforms them in energy consumption while remaining highly competitive in error behavior, with moderate speed and area efficiency. In application use cases, ApprOchs demonstrates its energy efficiency, particularly in machine learning. In MNIST classification using deep convolutional neural networks, we achieve 78.4% energy savings compared to SoA approximate adders at the same 98.9% accuracy as exact adders, while for k-means clustering we observe a 69% reduction in energy consumption with no quality drop in clustering results compared to the exact computation. For image blurring, we achieve up to 32.7% energy reduction over the exact computation, and in its most promising configuration (k = 3), the ApprOchs adder consumes 13.4% less energy than the most energy-efficient competing SoA design (S-SINC+), while achieving a similarly excellent median image quality of 43.74 dB PSNR and 0.995 SSIM.
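A bit-level sketch of the OR-based lower-k approximation and the upper-bit skip follows; the skip condition shown (both upper parts zero, which covers 2^(2k) of the 2^(2n) input combinations) matches the count quoted above, but the circuit's exact gating is an assumption here.

```python
# Illustrative model of an adaptive approximate adder: the low k bits
# are OR-ed instead of added (no carry chain), and the exact upper-bit
# addition is skipped when both upper parts are zero -- 2^(2k) of the
# 2^(2n) input combinations. Not the authors' circuit, just its logic.

def approx_add(a: int, b: int, n: int, k: int) -> int:
    mask_lo = (1 << k) - 1
    low = (a & mask_lo) | (b & mask_lo)     # approximate: OR replaces add
    a_hi = (a >> k) & ((1 << (n - k)) - 1)  # upper n-k bits of each operand
    b_hi = (b >> k) & ((1 << (n - k)) - 1)
    if a_hi == 0 and b_hi == 0:
        return low                          # upper adder is skipped entirely
    return ((a_hi + b_hi) << k) | low       # exact addition of upper bits

print(approx_add(0b1011, 0b0110, n=4, k=2))  # 15 (the exact sum is 17)
```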
Citations: 0