Pub Date: 2025-07-17, DOI: 10.1109/JETCAS.2025.3590269
Yang Liu;Feiyang Ma;Yuzhang Zang;Jun Wang;Zhifei Xu;Kai-Da Xu
With increasing density and complexity in 3-D integrated circuits, thermal management has become a major design challenge. In this paper, we present a precise 3-D thermal analysis model that incorporates lateral thermal resistance, based on physical structure and material thermal properties. Analytical expressions for lateral thermal resistance and capacitance are derived, enabling accurate thermal modeling of complex 3-D stacked structures. We incorporate these analytical expressions into the RC-Tensorial Analysis Network (RC-TAN) framework, resulting in the 3-D RC-TAN method, which improves computational efficiency while maintaining high accuracy. Simulation and experimental results demonstrate that the 3-D RC-TAN method outperforms traditional 1-D thermal analysis approaches and reduces computation time by more than 97% compared with the finite element method (FEM).
Title: 3-D Thermal Model With Lateral Thermal Resistance for Fast Thermal Analysis of Complex Stacked Structures. Journal: IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 15, no. 3, pp. 445-457.
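The role of lateral thermal resistance described in the abstract above can be illustrated with a very small lumped network. The sketch below is not the paper's RC-TAN formulation or its analytical expressions; it uses the textbook relations R = L/(kA) and C = ρ·c_p·V, and every numeric value (cell size, material constants, the `R_VERTICAL` path to ambient) is an assumption chosen only for illustration.

```python
def conduction_resistance(length_m, conductivity_w_per_mk, area_m2):
    """Textbook 1-D conduction resistance R = L / (k * A), in K/W."""
    return length_m / (conductivity_w_per_mk * area_m2)


def thermal_capacitance(density_kg_per_m3, specific_heat_j_per_kgk, volume_m3):
    """Lumped heat capacity C = rho * c_p * V, in J/K."""
    return density_kg_per_m3 * specific_heat_j_per_kgk * volume_m3


def steady_state(coupling, to_ambient, power_w, t_ambient_c, sweeps=200):
    """Gauss-Seidel solution of the steady-state nodal heat balance.

    coupling[i][j] is the conductance (W/K) between nodes i and j;
    to_ambient[i] is the conductance (W/K) from node i to ambient.
    """
    n = len(power_w)
    temps = [t_ambient_c] * n
    for _ in range(sweeps):
        for i in range(n):
            g_total = to_ambient[i]
            heat_in = to_ambient[i] * t_ambient_c + power_w[i]
            for j in range(n):
                if j != i:
                    g_total += coupling[i][j]
                    heat_in += coupling[i][j] * temps[j]
            temps[i] = heat_in / g_total
    return temps


# Illustrative values only (assumptions, not taken from the paper).
K_SILICON = 150.0      # W/(m*K), approximate room-temperature conductivity
RHO_SILICON = 2330.0   # kg/m^3, approximate density
CP_SILICON = 700.0     # J/(kg*K), approximate specific heat
CELL = 1e-3            # edge length of one square cell, m
THICKNESS = 100e-6     # die thickness, m
R_VERTICAL = 50.0      # K/W, hypothetical path from each cell to ambient

# Heat flowing sideways crosses an area of (cell edge x die thickness).
r_lateral = conduction_resistance(CELL, K_SILICON, CELL * THICKNESS)
c_cell = thermal_capacitance(RHO_SILICON, CP_SILICON, CELL * CELL * THICKNESS)
g_lat = 1.0 / r_lateral
g_vert = 1.0 / R_VERTICAL

power = [1.0, 0.0]     # only cell 0 dissipates heat (1 W)
vertical_only = steady_state([[0.0, 0.0], [0.0, 0.0]], [g_vert, g_vert], power, 25.0)
with_lateral = steady_state([[0.0, g_lat], [g_lat, 0.0]], [g_vert, g_vert], power, 25.0)

print("lateral resistance (K/W):", r_lateral)
print("cell capacitance (J/K):  ", c_cell)
print("vertical-only temps (C): ", vertical_only)
print("with lateral temps (C):  ", with_lateral)
```

With purely vertical paths the unheated neighbour stays at ambient and the heated cell carries the full temperature rise; adding the lateral link spreads heat into the neighbour and lowers the peak, which is the effect a 1-D model cannot capture.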
Pub Date: 2025-07-17, DOI: 10.1109/JETCAS.2025.3590065
Runxi Wang;Ziheng Wang;Ting Lin;Jacob Michael Raby;Mircea R. Stan;Xinfei Guo
The rapid advancement of three-dimensional integrated circuits (3DICs) has heightened the need for early-phase design space exploration (DSE) to minimize design iterations and unexpected challenges. Emphasizing the pre-register-transfer level (Pre-RTL) design phase is crucial for reducing trial-and-error costs. However, 3DIC design introduces additional complexities due to thermal constraints and an expanded design space resulting from vertical stacking and various cooling strategies. Despite this need, Pre-RTL DSE tools for 3DICs remain scarce, and the available solutions often lack comprehensive design options and full customization support. To bridge this gap, we present Cool-3D, an end-to-end, thermal-aware framework for 3DIC design that integrates mainstream architectural-level simulators, including gem5, McPAT, and HotSpot 7.0, with advanced cooling models. Cool-3D provides broad and fine-grained design space exploration, built-in microfluidic cooling support for thermal analysis, and an extension interface for non-parameterizable customization, allowing designers to model and optimize 3DIC architectures with greater flexibility and accuracy. To validate the Cool-3D framework, we conduct four case studies demonstrating its ability to model various hardware design options and accurately capture thermal behaviors. Cool-3D serves as a foundational framework that not only facilitates comprehensive 3DIC design space exploration but also enables future innovations in 3DIC architecture, cooling strategies, and optimization techniques. The entire framework, along with the experimental data, is being released on GitHub at https://github.com/iCAS-SJTU/Cool-3D
Title: Cool-3D: An End-to-End Thermal-Aware Framework for Early-Phase Design Space Exploration of Microfluidic-Cooled 3DICs. Journal: IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 15, no. 4, pp. 659-673.
Pub Date: 2025-06-25, DOI: 10.1109/JETCAS.2025.3573432
Title: IEEE Journal on Emerging and Selected Topics in Circuits and Systems. Vol. 15, no. 2, p. C3. Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11050017
Pub Date: 2025-06-25, DOI: 10.1109/JETCAS.2025.3573428
Title: IEEE Journal on Emerging and Selected Topics in Circuits and Systems Publication Information. Vol. 15, no. 2, p. C2. Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11050009
Pub Date: 2025-06-25, DOI: 10.1109/JETCAS.2025.3573430
Title: IEEE Journal on Emerging and Selected Topics in Circuits and Systems Information for Authors. Vol. 15, no. 2, p. 361. Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11050010
Pub Date: 2025-06-25, DOI: 10.1109/JETCAS.2025.3572258
Chuan Zhang;Naigang Wang;Jongsun Park;Li Zhang
Title: Guest Editorial Generative Artificial Intelligence Compute: Algorithms, Implementations, and Applications to CAS. Journal: IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 15, no. 2, pp. 144-148. Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11050018
Generative artificial intelligence (GenAI) has emerged as a pivotal focus in global innovation agendas, revealing transformative potential that extends beyond technological applications to reshape diverse societal domains. Given the fundamental dependency of GenAI deployment on circuits and systems (CAS), a co-evolutionary approach integrating both technological paradigms becomes imperative. This synergistic framework confronts three interrelated challenges: 1) developing deployment-ready GenAI algorithms, 2) engineering implementation-efficient CAS architectures, and 3) leveraging GenAI for autonomous CAS design, each representing a critical innovation vector. Given the rapid advancement of GenAI-CAS technologies, a comprehensive synthesis has become an urgent priority across academia and industry. Consequently, this timely review systematically analyzes current advancements, provides integrative perspectives, and identifies emerging research trajectories. This review endeavors to serve both the AI and CAS communities, thereby catalyzing an innovation feedback loop: GenAI-optimized CAS architectures in turn accelerate GenAI evolution through algorithm-hardware co-empowerment.
Title: Generative AI Through CAS Lens: An Integrated Overview of Algorithmic Optimizations, Architectural Advances, and Automated Designs. Authors: Chuan Zhang;You You;Naigang Wang;Jongsun Park;Li Zhang. Pub Date: 2025-06-06, DOI: 10.1109/JETCAS.2025.3575272. Journal: IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 15, no. 2, pp. 149-185. Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11024158
Pub Date: 2025-04-22, DOI: 10.1109/JETCAS.2025.3563228
Sanxin Jiang;Jiro Katto;Heming Sun
Denoising diffusion probabilistic models (DDPMs) have achieved significant success in various image generation tasks, but their application to image compression, especially learned image compression (LIC), remains quite limited. In this study, we introduce a rate-distortion (RD) guided diffusion model, referred to as RDDM, to enhance the performance of LIC. In RDDM, LIC is treated as a lossy codec function constrained by RD, which divides the input image into two parts through encoding and decoding operations: the reconstructed image and the residual image. The construction of RDDM rests on two points. First, RDDM treats diffusion models as repositories of image structures and textures, built using extensive real-world datasets. Under the guidance of RD constraints, it extracts and uses the necessary structural and textural priors from these repositories to restore the input image. Second, RDDM employs a Bayesian network to progressively infer the input image based on the reconstructed image and its codec function. Additionally, our research reveals that RDDM's performance declines when its codec function does not match the reconstructed image. However, using the highest-bitrate codec function minimizes this performance drop. The resulting model is referred to as $\text{RDDM}^{\star}$. The experimental results indicate that both RDDM and $\text{RDDM}^{\star}$ can be applied to various LIC architectures, such as CNNs, Transformers, and hybrids of the two. They can significantly improve the fidelity of these codecs while maintaining or even enhancing perceptual quality to some extent.
Title: RDDM: A Rate-Distortion Guided Diffusion Model for Learned Image Compression Enhancement. Journal: IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 15, no. 2, pp. 186-199.
Pub Date: 2025-04-21, DOI: 10.1109/JETCAS.2025.3562734
Andrea Belano;Yvan Tortorella;Angelo Garofalo;Luca Benini;Davide Rossi;Francesco Conti
Transformer-based generative Artificial Intelligence (GenAI) models achieve remarkable results in a wide range of fields, including natural language processing, computer vision, and audio processing. However, this comes at the cost of increased complexity and the need for sophisticated non-linearities such as softmax and GELU. Even though Transformers are computationally dominated by matrix multiplications (MatMul), these non-linearities can become a performance bottleneck, especially if dedicated hardware is used to accelerate MatMul operators. In this work, we introduce a GenAI BFloat16 Transformer acceleration template based on a heterogeneous tightly-coupled cluster containing 256 KiB of shared SRAM, 8 general-purpose RISC-V cores, a $24\times 8$ systolic array MatMul accelerator, and a novel accelerator for Transformer softmax, GELU and SiLU non-linearities: SoftEx. SoftEx introduces an approximate exponentiation algorithm balancing efficiency ($121\times$ speedup over glibc's implementation) with accuracy (mean relative error of 0.14%). In 12 nm technology, SoftEx occupies 0.039 mm$^2$, only 3.22% of the cluster, which achieves an operating frequency of 1.12 GHz. Compared to optimized software running on the RISC-V cores, SoftEx achieves significant improvements, accelerating softmax and GELU computations by up to $10.8\times$ and $5.11\times$, respectively, while reducing their energy consumption by up to $10.8\times$ and $5.29\times$. These enhancements translate into a $1.58\times$ increase in throughput (310 GOPS at 0.8 V) and a $1.42\times$ improvement in energy efficiency (1.34 TOPS/W at 0.55 V) on end-to-end ViT inference workloads.
Title: A Flexible Template for Edge Generative AI With High-Accuracy Accelerated Softmax and GELU. Journal: IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 15, no. 2, pp. 200-216.
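To make the idea of trading exponentiation accuracy for speed concrete, the sketch below shows a classic bit-manipulation approximation of exp(x) and a softmax built on it. This is not the SoftEx algorithm, whose details are not given in the abstract; it is the well-known trick of writing a scaled input directly into the exponent and mantissa fields of an IEEE 754 single-precision number, which amounts to a piecewise-linear approximation of 2^f. The function names `fast_exp` and `fast_softmax` and the clamping range are assumptions for illustration.

```python
import math
import struct


def fast_exp(x):
    """Approximate exp(x) by building the float32 bit pattern directly.

    exp(x) = 2 ** (x / ln 2). Scaling (x / ln 2 + 127) by 2**23 places the
    integer part in the float32 exponent field and the fractional part in
    the mantissa, i.e. 2**f is approximated by the line 1 + f.
    The relative error of this uncorrected form is a few percent.
    """
    x = max(-87.0, min(88.0, x))  # keep the result a normal float32
    bits = int((1 << 23) * (x / math.log(2) + 127))
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def fast_softmax(logits):
    """Softmax using fast_exp, with the usual max subtraction for stability."""
    peak = max(logits)
    exps = [fast_exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_softmax(logits):
    """Reference softmax using math.exp."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1, -1.5]
print("approx:", fast_softmax(logits))
print("exact: ", exact_softmax(logits))
```

A hardware design would typically add a correction step on the mantissa to push the error well below this sketch's few-percent level; the uncorrected version is shown only because it is short and easy to verify.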
Pub Date: 2025-04-21, DOI: 10.1109/JETCAS.2025.3562937
Siqi Cai;Gang Wang;Wenjie Li;Dongxu Lyu;Guanghui He
Large language models (LLMs) face high computational and memory demands. While prior studies have leveraged quantization to reduce memory requirements, critical challenges persist: unaligned memory accesses, significant quantization errors when handling outliers that span larger quantization ranges, and the increased hardware overhead associated with processing high-bit-width outliers. To address these issues, we propose a quantization algorithm and hardware architecture co-design for efficient LLM acceleration. Algorithmically, a grouped adaptive two-range quantization (ATRQ) with an in-group embedded identifier is proposed to encode outliers and normal values in distinct ranges, achieving hardware-friendly aligned memory access and reducing quantization errors. From a hardware perspective, we develop a low-overhead ATRQ decoder and an outlier-bit-split processing element (PE) to reduce the hardware overhead associated with high-bit-width outliers, effectively leveraging their inherent sparsity. To support mixed-precision computation and accommodate diverse dataflows during the prefilling and decoding phases, we design a reconfigurable local accumulator that mitigates the overhead associated with additional adders. Experimental results show that the ATRQ-based accelerator outperforms existing solutions, achieving up to $2.48\times$ speedup and $2.01\times$ energy reduction in the LLM prefilling phase, and $1.87\times$ speedup and $2.03\times$ energy reduction in the decoding phase, with superior model performance under post-training quantization.
Title: Adaptive Two-Range Quantization and Hardware Co-Design for Large Language Model Acceleration. Journal: IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 15, no. 2, pp. 272-284.
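The motivation for encoding outliers and normal values in distinct ranges can be shown with a toy comparison. The sketch below is not the ATRQ algorithm: the abstract does not specify how the in-group identifier is embedded or how ranges are adapted, so this example simply assumes a per-value range flag and a caller-supplied `threshold`. Both are hypothetical choices made only to show why one shared range hurts small values when a group contains an outlier.

```python
def dequantized_uniform(values, max_abs, bits):
    """Symmetric uniform quantization over one range, returned dequantized."""
    levels = (1 << (bits - 1)) - 1          # e.g. 7 for signed 4-bit codes
    scale = max_abs / levels
    out = []
    for v in values:
        code = max(-levels, min(levels, round(v / scale)))
        out.append(code * scale)
    return out


def dequantized_two_range(values, threshold, bits):
    """Toy two-range scheme: fine scale below threshold, coarse scale above.

    Each value implicitly carries a one-bit range flag (is_outlier); this
    flag is a hypothetical stand-in, not the paper's embedded identifier.
    """
    levels = (1 << (bits - 1)) - 1
    peak = max(abs(v) for v in values)
    inner_scale = threshold / levels        # resolution for normal values
    outer_scale = peak / levels             # resolution for outliers
    out = []
    for v in values:
        is_outlier = abs(v) > threshold
        scale = outer_scale if is_outlier else inner_scale
        code = max(-levels, min(levels, round(v / scale)))
        out.append(code * scale)
    return out


def mse(reference, approx):
    """Mean squared error between two equal-length sequences."""
    return sum((r - a) ** 2 for r, a in zip(reference, approx)) / len(reference)


# One group of weights with a single large outlier (made-up numbers).
group = [0.1, -0.2, 0.05, 0.3, -0.15, 8.0, 0.25, -0.1]
single = dequantized_uniform(group, max(abs(v) for v in group), bits=4)
double = dequantized_two_range(group, threshold=0.3, bits=4)

print("single-range MSE:", mse(group, single))
print("two-range MSE:   ", mse(group, double))
```

In the single-range case the outlier stretches the step size so far that every small value rounds to zero, whereas the two-range version keeps a fine step for them. The cost in this toy version is the extra flag per value, which is the kind of overhead a practical scheme has to manage to keep memory accesses aligned.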