Pub Date : 2025-09-15  DOI: 10.1109/JETCAS.2025.3603802
"IEEE Journal on Emerging and Selected Topics in Circuits and Systems Publication Information," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 15, no. 3, pp. C2–C2.
Pub Date : 2025-08-07  DOI: 10.1109/JETCAS.2025.3596593
Janak Sharda;Shimeng Yu
Recent progress in large language models (LLMs) suggests that they can be deployed on personal devices once model sizes are reduced to a few to dozens of GB. Still, computation on intermediate data is intensive and requires frequent reloading from high-bandwidth memory (HBM). Today’s HBM bandwidth is limited by the number of channels embedded in a 2.5D integrated system. Advanced packaging techniques such as through-silicon vias (TSVs) and Cu-Cu hybrid bonding (HB) could provide higher-bandwidth interconnects between memory and logic dies in a 3D integrated system, where the vertical interconnect shortens the distance between memory and logic and reduces total energy consumption. However, this creates a large design exploration space for mixing and matching packaging techniques and can lead to complex thermal management issues due to the proximity of components. In this work, we describe an evaluation methodology used to construct a framework for benchmarking system-level power, performance, and area (PPA) metrics of 2.5D/3D integrated systems for LLM accelerators. We then use the framework to identify the bottlenecks for training and inference across various models and batch sizes. We observe that memory bandwidth and routing energy bottleneck inference performance, while the available compute bottlenecks training performance. Finally, we perform thermal evaluations to examine the trade-off between peak operating temperature and throughput across different packaging configurations.
{"title":"System-Technology Co-Optimization Methodology for LLM Accelerators With Advanced Packaging","authors":"Janak Sharda;Shimeng Yu","doi":"10.1109/JETCAS.2025.3596593","DOIUrl":"https://doi.org/10.1109/JETCAS.2025.3596593","url":null,"abstract":"Recent progress in large language models (LLMs) suggests the feasibility of their deployment on personal devices with model size reduction to a few to dozens of GB. Still, intermediate data’s computing needs are intensive, requiring frequent data reloading from the high-bandwidth memory (HBM). Today’s HBM bandwidth is limited by the number of channels embedded in a 2.5D integrated system. Advanced packaging techniques such as through silicon vias (TSV) and Cu-Cu hybrid bonding (HB) could potentially provide higher bandwidth interconnects between memory and logic dies in a 3D integrated system, where the vertical interconnect can reduce the distance between memory and logic, reducing the total energy consumption. However, this creates a large design exploration space for mixing and matching different packaging techniques and can result in complex thermal management issues due to the proximity of various components. In this work, we describe an evaluation methodology which is used to construct a framework capable of benchmarking system-level power, performance, and area (PPA) metrics for 2.5D/3D integrated systems for LLM accelerators. Additionally, we utilize the framework to conduct a detailed analysis to identify the bottlenecks for training and inference across various models and batch sizes. It is observed that the memory bandwidth and routing energy bottlenecks the inference performance, and the available compute bottlenecks the training performance. Finally, we perform thermal evaluations to observe the trade-off between peak operating temperature and the throughput across different packaging configurations.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"15 4","pages":"577-584"},"PeriodicalIF":3.8,"publicationDate":"2025-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145808550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-08-05  DOI: 10.1109/JETCAS.2025.3595909
George Karfakis;Myriam Bouzidi;Yunhyeok Im;Alexander Graening;Suresh K. Sitaraman;Puneet Gupta
This paper investigates thermal management in tightly integrated heterogeneous chiplet systems, focusing on a novel approach using embedded thermal isolators. In many 2.5D systems, such as modern enterprise GPUs, thermally sensitive chiplets like High Bandwidth Memory (HBM) are thermally coupled to high-power compute chiplets, leading to performance degradation. We propose and evaluate the use of thermal isolators embedded within the heat spreader to effectively thermally decouple chiplets. Our thermal simulations of a water-cooled 2.5D integrated GPU system indicate that conventional approaches like thermally-aware floorplanning are less effective due to the dominant heat transfer through the heat spreader. In contrast, our proposed thermal isolators can significantly increase thermal isolation between chiplets (by up to 61%), or even reduce overall average peak chip temperature (by up to 22.5%). We develop a closed-loop workflow incorporating thermal results to quantify performance impacts of thermal-induced throttling, finding that in an example GPU+HBM system, the isolator approach can yield performance gains of up to 37% for memory-bound workloads. These findings open up new avenues for thermal management and thermal-system co-optimization in 2.5D heterogeneous integrated systems, potentially enabling more efficient and higher-performing chiplet-based architectures.
{"title":"Optimizing Thermal Performance in 2.5D Systems Using Embedded Isolators","authors":"George Karfakis;Myriam Bouzidi;Yunhyeok Im;Alexander Graening;Suresh K. Sitaraman;Puneet Gupta","doi":"10.1109/JETCAS.2025.3595909","DOIUrl":"https://doi.org/10.1109/JETCAS.2025.3595909","url":null,"abstract":"This paper investigates thermal management in tightly integrated heterogeneous chiplet systems, focusing on a novel approach using embedded thermal isolators. In many 2.5D systems, such as modern enterprise GPUs, thermally sensitive chiplets like High Bandwidth Memory (HBM) are thermally coupled to high-power compute chiplets, leading to performance degradation. We propose and evaluate the use of thermal isolators embedded within the heat spreader to effectively thermally decouple chiplets. Our thermal simulations of a water-cooled 2.5D integrated GPU system indicate that conventional approaches like thermally-aware floorplanning are less effective due to the dominant heat transfer through the heat spreader. In contrast, our proposed thermal isolators can significantly increase thermal isolation between chiplets (by up to 61%), or even reduce overall average peak chip temperature (by up to 22.5%). We develop a closed-loop workflow incorporating thermal results to quantify performance impacts of thermal-induced throttling, finding that in an example GPU+HBM system, the isolator approach can yield performance gains of up to 37% for memory-bound workloads. These findings open up new avenues for thermal management and thermal-system co-optimization in 2.5D heterogeneous integrated systems, potentially enabling more efficient and higher-performing chiplet-based architectures.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"15 3","pages":"458-468"},"PeriodicalIF":3.8,"publicationDate":"2025-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145061945","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-08-01  DOI: 10.1109/JETCAS.2025.3594675
Galib Ibne Haidar;Jingbo Zhou;Md Sami Ul Islam Sami;Mark M. Tehranipoor;Farimah Farahmandi
System-in-Packages (SiPs) are gaining traction due to their enhanced performance, high yield rates, and accelerated time-to-market. However, integrating chiplets from untrusted sources introduces security risks during post-integration testing. Malicious chiplets within the SiP can intercept, modify, or block sensitive test data intended for specific chiplets. This article presents SAFET-HI, a framework designed to ensure a secure testing environment for SiPs. Within this framework, sensitive test data are accessible only to authenticated chiplets. To counter sniffing and spoofing attacks, SAFET-HI encrypts sensitive test patterns while maintaining minimal timing overhead. During post-integration testing, another major threat arises from outsourcing test patterns to untrusted testing facilities, increasing the risk of overproduction and counterfeiting. To address this, SAFET-HI incorporates a functional locking mechanism that prevents unauthorized production and distribution of defective SiPs. Additionally, scan encryption blocks are implemented to stop untrusted test facilities from generating a golden response database. To further enhance security, a watermark bitstream is embedded within the SiP to prevent remarking attacks by untrusted distributors. Simulation results show that SAFET-HI incurs area and timing overheads of only 1.42-4.27% and 13.7%, respectively, demonstrating its effectiveness in securing the SiP testing process.
{"title":"SAFET-HI: Secure Authentication-Based Framework for Encrypted Testing in Heterogeneous Integration","authors":"Galib Ibne Haidar;Jingbo Zhou;Md Sami Ul Islam Sami;Mark M. Tehranipoor;Farimah Farahmandi","doi":"10.1109/JETCAS.2025.3594675","DOIUrl":"https://doi.org/10.1109/JETCAS.2025.3594675","url":null,"abstract":"System-in-Packages (SiPs) are gaining traction due to their enhanced performance, high yield rates, and accelerated time-to-market. However, integrating chiplets from untrusted sources introduces security risks during post-integration testing. Malicious chiplets within the SiP can intercept, modify, or block sensitive test data intended for specific chiplets. This article presents SAFET-HI, a framework designed to ensure a secure testing environment for SiPs. Within this framework, sensitive test data are accessible only to authenticated chiplets. To counter sniffing and spoofing attacks, SAFET-HI encrypts sensitive test patterns while maintaining minimal timing overhead. During post-integration testing, another major threat arises from outsourcing test patterns to untrusted testing facilities, increasing the risk of overproduction and counterfeiting. To address this, SAFET-HI incorporates a functional locking mechanism that prevents unauthorized production and distribution of defective SiPs. Additionally, scan encryption blocks are implemented to stop untrusted test facilities from generating a golden response database. To further enhance security, a watermark bitstream is embedded within the SiP to prevent remarking attacks by untrusted distributors. Simulation results show that SAFET-HI incurs area and timing overheads of only 1.42-4.27% and 13.7%, respectively, demonstrating its effectiveness in securing the SiP testing process.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"15 3","pages":"478-492"},"PeriodicalIF":3.8,"publicationDate":"2025-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145061894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-30  DOI: 10.1109/JETCAS.2025.3594169
Wei Lu;Jie Zhang;Yi-Hui Wei;Hsu-Ming Hsiao;Sih-Han Li;Chao-Kai Hsu;Chih-Cheng Hsiao;Feng-Hsiang Lo;Shyh-Shyuan Sheu;Chin-Hung Wang;Ching-Iang Li;Yung-Sheng Chang;Ming-Ji Dai;Wei-Chung Lo;Shih-Chieh Chang;Hung-Ming Chen;Kuan-Neng Chen;Po-Tsang Huang
This paper presents the Embedded Multi-die Active Bridge (EMAB) chip, a programmable bridge for cost-effective 2.5D/3.5D packaging technologies. The EMAB chip features a reconfigurable switch array that establishes flexible I/O links for connecting multiple chiplets, forming an EMAB chipset according to user needs. It integrates low-dropout regulators (LDOs) for in-package voltage regulation and supports various transmission interfaces, including checkerboard I/Os (50 Mbps–1 Gbps) and MUX I/Os (up to 8 Gbps). Moreover, multiple EMAB chips can be interconnected in a daisy-chain configuration, enabling easy expansion of the EMAB chipset. Additionally, the EMAB chip eliminates TSVs in silicon interposer-based 2.5D packaging technologies and reduces redistribution layer (RDL) complexity through the flexible I/O links established within the EMAB chip. Furthermore, the EMAB chip can be pre-manufactured as a precast supporting layer (a known good die, KGD), which shortens the product development cycle and enhances integration yield. Overall, the EMAB chip offers a miniaturized, low-cost, fast time-to-market, and scalable solution for advanced 2.5D/3.5D packaging.
"Miniaturized and Cost-Effective Programmable 2.5D/3.5D Platforms Enabled by Scalable Embedded Active Bridge Chipset," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 15, no. 3, pp. 379–391.
Pub Date : 2025-07-28  DOI: 10.1109/JETCAS.2025.3592984
Juan Suzano;Anthony Philippe;Fady Abouzeid;Giorgio Di Natale;Philippe Roche
Chiplet-based chips are the natural evolution of traditional 2D SoCs. In the future, off-the-shelf chiplets are expected to represent an important component of the semiconductor industry. The IEEE Std 1838(TM)-2019 design-for-testability (DFT) standard enables testing of stacked chiplets from multiple vendors. However, the shared DFT network threatens the confidentiality and integrity of test data and other sensitive information. This paper addresses the security concerns associated with DFT infrastructures in chiplet-based systems. We discuss the necessity of securing DFT infrastructures to prevent unauthorized access and malicious activities. Furthermore, we propose a hardware countermeasure that combines encryption and encoding to secure communication over the DFT network. Results show that the DFT can be protected from misbehavior by malicious chiplets in the stack, from scan-based attacks, and from brute-force attacks, with minimal area and test-time overhead. The proposed solution incurs less than 1% area overhead on designs of more than 5 million gates and less than 1% test-time overhead for typical DFT implementations.
{"title":"Enhancing DFT Security in Chiplet-Based Systems With Encryption and Integrity Checking","authors":"Juan Suzano;Anthony Philippe;Fady Abouzeid;Giorgio Di Natale;Philippe Roche","doi":"10.1109/JETCAS.2025.3592984","DOIUrl":"https://doi.org/10.1109/JETCAS.2025.3592984","url":null,"abstract":"Chiplet-based chips are the natural evolution of traditional 2D SoCs. In the future, off-the-shelf chiplets are expected to represent an important component of the semiconductor industry. The IEEE Std 1838(TM)-2019 design-for-testability (DFT) standard enable testing of stacked chiplets from multiple vendors. However, the shared DFT network threatens the confidentiality and integrity of test data and other sensitive information. This paper addresses the security concerns associated with DFT infrastructures in chiplet-based systems. We discuss the necessity of securing DFT infrastructures to prevent unauthorized access and malicious activities. Furthermore, we propose a hardware countermeasure that combines encryption and encoding to secure communication over the DFT network. Results show that the DFT can be protected from misbehavior from malicious chiplets on the stack, scan-based attacks, and brute force attacks with minimal overhead in terms of area and test time. The proposed solution causes less than 1% area overhead on designs composed of more than 5 million gates and less than 1% test time overhead for typical DFT implementations.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"15 3","pages":"493-505"},"PeriodicalIF":3.8,"publicationDate":"2025-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145061903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-28  DOI: 10.1109/JETCAS.2025.3592902
Shenggao Li;Maher Amer
In this paper, we investigate statistical bit error rate (BER) analysis for low-loss short-reach chiplet interfaces and high-loss long-reach serial interfaces. We use jitter filtering to account for the residual jitter not tracked by a forwarded-clock system and propose a fast and exact statistical BER method that accounts for the Tx jitter amplification effect in a high-loss channel. The proposed method achieves linear computational complexity.
{"title":"Fast and Accurate Jitter Amplification Modeling With Variable Pulse Width Response for Statistical BER Analysis in Chiplet Interconnects and Beyond","authors":"Shenggao Li;Maher Amer","doi":"10.1109/JETCAS.2025.3592902","DOIUrl":"https://doi.org/10.1109/JETCAS.2025.3592902","url":null,"abstract":"In this paper, we investigate Statistical Bit Error Rate (BER) analysis for low-loss short-reach chiplet interface and high-loss long-reach serial interface. We used jitter filtering to account for the residue jitter not tracked by a forwarded clock system and proposed a fast and exact Statistical BER method to account for the Tx jitter amplification effect in a high-loss channel. Our proposed method achieves a linear computation complexity.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"15 4","pages":"609-618"},"PeriodicalIF":3.8,"publicationDate":"2025-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11097288","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145808574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-25  DOI: 10.1109/JETCAS.2025.3592677
Matthew Joseph Adiletta;Gu-Yeon Wei;David Brooks
The demand for efficient machine learning in edge devices is challenging the capabilities of general-purpose computing systems. While domain-specific Systems on Chip (SoCs) are efficient, they are often prohibitively expensive due to long design times and high design costs. To address these limitations, the community has begun to explore System-in-Package (SiP) designs that assemble reusable accelerators, available as chiplets, at low cost to democratize customization. This presents a new challenge of macro-architecture design space exploration (DSE). Prior works do not address this problem, having only investigated micro-architecture design and optimization of homogeneous SiPs. To address this need, and to unlock the potential of assembling custom SiPs comprising heterogeneous chiplets, we introduce an early DSE framework, CASCADE. CASCADE employs fast, first-order performance models to capture the trade-offs of composable compute chiplets, leveraging tool-generated traces to comprehend dataflow patterns in the context of state-of-the-art machine learning tasks. Using CASCADE, we assess the performance benefits of composable SiPs comprising hetero-chiplets in single-tenant and two-tenant scenarios. Notably, we demonstrate that hetero-chiplet systems can deliver speedups in the range of 3-5x, depending on the application, compared to a baseline GPU chiplet system.
{"title":"Democratizing Customization for ML at the Edge Through Hetero-Chiplet SiP Architectures","authors":"Matthew Joseph Adiletta;Gu-Yeon Wei;David Brooks","doi":"10.1109/JETCAS.2025.3592677","DOIUrl":"https://doi.org/10.1109/JETCAS.2025.3592677","url":null,"abstract":"The demand for efficient machine learning in edge devices is challenging the capabilities of general-purpose computing systems. While domain-specific System on Chip (SoCs) are efficient, they are often prohibitively expensive due to long design times and high design costs. To address these limitations, the community has begun to explore System in Package (SiP) designs for low-cost assembly of reusable accelerators, available as chiplets, to democratize customization. This presents a new challenge of <italic>macro-architecture</i> design space exploration (DSE). Prior works do not address this problem, having only investigated micro-architecture design and optimization of homogeneous SiPs. To address this need, and unlock the potential of assembling custom SiPs, comprising heterogeneous chiplets, we introduce an early DSE framework, <italic>CASCADE</i> – A. <italic>CASCADE</i> employs fast, first-order performance models to capture the tradeoffs of composable compute chiplets, leveraging tool-generated traces to comprehend dataflow patterns in the context of state-of-the-art machine learning tasks. Using <italic>CASCADE</i>, we assess the performance benefits of composable SiPs comprising hetero-chiplets for single-tenant and two-tenant scenarios. Notably, we demonstrate that hetero-chiplet systems can deliver speedups in the range of 3-5x, depending on the application, compared to a baseline GPU chiplet system.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"15 4","pages":"634-647"},"PeriodicalIF":3.8,"publicationDate":"2025-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11096615","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145808635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-23  DOI: 10.1109/JETCAS.2025.3591812
Juliana Curry;Yuan Li;Ahmed Louri;Avinash Karanth;Razvan Bunescu
As deep neural network (DNN) models continue to grow in complexity, analog computing architectures have emerged as a promising solution to meet increasing computational demands. Among these, silicon photonic computing excels at efficiently executing dot product operations while leveraging inherent parallelism. Photonic phase change memory (photonic-PCM) further enhances photonic computing by enabling scalable, non-volatile storage. In this work, we introduce the 3D Large-Scale Photonic Accelerator (LSPA), a novel photonic computing architecture designed for large-scale DNN models. LSPA employs multi-layered 3D stacking of non-volatile photonic-PCM cells, creating a high-density computational fabric that optimizes energy efficiency, flexibility, and scalability. LSPA’s custom 3D photonic network enables simultaneous data multicast in two dimensions and accumulation in three dimensions, optimizing communication patterns essential for efficient DNN training. A distinctive feature of LSPA is its ability to execute multiple forward and backward passes in parallel within each mini-batch, reducing latency associated with data movement and photonic-PCM programming. This unique capability combined with high-bandwidth photonic interconnects allows LSPA to sustain efficient training across a wide range of DNN workloads. When evaluated against a range of neural network models including VGG-16, ResNet-50, GoogLeNet, Transformer, GNMT, LLaMA 7B, and LLaMA 30B, LSPA reduces execution time by up to 92% and energy consumption by up to 90%. These results highlight LSPA as a transformative advancement in scalable, high-performance photonic computing for deep learning.
"Extending Energy-Efficient and Scalable DNN Training and Inference With 3-D Photonic Accelerator," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 15, no. 4, pp. 560–576.
Pub Date : 2025-07-22  DOI: 10.1109/JETCAS.2025.3591627
Shinji Sugatani;Hiroyuki Ryoson;Norio Chujo;Masao Taguchi;Koji Sakui;Takayuki Ohba
This paper describes the architecture of the wafer-on-wafer (WOW) via-last through-silicon via (TSV), named the Bumpless Build Cube TSV (BBCube-TSV). First, three types of TSVs, based on μ-bump technology, hybrid bonding technology, and the BBCube-TSV, are reviewed, addressing their detailed structures and the opportunities for applying them to 3D memories. Then, the process steps of the BBCube-TSV are summarized to identify the key steps. Three types of applications are reviewed to illustrate and discuss the potential of the BBCube-TSV: enhancing 3D memories, power delivery wiring for a processor on stacked memory devices, and defect management advantages for stacked memories. The simplicity of the structure and the copper fill of the TSV are found to provide these advantages. The role of the TSV as a vertical interconnect in the hierarchy of multilayer wiring is also discussed.
"A Through Silicon Via (TSV) Architecture of the Bumpless Build Cube (BBCube) for Stacked Memory Devices," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 15, no. 3, pp. 368–378.