{"title":"Bandwidth-Latency-Thermal Co-Optimization of Interconnect-Dominated Many-Core 3D-IC","authors":"Sudipta Das;Samuel Riedel;Mohamed Naeim;Moritz Brunion;Marco Bertuletti;Luca Benini;Julien Ryckaert;James Myers;Dwaipayan Biswas;Dragomir Milojevic","doi":"10.1109/TVLSI.2024.3467148","DOIUrl":null,"url":null,"abstract":"The ongoing integration of advanced functionalities in contemporary system-on-chips (SoCs) poses significant challenges related to memory bandwidth, capacity, and thermal stability. These challenges are further amplified with the advancement of artificial intelligence (AI), necessitating enhanced memory and interconnect bandwidth and latency. This article presents a comprehensive study encompassing architectural modifications of an interconnect-dominated many-core SoC targeting the significant increase of intermediate, on-chip cache memory bandwidth and access latency tuning. The proposed SoC has been implemented in 3-D using A10 nanosheet technology and early thermal analysis has been performed. Our workload simulations reveal, respectively, up to 12- and 2.5-fold acceleration in the 64-core and 16-core versions of the SoC. Such speed-up comes at 40% increase in die-area and a 60% rise in power dissipation when implemented in 2-D. In contrast, the 3-D counterpart not only minimizes the footprint but also yields 20% power savings, attributable to a 40% reduction in wirelength. The article further highlights the importance of pipeline restructuring to leverage the potential of 3-D technology for achieving lower latency and more efficient memory access. Finally, we discuss the thermal implications of various 3-D partitioning schemes in High Performance Computing (HPC) and mobile applications. Our analysis reveals that, unlike high-power density HPC cases, 3-D mobile case increases <inline-formula> <tex-math>$T_{\\max }$ </tex-math></inline-formula> only by <inline-formula> <tex-math>$2~^{\\circ } $ </tex-math></inline-formula>C–<inline-formula> <tex-math>$3~^{\\circ } $ </tex-math></inline-formula>C compared to 2-D, while the HPC scenario analysis requires multiconstrained efficient partitioning for 3-D implementations.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 2","pages":"346-357"},"PeriodicalIF":2.8000,"publicationDate":"2024-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10720515/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
引用次数: 0
Abstract
The ongoing integration of advanced functionalities in contemporary system-on-chips (SoCs) poses significant challenges related to memory bandwidth, capacity, and thermal stability. These challenges are further amplified with the advancement of artificial intelligence (AI), necessitating enhanced memory and interconnect bandwidth and latency. This article presents a comprehensive study encompassing architectural modifications of an interconnect-dominated many-core SoC targeting the significant increase of intermediate, on-chip cache memory bandwidth and access latency tuning. The proposed SoC has been implemented in 3-D using A10 nanosheet technology and early thermal analysis has been performed. Our workload simulations reveal, respectively, up to 12- and 2.5-fold acceleration in the 64-core and 16-core versions of the SoC. Such speed-up comes at 40% increase in die-area and a 60% rise in power dissipation when implemented in 2-D. In contrast, the 3-D counterpart not only minimizes the footprint but also yields 20% power savings, attributable to a 40% reduction in wirelength. The article further highlights the importance of pipeline restructuring to leverage the potential of 3-D technology for achieving lower latency and more efficient memory access. Finally, we discuss the thermal implications of various 3-D partitioning schemes in High Performance Computing (HPC) and mobile applications. Our analysis reveals that, unlike high-power density HPC cases, 3-D mobile case increases $T_{\max }$ only by $2~^{\circ } $ C–$3~^{\circ } $ C compared to 2-D, while the HPC scenario analysis requires multiconstrained efficient partitioning for 3-D implementations.
期刊介绍:
The IEEE Transactions on VLSI Systems is published as a monthly journal under the co-sponsorship of the IEEE Circuits and Systems Society, the IEEE Computer Society, and the IEEE Solid-State Circuits Society.
Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chips and wafer fabrication, packaging, testing and systems applications. Generation of specifications, design and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor and process levels.
To address this critical area through a common forum, the IEEE Transactions on VLSI Systems have been founded. The editorial board, consisting of international experts, invites original papers which emphasize and merit the novel systems integration aspects of microelectronic systems including interactions among systems design and partitioning, logic and memory design, digital and analog circuit design, layout synthesis, CAD tools, chips and wafer fabrication, testing and packaging, and systems level qualification. Thus, the coverage of these Transactions will focus on VLSI/ULSI microelectronic systems integration.