Occamy: A 432-Core Dual-Chiplet Dual-HBM2E 768-DP-GFLOP/s RISC-V System for 8-to-64-bit Dense and Sparse Computing in 12-nm FinFET

Impact factor: 5.6; Zone 1 (Engineering & Technology); JCR Q1 (Engineering, Electrical & Electronic); IEEE Journal of Solid-State Circuits; Pub Date: 2025-01-30; DOI: 10.1109/JSSC.2025.3529249
Paul Scheffler;Thomas Benz;Viviane Potocnik;Tim Fischer;Luca Colagrande;Nils Wistoff;Yichao Zhang;Luca Bertaccini;Gianmarco Ottavi;Manuel Eggimann;Matheus Cavalcante;Gianna Paulin;Frank K. Gürkaynak;Davide Rossi;Luca Benini
IEEE Journal of Solid-State Circuits, vol. 60, no. 4, pp. 1324-1338, 2025-01-30. https://ieeexplore.ieee.org/document/10858367/

Abstract

Machine learning (ML) and high-performance computing (HPC) applications increasingly combine dense and sparse memory access computations to maximize storage efficiency. However, existing central processing units (CPUs) and graphics processing units (GPUs) struggle to flexibly handle these heterogeneous workloads with consistently high compute efficiency. We present Occamy, a 432-core, 768-DP-GFLOP/s, dual-HBM2E, dual-chiplet RISC-V system with a latency-tolerant hierarchical interconnect and in-core streaming units (SUs) designed to accelerate dense and sparse FP8-to-FP64 ML and HPC workloads. We implement Occamy’s compute chiplets in 12-nm FinFET and its passive interposer, Hedwig, in a 65-nm node. On dense linear algebra (LA), Occamy achieves a competitive floating-point unit (FPU) utilization of 89%. On stencil codes, Occamy reaches an FPU utilization of 83% and a technology-node-normalized compute density of 11.1 DP-GFLOP/s/mm$^2$, leading state-of-the-art (SoA) processors by $1.7\times$ and $1.2\times$, respectively. On sparse–dense LA, it achieves 42% FPU utilization and a normalized compute density of 5.95 DP-GFLOP/s/mm$^2$, surpassing the SoA by $5.2\times$ and $11\times$, respectively. On sparse–sparse LA, Occamy reaches a throughput of up to 187 GCOMP/s at 17.4 GCOMP/s/W and a compute density of 3.63 GCOMP/s/mm$^2$. Finally, we reach up to 75% and 54% FPU utilization on dense (large language model) and graph-sparse (graph convolutional network) ML inference workloads, respectively. Occamy’s register transfer level (RTL) description is freely available under a permissive open-source license.
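To make the three linear-algebra workload classes named in the abstract concrete, the following minimal Python sketch contrasts a dense, a sparse-dense, and a sparse-sparse kernel. It is not Occamy's implementation: the CSR layout, the function names, and the comparison counter are illustrative assumptions. In particular, counting index comparisons is only one plausible reading of the "GCOMP" unit, which the abstract does not define.

```python
from typing import List, Tuple


def dense_matvec(a: List[List[float]], x: List[float]) -> List[float]:
    """Dense kernel: every matrix element is stored and multiplied."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def csr_matvec(row_ptr: List[int], col_idx: List[int],
               vals: List[float], x: List[float]) -> List[float]:
    """Sparse-dense kernel: CSR matrix times dense vector.

    Only nonzeros are stored; the dense operand is read indirectly
    through x[col_idx[k]], which is the irregular access pattern.
    """
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += vals[k] * x[col_idx[k]]
        y.append(acc)
    return y


def sparse_sparse_dot(idx_a: List[int], val_a: List[float],
                      idx_b: List[int], val_b: List[float]) -> Tuple[float, int]:
    """Sparse-sparse kernel: dot product of two sparse vectors.

    Both index lists must be sorted ascending. Returns the dot product
    and the number of index comparisons performed (hypothetical metric).
    """
    i = j = comparisons = 0
    acc = 0.0
    while i < len(idx_a) and j < len(idx_b):
        comparisons += 1
        if idx_a[i] == idx_b[j]:      # matching index: one multiply-add
            acc += val_a[i] * val_b[j]
            i += 1
            j += 1
        elif idx_a[i] < idx_b[j]:     # advance whichever index lags
            i += 1
        else:
            j += 1
    return acc, comparisons


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    dense = [[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]]
    print(dense_matvec(dense, x))
    # Same matrix in CSR form: row pointers, column indices, nonzero values.
    print(csr_matvec([0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0], x))
    print(sparse_sparse_dot([0, 2, 5], [1.0, 2.0, 3.0],
                            [2, 3, 5], [4.0, 5.0, 6.0]))
```

The sketch shows why the sparse kernels are harder to keep busy than the dense one: the sparse-dense loop spends part of its work on indirect loads, and the sparse-sparse loop spends most iterations comparing indices rather than doing floating-point work.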
Source journal
IEEE Journal of Solid-State Circuits (Engineering & Technology: Engineering, Electrical & Electronic)
CiteScore: 11.00
Self-citation rate: 20.40%
Articles published: 351
Review time: 3-6 weeks
About the journal: The IEEE Journal of Solid-State Circuits publishes papers each month in the broad area of solid-state circuits with particular emphasis on transistor-level design of integrated circuits. It also provides coverage of topics such as circuits modeling, technology, systems design, layout, and testing that relate directly to IC design. Integrated circuits and VLSI are of principal interest; material related to discrete circuit design is seldom published. Experimental verification is strongly encouraged.