Paul Scheffler;Thomas Benz;Viviane Potocnik;Tim Fischer;Luca Colagrande;Nils Wistoff;Yichao Zhang;Luca Bertaccini;Gianmarco Ottavi;Manuel Eggimann;Matheus Cavalcante;Gianna Paulin;Frank K. Gürkaynak;Davide Rossi;Luca Benini
{"title":"Occamy: A 432-Core Dual-Chiplet Dual-HBM2E 768-DP-GFLOP/s RISC-V System for 8-to-64-bit Dense and Sparse Computing in 12-nm FinFET","authors":"Paul Scheffler;Thomas Benz;Viviane Potocnik;Tim Fischer;Luca Colagrande;Nils Wistoff;Yichao Zhang;Luca Bertaccini;Gianmarco Ottavi;Manuel Eggimann;Matheus Cavalcante;Gianna Paulin;Frank K. Gürkaynak;Davide Rossi;Luca Benini","doi":"10.1109/JSSC.2025.3529249","DOIUrl":null,"url":null,"abstract":"Machine learning (ML) and high-performance computing (HPC) applications increasingly combine dense and sparse memory access computations to maximize storage efficiency. However, existing central processing units (CPUs) and graphics processing units (GPUs) struggle to flexibly handle these heterogeneous workloads with consistently high compute efficiency. We present Occamy, a 432-core, 768-DP-GFLOP/s, dual-HBM2E, dual-chiplet RISC-V system with a latency-tolerant hierarchical interconnect and in-core streaming units (SUs) designed to accelerate dense and sparse FP8-to-FP64 ML and HPC workloads. We implement Occamy’s compute chiplets in 12-nm FinFET and its passive interposer, Hedwig, in a 65-nm node. On dense linear algebra (LA), Occamy achieves a competitive floating-point unit (FPU) utilization of 89%. On stencil codes, Occamy reaches an FPU utilization of 83% and a technology-node-normalized compute density of 11.1 DP-GFLOP/s/mm2, leading state-of-the-art (SoA) processors by <inline-formula> <tex-math>$1.7\\times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$1.2\\times $ </tex-math></inline-formula>, respectively. On sparse-dense LA, it achieves 42% FPU utilization and a normalized compute density of 5.95 DP-GFLOP/s/mm2, surpassing the SoA by <inline-formula> <tex-math>$5.2\\times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$11\\times $ </tex-math></inline-formula>, respectively. 
On sparse–sparse LA, Occamy reaches a throughput of up to 187 GCOMP/s at 17.4 GCOMP/s/W and a compute density of 3.63 GCOMP/s/mm2. Finally, we reach up to 75% and 54% FPU utilization on and dense (large language model) and graph-sparse (graph convolutional network) ML inference workloads. Occamy’s register transfer level (RTL) description is freely available under a permissive open-source license.","PeriodicalId":13129,"journal":{"name":"IEEE Journal of Solid-state Circuits","volume":"60 4","pages":"1324-1338"},"PeriodicalIF":5.6000,"publicationDate":"2025-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Journal of Solid-state Circuits","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10858367/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0
Abstract
Machine learning (ML) and high-performance computing (HPC) applications increasingly combine dense and sparse memory access computations to maximize storage efficiency. However, existing central processing units (CPUs) and graphics processing units (GPUs) struggle to flexibly handle these heterogeneous workloads with consistently high compute efficiency. We present Occamy, a 432-core, 768-DP-GFLOP/s, dual-HBM2E, dual-chiplet RISC-V system with a latency-tolerant hierarchical interconnect and in-core streaming units (SUs) designed to accelerate dense and sparse FP8-to-FP64 ML and HPC workloads. We implement Occamy’s compute chiplets in 12-nm FinFET and its passive interposer, Hedwig, in a 65-nm node. On dense linear algebra (LA), Occamy achieves a competitive floating-point unit (FPU) utilization of 89%. On stencil codes, Occamy reaches an FPU utilization of 83% and a technology-node-normalized compute density of 11.1 DP-GFLOP/s/mm2, leading state-of-the-art (SoA) processors by $1.7\times$ and $1.2\times$, respectively. On sparse–dense LA, it achieves 42% FPU utilization and a normalized compute density of 5.95 DP-GFLOP/s/mm2, surpassing the SoA by $5.2\times$ and $11\times$, respectively. On sparse–sparse LA, Occamy reaches a throughput of up to 187 GCOMP/s at 17.4 GCOMP/s/W and a compute density of 3.63 GCOMP/s/mm2. Finally, we reach up to 75% and 54% FPU utilization on dense (large language model) and graph-sparse (graph convolutional network) ML inference workloads, respectively. Occamy’s register transfer level (RTL) description is freely available under a permissive open-source license.
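A few secondary figures follow directly from the abstract's headline numbers. The sketch below is illustrative arithmetic only: the inputs are taken verbatim from the abstract, but the derived quantities (per-core peak throughput, implied power during the sparse–sparse run) are not figures reported by the paper itself.

```python
# Inputs quoted directly from the abstract.
peak_dp_gflops = 768.0      # system peak throughput, DP-GFLOP/s
cores = 432                 # total RISC-V core count

sparse_throughput = 187.0   # sparse-sparse LA throughput, GCOMP/s
sparse_efficiency = 17.4    # sparse-sparse LA energy efficiency, GCOMP/s/W

# Peak DP throughput per core implied by the system-level numbers.
per_core_gflops = peak_dp_gflops / cores

# Power draw implied by the sparse-sparse throughput and efficiency figures.
implied_power_w = sparse_throughput / sparse_efficiency

print(f"implied per-core peak: {per_core_gflops:.2f} DP-GFLOP/s")
print(f"implied sparse-sparse power: {implied_power_w:.1f} W")
```

These back-of-the-envelope values are useful sanity checks when comparing Occamy's reported compute density and efficiency against the SoA processors cited in the abstract.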
About the journal
The IEEE Journal of Solid-State Circuits publishes papers each month in the broad area of solid-state circuits, with particular emphasis on transistor-level design of integrated circuits. It also covers topics that relate directly to IC design, such as circuit modeling, technology, systems design, layout, and testing. Integrated circuits and VLSI are of principal interest; material related to discrete circuit design is seldom published. Experimental verification is strongly encouraged.