SARIS: Accelerating Stencil Computations on Energy-Efficient RISC-V Compute Clusters with Indirect Stream Registers
Authors: Paul Scheffler, Luca Colagrande, Luca Benini
Published in: arXiv - CS - Mathematical Software, 2024-04-08
DOI: arxiv-2404.05303 (https://doi.org/arxiv-2404.05303)
Abstract
Stencil codes are performance-critical in many compute-intensive
applications, but suffer from significant address calculation and irregular
memory access overheads. This work presents SARIS, a general and highly
flexible methodology for stencil acceleration using register-mapped indirect
streams. We demonstrate SARIS for various stencil codes on an eight-core RISC-V
compute cluster with indirect stream registers, achieving significant speedups
of 2.72x, near-ideal FPU utilizations of 81%, and energy efficiency
improvements of 1.58x over an RV32G baseline on average. Scaling out to a
256-core manycore system, we estimate an average FPU utilization of 64%, an
average speedup of 2.14x, and up to 15% higher fractions of peak compute than a
leading GPU code generator.
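
The address-calculation overhead the abstract refers to can be illustrated with a small software sketch. It is not the SARIS implementation and uses no real stream-register API: the function names (`stencil_direct`, `indirect_stream`, `stencil_indirect`), the 5-point star stencil, and the row-major grid layout are all assumptions chosen for illustration. The first version recomputes every neighbor address at each grid point. The second describes the neighborhood once as an offset array and reads operands through a generator that emulates an indirect stream.

```python
# Hypothetical illustration only: a software emulation of the idea of
# index-driven (indirect) operand streams, not the hardware mechanism.

def stencil_direct(grid, width, height, coeffs):
    """5-point star stencil with explicit address arithmetic per operand."""
    out = [0.0] * (width * height)
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            c = y * width + x  # center address, then four neighbor addresses
            out[c] = (coeffs[0] * grid[c]
                      + coeffs[1] * grid[c - 1]
                      + coeffs[2] * grid[c + 1]
                      + coeffs[3] * grid[c - width]
                      + coeffs[4] * grid[c + width])
    return out


def indirect_stream(data, base, offsets):
    """Emulate an indirect stream: yield data[base + off] for each offset."""
    for off in offsets:
        yield data[base + off]


def stencil_indirect(grid, width, height, coeffs):
    """Same stencil, with operands fetched through an offset (index) array."""
    # The neighborhood shape is described once and reused for every point.
    offsets = [0, -1, 1, -width, width]
    out = [0.0] * (width * height)
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            base = y * width + x
            acc = 0.0
            # The compute loop only multiplies and accumulates; the
            # addressing is delegated to the emulated stream.
            for k, v in zip(coeffs, indirect_stream(grid, base, offsets)):
                acc += k * v
            out[base] = acc
    return out
```

Both functions compute the same result on interior points and leave the boundary at zero. In this emulation the index arithmetic still runs in software, so it shows only the separation of addressing from computation, not any performance benefit.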