Yan Kang, Sayan Ghosh, M. Kandemir, Andrés Marquez
Title: Impact of Write-Allocate Elimination on Fujitsu A64FX
DOI: 10.1145/3636480.3637283
Published in: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops
Publication date: 2024-01-11
Citations: 0
Abstract
ARM-based CPU architectures are currently driving massive disruptions in the High Performance Computing (HPC) community. Deployment of the 48-core Fujitsu A64FX ARM-based processor in the RIKEN “Fugaku” supercomputer (#2 in the June 2023 Top500 list) was a major inflection point in pushing ARM into mainstream HPC. A key design criterion of the Fujitsu A64FX is to enhance the throughput of modern memory-bound applications, which happen to be a dominant pattern in contemporary HPC, as opposed to traditional compute-bound or floating-point-intensive science workloads. One of the mechanisms to enhance throughput concerns write-allocate operations (e.g., streaming write operations), which are quite common in science applications. In particular, eliminating write-allocate operations (allocating a cache line on a write miss) through a special “zero fill” instruction available on ARM CPU architectures can improve overall memory bandwidth by avoiding the memory read into a cache line, which is unnecessary since the cache line will subsequently be overwritten. While the bandwidth implications are relatively straightforward to measure via synthetic benchmarks with fixed-stride memory accesses, it is important to consider irregular memory-access-driven scenarios such as graph analytics, and to analyze the impact of write-allocate elimination on diverse data-driven applications. In this paper, we examine the impact of “zero fill” on OpenMP-based multithreaded graph application scenarios (Graph500 Breadth First Search, the GAP benchmark suite, and Louvain Graph Clustering) and five application proxies from the Rodinia heterogeneous benchmark suite (molecular dynamics, sequence alignment, image processing, etc.), using LLVM-based ARM and GNU compilers on the Fujitsu FX700 A64FX platform of the Ookami system at Stony Brook University.
Our results indicate that facilitating “zero fill” through code modifications to certain critical kernels or code segments that exhibit temporal write patterns can positively impact the overall performance of a variety of applications. We observe performance variations across compilers and input data, and note end-to-end improvements of 5–20% for the benchmarks and a diverse spectrum of application scenarios owing to “zero fill” related adaptations.
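To make the mechanism concrete, the write-allocate elimination described in the abstract can be sketched in C. This is a hedged illustration, not the paper's own code: the `zero_fill` helper below is a hypothetical name, and it uses the AArch64 `DC ZVA` instruction (whose zeroing-block size is read from the `DCZID_EL0` register) to zero-fill cache blocks without first reading them from memory; on non-ARM hosts, and for unaligned head/tail regions, it falls back to a plain `memset`.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Sketch of write-allocate elimination: zero a destination buffer that
 * will later be fully overwritten, so subsequent stores hit cache lines
 * already allocated in a zeroed state instead of each write miss
 * triggering a read from memory.  On AArch64, DC ZVA zero-fills one
 * cache block without reading it; elsewhere we fall back to memset. */
static void zero_fill(void *buf, size_t len)
{
#if defined(__aarch64__)
    uint64_t dczid;
    __asm__ volatile("mrs %0, dczid_el0" : "=r"(dczid));
    if (!(dczid & (1u << 4))) {                   /* DZP clear: DC ZVA allowed */
        size_t blk = (size_t)4 << (dczid & 0xf);  /* zeroing block size, bytes */
        uintptr_t p   = (uintptr_t)buf;
        uintptr_t end = p + len;
        uintptr_t mid = (p + blk - 1) & ~(uintptr_t)(blk - 1);
        if (mid + blk <= end) {
            memset((void *)p, 0, mid - p);        /* unaligned head */
            while (mid + blk <= end) {
                __asm__ volatile("dc zva, %0" : : "r"(mid) : "memory");
                mid += blk;
            }
            memset((void *)mid, 0, end - mid);    /* unaligned tail */
            return;
        }
    }
#endif
    memset(buf, 0, len);                          /* portable fallback */
}
```

A streaming kernel that fully overwrites its output, e.g. `for (i = 0; i < n; i++) dst[i] = f(src[i]);`, could call `zero_fill(dst, n * sizeof *dst)` beforehand. In practice the compilers studied in the paper can emit such zero-fill instructions themselves for recognized streaming-store patterns; this sketch only shows what the hardware mechanism does.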