Impact of Write-Allocate Elimination on Fujitsu A64FX

Yan Kang, Sayan Ghosh, M. Kandemir, Andrés Marquez

DOI: 10.1145/3636480.3637283
Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops
Published: 2024-01-11
Citations: 0

Abstract
ARM-based CPU architectures are currently driving massive disruptions in the High Performance Computing (HPC) community. Deployment of the 48-core Fujitsu A64FX ARM-based processor in the RIKEN “Fugaku” supercomputer (#2 in the June 2023 Top500 list) was a major inflection point in pushing ARM into mainstream HPC. A key design criterion of the Fujitsu A64FX is to enhance the throughput of modern memory-bound applications, which happen to be a dominant pattern in contemporary HPC, as opposed to traditional compute-bound or floating-point-intensive science workloads. One of the mechanisms for enhancing throughput concerns write-allocate operations (e.g., streaming write operations), which are quite common in science applications. In particular, eliminating write-allocate operations (allocating a cache line on a write miss) through a special “zero fill” instruction available on ARM CPU architectures can improve overall memory bandwidth by avoiding the memory read into a cache line, which is unnecessary since the cache line will subsequently be overwritten. While the bandwidth implications are relatively straightforward to measure via synthetic benchmarks with fixed-stride memory accesses, it is important to consider irregular memory-access-driven scenarios such as graph analytics, and to analyze the impact of write-allocate elimination on diverse data-driven applications.

In this paper, we examine the impact of “zero fill” on OpenMP-based multithreaded graph application scenarios (Graph500 Breadth-First Search, the GAP benchmark suite, and Louvain graph clustering) and five application proxies from the Rodinia heterogeneous benchmark suite (molecular dynamics, sequence alignment, image processing, etc.), using LLVM-based ARM and GNU compilers on the Fujitsu FX700 A64FX platform of the Ookami system at Stony Brook University. Our results indicate that facilitating “zero fill” through code modifications to certain critical kernels or code segments that exhibit temporal write patterns can positively impact the overall performance of a variety of applications. We observe performance variations across compilers and input data, and note end-to-end improvements of 5–20% for the benchmarks and a diverse spectrum of application scenarios owing to “zero fill” related adaptations.