Accelerating Lattice QCD Simulations using GPUs

Tilmann Matthaei
{"title":"Accelerating Lattice QCD Simulations using GPUs","authors":"Tilmann Matthaei","doi":"arxiv-2407.00041","DOIUrl":null,"url":null,"abstract":"Solving discretized versions of the Dirac equation represents a large share\nof execution time in lattice Quantum Chromodynamics (QCD) simulations. Many\nhigh-performance computing (HPC) clusters use graphics processing units (GPUs)\nto offer more computational resources. Our solver program, DDalphaAMG,\npreviously was unable to fully take advantage of GPUs to accelerate its\ncomputations. Making use of GPUs for DDalphaAMG is an ongoing development, and\nwe will present some current progress herein. Through a detailed description of\nour development, this thesis should offer valuable insights into using GPUs to\naccelerate a memory-bound CPU implementation. We developed a storage scheme for multiple tuples, which allows much more\nefficient memory access on GPUs, given that the element at the same index is\nread from multiple tuples simultaneously. Still, our implementation of a\ndiscrete Dirac operator is memory-bound, and we only achieved improvements for\nlarge linear systems on few nodes at the JUWELS cluster. These improvements do\nnot currently overcome additional introduced overheads. However, the results\nfor the application of the Wilson-Dirac operator show a speedup of around 3 for\nlarge lattices. If the additional overheads can be eliminated in the future,\nGPUs could reduce the DDalphaAMG execution time significantly for large\nlattices. We also found that a previous publication on the GPU acceleration of\nDDalphaAMG, underrepresented the achieved speedup, because small lattices were\nused. This further highlights that GPUs often require large-scale problems to\nsolve in order to be faster than CPUs","PeriodicalId":501191,"journal":{"name":"arXiv - PHYS - High Energy Physics - Lattice","volume":"133 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - PHYS - High Energy Physics - Lattice","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2407.00041","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Solving discretized versions of the Dirac equation represents a large share of the execution time in lattice Quantum Chromodynamics (QCD) simulations. Many high-performance computing (HPC) clusters use graphics processing units (GPUs) to offer more computational resources. Our solver program, DDalphaAMG, was previously unable to take full advantage of GPUs to accelerate its computations. Making use of GPUs for DDalphaAMG is an ongoing development, and we present some of the current progress herein. Through a detailed description of our development, this thesis should offer valuable insights into using GPUs to accelerate a memory-bound CPU implementation. We developed a storage scheme for multiple tuples that allows much more efficient memory access on GPUs, given that the element at the same index is read from multiple tuples simultaneously. Still, our implementation of a discrete Dirac operator is memory-bound, and we only achieved improvements for large linear systems on a few nodes of the JUWELS cluster. These improvements do not currently overcome the additional overheads that were introduced. However, the results for the application of the Wilson-Dirac operator show a speedup of around 3 for large lattices. If the additional overheads can be eliminated in the future, GPUs could reduce the DDalphaAMG execution time significantly for large lattices. We also found that a previous publication on the GPU acceleration of DDalphaAMG underrepresented the achieved speedup because small lattices were used. This further highlights that GPUs often require large-scale problems in order to be faster than CPUs.
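For reference, the Wilson-Dirac operator mentioned above is, in its standard form with Wilson parameter r = 1 (the thesis may use a different convention or normalization), the stencil

(D_W \psi)(x) = \left(m_0 + \frac{4}{a}\right)\psi(x) - \frac{1}{2a}\sum_{\mu=1}^{4}\Big[(1-\gamma_\mu)\,U_\mu(x)\,\psi(x+a\hat{\mu}) + (1+\gamma_\mu)\,U_\mu^\dagger(x-a\hat{\mu})\,\psi(x-a\hat{\mu})\Big],

where U_\mu(x) are the gauge links, \gamma_\mu the Dirac matrices, a the lattice spacing and m_0 the bare mass. Because each lattice site only requires a few arithmetic operations per loaded gauge link and spinor component, the operator is memory-bound, which is why the storage layout matters.

The abstract does not spell out the DDalphaAMG storage scheme itself; the following is only a minimal CUDA sketch of the general idea it alludes to: if every thread processes one tuple and all threads read the element at the same index k at the same time, an interleaved (structure-of-arrays-style) layout turns those reads into contiguous, coalesced memory transactions. All names and sizes here (sum_aos, sum_soa, N_TUPLES, TUPLE_LEN) are made up for illustration and are not taken from the thesis.

#include <cuda_runtime.h>
#include <cstdio>

#define N_TUPLES  (1 << 20)   // number of tuples (e.g., lattice sites); illustrative size
#define TUPLE_LEN 12          // elements per tuple (e.g., a 12-component spinor)

// Array-of-structures layout: tuple t occupies elements [t*TUPLE_LEN, (t+1)*TUPLE_LEN).
// Neighbouring threads then read addresses TUPLE_LEN apart -> strided, uncoalesced loads.
__global__ void sum_aos(const float *in, float *out)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= N_TUPLES) return;
    float acc = 0.0f;
    for (int k = 0; k < TUPLE_LEN; ++k)
        acc += in[t * TUPLE_LEN + k];
    out[t] = acc;
}

// Interleaved (structure-of-arrays) layout: element k of all tuples is stored contiguously.
// Neighbouring threads read neighbouring addresses -> coalesced loads.
__global__ void sum_soa(const float *in, float *out)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= N_TUPLES) return;
    float acc = 0.0f;
    for (int k = 0; k < TUPLE_LEN; ++k)
        acc += in[k * N_TUPLES + t];
    out[t] = acc;
}

int main()
{
    float *in, *out;
    cudaMalloc((void **)&in,  (size_t)N_TUPLES * TUPLE_LEN * sizeof(float));
    cudaMalloc((void **)&out, (size_t)N_TUPLES * sizeof(float));
    cudaMemset(in, 0, (size_t)N_TUPLES * TUPLE_LEN * sizeof(float));

    int block = 256, grid = (N_TUPLES + block - 1) / block;
    sum_aos<<<grid, block>>>(in, out);   // strided access pattern
    sum_soa<<<grid, block>>>(in, out);   // coalesced access pattern
    cudaDeviceSynchronize();
    printf("done\n");

    cudaFree(in);
    cudaFree(out);
    return 0;
}

On a memory-bound kernel such as this, the coalesced variant typically approaches the device's peak memory bandwidth, while the strided variant wastes most of each cache line, which is consistent with the abstract's observation that the layout change only pays off once the problem (lattice) is large enough to keep the GPU busy.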