
International Journal of High Performance Computing Applications: Latest Articles

Myths and legends in high-performance computing
IF 3.1 Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2023-01-06 DOI: 10.1177/10943420231166608
S. Matsuoka, Jens Domke, M. Wahib, Aleksandr Drozd, T. Hoefler
In this thought-provoking article, we discuss certain myths and legends that are folklore among members of the high-performance computing community. We gathered these myths from conversations at conferences and meetings, product advertisements, papers, and other communications such as tweets, blogs, and news articles within and beyond our community. We believe they represent the zeitgeist of the current era of massive change, driven by the end of many scaling laws such as Dennard scaling and Moore’s law. While some laws end, new directions are emerging, such as algorithmic scaling or novel architecture research. Nevertheless, these myths are rarely based on scientific facts, but rather on some evidence or argumentation. In fact, we believe that this is the very reason for the existence of many myths and why they cannot be answered clearly. While it feels like there should be clear answers for each, some may remain endless philosophical debates, such as whether Beethoven was better than Mozart. We would like to see our collection of myths as a discussion of possible new directions for research and industry investment.
Volume: 37 1 Pages: 245-259
Citations: 5
Mixed precision LU factorization on GPU tensor cores: reducing data movement and memory footprint
IF 3.1 Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2023-01-03 DOI: 10.1177/10943420221136848
Florent Lopez, Théo Mary
Modern GPUs equipped with mixed precision tensor core units present great potential to accelerate dense linear algebra operations such as LU factorization. However, state-of-the-art mixed half/single precision LU factorization algorithms all require the matrix to be stored in single precision, leading to expensive data movement and storage costs. This is explained by the fact that simply switching the storage precision from single to half leads to significant loss of accuracy, forfeiting all accuracy benefits from using tensor core technology. In this article, we propose a new factorization algorithm that is able to store the matrix in half precision without incurring any significant loss of accuracy. Our approach is based on a left-looking scheme employing single precision buffers of controlled size and a mixed precision doubly partitioned algorithm exploiting tensor cores in the panel factorizations. Our numerical results show that compared with the state of the art, the proposed approach is of similar accuracy but with only half the data movement and memory footprint, and hence potentially much faster: it achieves up to 2× and 3.5× speedups on V100 and A100 GPUs, respectively.
Volume: 37 1 Pages: 165-179
Citations: 9
High Performance Computing: 38th International Conference, ISC High Performance 2023, Hamburg, Germany, May 21–25, 2023, Proceedings
IF 3.1 Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2023-01-01 DOI: 10.1007/978-3-031-32041-5
Volume: 14 1
Citations: 0
Special issue: Introduction
IF 3.1 Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2023-01-01 DOI: 10.1177/10943420221150081
M. Parsons
The COVID pandemic has changed all of our lives and continues to do so. The prizes recognise outstanding research achievement toward the understanding of the COVID-19 pandemic through the use of high-performance computing. The winning paper, entitled 'Digital transformation of droplet/aerosol infection risk assessment realised on "Fugaku" for the fight against COVID-19', was submitted by a team from the RIKEN Center for Computational Science in Japan. [Extracted from the article]
Volume: 37 1 Pages: 3-3
Citations: 0
Performance comparison of the A-grid and C-grid shallow-water models on icosahedral grids
IF 3.1 Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2022-11-15 DOI: 10.1177/10943420221139509
J. Middlecoff, Yonggang G. Yu, M. Govett
This study uses a single software framework to compare the CPU performance of Arakawa A-grid (NICAM) and C-grid (MPAS) schemes for solving the shallow-water equations on icosahedral grids. The focus is on high-resolution weather prediction. Performance analysis shows that the simpler structure of the A-grid equations enables compiler optimization-based efficiency gains that the C-grid equations cannot match. Strong scaling runs at 3.5 km resolution show the A-grid is three times faster than the C-grid, enabling the A-grid to run at 50% higher resolution in only 15% more time. A performance comparison with the MPAS shallow-water model is included; it demonstrates that our software implementation of the C-grid is robust and that the comparisons are fair.
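The trade-off between speed and resolution quoted above can be checked with back-of-the-envelope arithmetic. The sketch below assumes, as is common for explicit two-dimensional models, that cost grows with the cube of the resolution factor (two spatial dimensions plus a proportionally shorter time step). That cost model and the function name `relative_runtime` are assumptions, not taken from the article.

```python
def relative_runtime(resolution_factor, speedup, cost_exponent=3.0):
    """Runtime of the faster scheme at higher resolution, relative to the
    slower scheme at the base resolution (assumed power-law cost model)."""
    return resolution_factor ** cost_exponent / speedup


# A scheme that is 3x faster, run at 1.5x the resolution.
ratio = relative_runtime(1.5, 3.0)
print(f"{(ratio - 1.0) * 100:.1f}% more time")  # prints "12.5% more time"
```

Under this simplified model the extra cost comes out at about 12.5%, in the same range as the 15% reported in the abstract. The difference is plausible given that real runs also include communication and other costs that the cubic model ignores.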
Volume: 37 1 Pages: 197-208
Citations: 0
Acceleration of a parallel BDDC solver by using graphics processing units on subdomains
IF 3.1 Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2022-11-05 DOI: 10.1177/10943420221136873
J. Šístek, T. Oberhuber
An approach to accelerating a parallel domain decomposition (DD) solver by graphics processing units (GPUs) is investigated. The solver is based on the Balancing Domain Decomposition Method by Constraints (BDDC), which is a nonoverlapping DD technique. Two kinds of local matrices are required by BDDC. First, dense matrices corresponding to local Schur complements of interior unknowns are constructed by the sparse direct solver. These are further used as part of the local saddle-point problems within BDDC. In the next step, the local matrices are copied to GPUs. Repeated multiplications of local vectors with the dense matrix of the Schur complement are performed for each subdomain. In addition, factorizations and backsubstitutions with the dense saddle-point subdomain matrices are also performed on GPUs. Detailed times of main components of the algorithm are measured on a benchmark Poisson problem. The method is also applied to an unsteady problem of incompressible flow, where the Krylov subspace iterations are performed repeatedly in each time step. The results demonstrate the potential of the approach to speed up realistic simulations up to 5 times with a preference towards large subdomains.
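The dense local Schur complement mentioned in this abstract can be sketched on a tiny example. The code below forms S = A_BB - A_BI A_II^{-1} A_IB for one subdomain of a 1D Poisson matrix and applies it to a vector, which is the kind of dense product the abstract says is repeated on the GPU. It is plain Python with a small dense elimination standing in for the sparse direct solver; all function names and the example matrix are assumptions for illustration.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (small dense case)."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(m[r][k]))
        m[k], m[p] = m[p], m[k]
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= f * m[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def schur_complement(a, interior, boundary):
    """Dense S = A_BB - A_BI * inv(A_II) * A_IB, eliminating the interior unknowns."""
    a_ii = [[a[i][j] for j in interior] for i in interior]
    # One interior solve per boundary column: cols[q] = inv(A_II) * A_IB[:, q]
    cols = [solve(a_ii, [a[i][b] for i in interior]) for b in boundary]
    return [[a[bi][bj] - sum(a[bi][i] * cols[q][p] for p, i in enumerate(interior))
             for q, bj in enumerate(boundary)]
            for bi in boundary]


def matvec(s, x):
    """Dense matrix-vector product, repeated in every Krylov iteration."""
    return [sum(sij * xj for sij, xj in zip(row, x)) for row in s]


# 1D Poisson matrix tridiag(-1, 2, -1) with 5 unknowns; the two end
# unknowns play the role of the subdomain interface.
n = 5
a = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
s = schur_complement(a, interior=[1, 2, 3], boundary=[0, 4])
print(s)
print(matvec(s, [1.0, 1.0]))
```

For this example the exact Schur complement is [[1.25, -0.25], [-0.25, 1.25]]. Once S is formed it is small and dense, so applying it is a regular dense operation that maps well onto accelerators.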
Volume: 37 1 Pages: 151-164
Citations: 1
End-to-end GPU acceleration of low-order-refined preconditioning for high-order finite element discretizations
IF 3.1 Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2022-10-21 DOI: 10.1177/10943420231175462
Will Pazner, T. Kolev, Jean-Sylvain Camier
In this article, we present algorithms and implementations for the end-to-end GPU acceleration of matrix-free low-order-refined preconditioning of high-order finite element problems. The methods described here allow for the construction of effective preconditioners for high-order problems with optimal memory usage and computational complexity. The preconditioners are based on the construction of a spectrally equivalent low-order discretization on a refined mesh, which is then amenable to, for example, algebraic multigrid preconditioning. The constants of equivalence are independent of mesh size and polynomial degree. For vector finite element problems in H(curl) and H(div) (e.g., for electromagnetic or radiation diffusion problems), a specially constructed interpolation–histopolation basis is used to ensure fast convergence. Detailed performance studies are carried out to analyze the efficiency of the GPU algorithms. The kernel throughput of each of the main algorithmic components is measured, and the strong and weak parallel scalability of the methods is demonstrated. The different relative weighting and significance of the algorithmic components on GPUs and CPUs are discussed. Results on problems involving adaptively refined nonconforming meshes are shown, and the use of the preconditioners on a large-scale magnetic diffusion problem using all spaces of the finite element de Rham complex is illustrated.
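The idea of a low-order discretization on a refined mesh can be sketched in one dimension. The code below places the Gauss-Lobatto nodes of each degree-p element on the interval, treats them as the vertices of a refined mesh, and assembles the piecewise-linear stiffness matrix on that mesh. This is a hedged illustration only: the function names are hypothetical, the node table covers degrees 1 to 4, and nothing here reproduces the article's GPU kernels or its H(curl)/H(div) bases.

```python
import math

# Gauss-Lobatto nodes on [-1, 1] for polynomial degrees 1 to 4.
GLL_NODES = {
    1: [-1.0, 1.0],
    2: [-1.0, 0.0, 1.0],
    3: [-1.0, -1.0 / math.sqrt(5.0), 1.0 / math.sqrt(5.0), 1.0],
    4: [-1.0, -math.sqrt(3.0 / 7.0), 0.0, math.sqrt(3.0 / 7.0), 1.0],
}


def lor_mesh(num_elements, degree, left=0.0, right=1.0):
    """Vertices of the refined mesh: the high-order nodes of every element."""
    h = (right - left) / num_elements
    ref = GLL_NODES[degree]
    verts = []
    for e in range(num_elements):
        x0 = left + e * h
        start = 0 if e == 0 else 1  # skip the vertex shared with the previous element
        for xi in ref[start:]:
            verts.append(x0 + 0.5 * h * (xi + 1.0))
    return verts


def lor_stiffness(verts):
    """Piecewise-linear (low-order) stiffness matrix for -u'' on the refined mesh."""
    n = len(verts)
    k = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        w = 1.0 / (verts[i + 1] - verts[i])  # 1/h for this sub-element
        k[i][i] += w
        k[i + 1][i + 1] += w
        k[i][i + 1] -= w
        k[i + 1][i] -= w
    return k


verts = lor_mesh(num_elements=2, degree=3)
k = lor_stiffness(verts)
print(len(verts), "vertices;", "matrix is tridiagonal and sparse")
```

The resulting matrix has the same number of unknowns as the high-order discretization but only low-order sparsity, which is what makes it a convenient target for algebraic multigrid.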
Volume: 37 1 Pages: 578-599
Citations: 1
Exploiting temporal data reuse and asynchrony in the reverse time migration
IF 3.1 Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2022-10-03 DOI: 10.1177/10943420221128529
L. Qu, Rached Abdelkhalak, H. Ltaief, Issam Said, D. Keyes
Reverse Time Migration (RTM) is a state-of-the-art algorithm used in seismic depth imaging in complex geological environments for the oil and gas exploration industry. It calculates high-resolution images by solving the three-dimensional acoustic wave equation using seismic datasets recorded at various receiver locations. Reverse Time Migration’s computational phases consist mainly of stencil kernels for the finite-difference time-domain scheme, the application of the absorbing boundary conditions, and the I/O operations needed for the imaging condition. In this paper, we integrate the asynchronous Multicore Wavefront Diamond (MWD) tiling approach into the full RTM workflow. Multicore Wavefront Diamond further increases data reuse by combining spatial blocking with Temporal Blocking (TB) during the stencil computations. This integration engenders new challenges with a snowball effect on the legacy synchronous RTM workflow, as it requires rethinking how the absorbing boundary conditions, the I/O operations, and the imaging condition operate. These disruptive changes are necessary to maintain the performance superiority of asynchronous stencil execution throughout the time integration, while ensuring the quality of the subsurface image does not deteriorate. We assess the overall performance of the new MWD-based RTM and compare it against traditional Spatial Blocking (SB)-based RTM on various shared-memory systems using the SEG Salt3D model. The MWD-based RTM achieves up to a 70% speedup over SB-based RTM. To our knowledge, this paper highlights for the first time the applicability of asynchronous executions with temporal blocking throughout the whole RTM. This may eventually create new research opportunities in improving hydrocarbon extraction for the petroleum industry.
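Temporal blocking, the data-reuse idea behind the tiling described above, can be sketched on a one-dimensional stencil. The code below is not the Multicore Wavefront Diamond scheme: it swaps in the simpler overlapped tiling, where each tile is loaded once with a halo of `depth` cells and advanced `depth` time steps before the next tile is touched. A 3-point single-level stencil stands in for the wave-equation kernel, and all names are assumptions for illustration.

```python
def step(u, c):
    """One sweep of a 3-point stencil; the two end values are held fixed."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = u[i] + c * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return new


def naive(u, c, steps):
    """Reference: sweep the whole array once per time step."""
    for _ in range(steps):
        u = step(u, c)
    return u


def temporally_blocked(u, c, steps, tile, depth):
    """Advance `depth` time steps per tile using overlapped (halo) tiles."""
    n = len(u)
    done = 0
    while done < steps:
        t = min(depth, steps - done)
        out = u[:]
        for lo in range(0, n, tile):
            hi = min(lo + tile, n)
            a = max(lo - t, 0)          # halo of t cells on each side,
            b = min(hi + t, n)          # clipped at the domain ends
            local = u[a:b]              # tile is loaded once ...
            for _ in range(t):          # ... and reused for t time steps
                local = step(local, c)
            # Stale halo values contaminate one more cell per step, so after
            # t steps exactly the cells in [lo, hi) are still correct.
            out[lo:hi] = local[lo - a:hi - a]
        u = out
        done += t
    return u


u0 = [float((i * 7) % 5) for i in range(23)]
reference = naive(u0, 0.25, 9)
blocked = temporally_blocked(u0, 0.25, 9, tile=4, depth=3)
print(max(abs(p - q) for p, q in zip(reference, blocked)))
```

Overlapped tiling pays for its simplicity with redundant work in the halos. Diamond-shaped tiles, as used by MWD, avoid that redundancy but require the tiles to be executed in a dependency-respecting order, which is where the asynchrony discussed in the abstract comes in.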
Volume: 37 1 Pages: 132-150
Citations: 2
PeleC: An adaptive mesh refinement solver for compressible reacting flows
IF 3.1 Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2022-09-06 DOI: 10.1177/10943420221121151
M. T. Henry de Frahan, Jonathan S. Rood, M. Day, H. Sitaraman, S. Yellapantula, Bruce A. Perry, R. Grout, A. Almgren, Weiqun Zhang, J. Bell, Jacqueline H. Chen
Reacting flow simulations for combustion applications require extensive computing capabilities. Leveraging the AMReX library, the Pele suite of combustion simulation tools targets the largest supercomputers available and future exascale machines. We introduce PeleC, the compressible solver in the Pele suite, and detail its capabilities, including complex geometry representation, chemistry integration, and discretization. We present a comparison of development efforts using both OpenACC and AMReX’s C++ performance portability framework for execution on multiple GPU architectures. We discuss relevant details that have allowed PeleC to achieve high performance and scalability. PeleC’s performance characteristics are measured through relevant simulations on multiple supercomputers. The success of PeleC’s design for exascale is exhibited through demonstration of a 160 billion cell simulation and weak scaling onto 100% of Summit, an NVIDIA-based GPU supercomputer at Oak Ridge National Laboratory. Our results provide confidence that PeleC will enable future combustion science simulations with unprecedented fidelity.
Volume: 37 1 Pages: 115-131
Citations: 16
Enabling efficient execution of a variational data assimilation application
IF 3.1 Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2022-08-28 DOI: 10.1177/10943420221119801
J. Dennis, A. Baker, B. Dobbins, M. Bell, Jian Sun, Youngsung Kim, Ting-Yu Cha
Remote sensing observational instruments are critical for better understanding and predicting severe weather. Observational data from such instruments, for example Doppler radar data, are often processed for assimilation into numerical weather prediction models. As such instruments become more sophisticated, the amount of data to be processed grows and requires efficient variational analysis tools. Here we examine the code that implements the popular SAMURAI (Spline Analysis at Mesoscale Utilizing Radar and Aircraft Instrumentation) technique for estimating the atmospheric state for a given set of observations. We employ a number of techniques to significantly improve the code’s performance, including porting it to run on standard HPC clusters, analyzing and optimizing its single-node performance, implementing a more efficient nonlinear optimization method, and enabling the use of GPUs via OpenACC. Our efforts thus far have yielded more than a 100x improvement over the original code on large test problems of interest to the community.
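A variational analysis of the kind referred to above minimizes a cost function that balances a background estimate against observations. The sketch below uses the generic textbook form with diagonal error variances and direct observation of the state. It is not SAMURAI's spline-based formulation or its optimizer, and the names `cost`, `gradient`, `minimize`, and `analysis` are hypothetical.

```python
def cost(x, xb, y, var_b, var_o):
    """J(x) = 1/2 sum (x - xb)^2 / var_b + 1/2 sum (x - y)^2 / var_o."""
    return 0.5 * sum((xi - bi) ** 2 / var_b + (xi - yi) ** 2 / var_o
                     for xi, bi, yi in zip(x, xb, y))


def gradient(x, xb, y, var_b, var_o):
    """Gradient of J with respect to each state component."""
    return [(xi - bi) / var_b + (xi - yi) / var_o
            for xi, bi, yi in zip(x, xb, y)]


def minimize(xb, y, var_b, var_o, iters=100):
    """Fixed-step steepest descent, starting from the background state."""
    step = 0.5 / (1.0 / var_b + 1.0 / var_o)  # half the exact Newton step
    x = list(xb)
    for _ in range(iters):
        g = gradient(x, xb, y, var_b, var_o)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


def analysis(xb, y, var_b, var_o):
    """Closed-form minimizer for this simple case: a precision-weighted mean."""
    wb, wo = 1.0 / var_b, 1.0 / var_o
    return [(wb * bi + wo * yi) / (wb + wo) for bi, yi in zip(xb, y)]


background = [10.0, 20.0]    # prior estimate of the state
observations = [12.0, 19.0]  # observed values
xa = minimize(background, observations, var_b=4.0, var_o=1.0)
print(xa, analysis(background, observations, var_b=4.0, var_o=1.0))
```

With a background variance four times the observation variance, the analysis lands closer to the observations (11.6 for the first component). Real applications replace the fixed-step descent with conjugate-gradient or quasi-Newton methods, which is the kind of optimizer change the abstract mentions.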
International Journal of High Performance Computing Applications, vol. 37, no. 1, pp. 101-114. https://doi.org/10.1177/10943420221119801
Citations: 0