An efficient halo approach for Euler-Lagrange simulations based on MPI-3 shared memory

Patrick Kopper, M. Pfeiffer, S. Copplestone, A. Beck
{"title":"An efficient halo approach for Euler-Lagrange simulations based on MPI-3 shared memory","authors":"Patrick Kopper, M. Pfeiffer, S. Copplestone, A. Beck","doi":"10.1145/3440722.3440904","DOIUrl":null,"url":null,"abstract":"Euler-Lagrange methods are a common approach for simulation of dispersed particle-laden flow, e.g. in turbomachinery. In this approach, the fluid is treated as continuous phase with an Eulerian field solver whereas the Lagrangian movement of the dispersed phase is described through the equations of motion for each individual particle. In high-performance computing, the load of the fluid phase is only dependent on the degrees of freedom and load-balancing steps can be taken a priori, thereby ensuring optimal scaling. However, the discrete phase introduces local load imbalances that cannot easily predicted as generally neither the spatial particle distribution nor the computational cost for advancing particles in relation to the fluid integration are know a priori. Runtime load balancing alleviates this problem by adjusting the local load on each processor according to information gathered during the simulation [4]. Since the load balancing step becomes part of the simulation time, its performance and appropriate scaling on modern HPC systems becomes of crucial importance. In this talk, we will first present the FLEXI framework for the Euler-Lagrange system, and follow by introducing the previous approach and highlight its difficulties. FLEXI is a high-order accurate, massively parallel CFD framework based on the Discontinuous Galerkin Spectral Element Method (DGSEM). It has shown excellent scaling properties for the fluid phase and was recently extended by particle tracking capabilities [1], developed together with the PICLas framework [2]. In FLEXI, the mesh is saved in the HDF5 format, allowing for parallel access, with the elements presorted along a space-filling curve (SFC). This approach has shown its suitability for fluid simulations as each processor requires and accesses only the local mesh information, thereby reducing I/O on the underlying file system [3]. However, the particle phase needs additional information around the fluid domain to retain high computational efficiency since particles can cross the local domain boundary at any point during a time step. In previous implementations, this “halo region” information was communicated between each individual processor, causing significant CPU and network load for an extended period of time during initialization and each load balancing step. Therefore, we propose an method developed from scratch utilizing modern MPI calls and able to overcome most of the challenges in the previous approach. This reworked method utilizes MPI-3 shared memory to make mesh information available to all processors on a compute-node. We perform a two-step, communication-free identification of all relevant mesh elements for a compute-node. Furthermore, by making the mesh information accessible to all processors sharing local memory, we eliminate redundant calculations and reduce data duplication. 
We conclude by presenting examples of large scale computations of particle-laden flows in complex turbomachinery systems and give an outlook on the next research challenges.","PeriodicalId":183674,"journal":{"name":"The International Conference on High Performance Computing in Asia-Pacific Region Companion","volume":"102 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"The International Conference on High Performance Computing in Asia-Pacific Region Companion","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3440722.3440904","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

Euler-Lagrange methods are a common approach for the simulation of dispersed particle-laden flows, e.g. in turbomachinery. In this approach, the fluid is treated as a continuous phase with an Eulerian field solver, whereas the Lagrangian movement of the dispersed phase is described through the equations of motion for each individual particle. In high-performance computing, the load of the fluid phase depends only on the degrees of freedom, and load-balancing steps can be taken a priori, thereby ensuring optimal scaling. However, the discrete phase introduces local load imbalances that cannot easily be predicted, as generally neither the spatial particle distribution nor the computational cost of advancing particles relative to the fluid integration is known a priori. Runtime load balancing alleviates this problem by adjusting the local load on each processor according to information gathered during the simulation [4]. Since the load balancing step becomes part of the simulation time, its performance and appropriate scaling on modern HPC systems are of crucial importance.

In this talk, we will first present the FLEXI framework for the Euler-Lagrange system, then introduce the previous approach and highlight its difficulties. FLEXI is a high-order accurate, massively parallel CFD framework based on the Discontinuous Galerkin Spectral Element Method (DGSEM). It has shown excellent scaling properties for the fluid phase and was recently extended with particle tracking capabilities [1], developed together with the PICLas framework [2]. In FLEXI, the mesh is stored in the HDF5 format, allowing for parallel access, with the elements presorted along a space-filling curve (SFC). This approach has proven suitable for fluid simulations, as each processor requires and accesses only the local mesh information, thereby reducing I/O on the underlying file system [3]. However, the particle phase needs additional information around the fluid domain to retain high computational efficiency, since particles can cross the local domain boundary at any point during a time step. In previous implementations, this “halo region” information was communicated between all individual processors, causing significant CPU and network load for an extended period during initialization and each load balancing step.

Therefore, we propose a method developed from scratch that utilizes modern MPI calls and overcomes most of the challenges of the previous approach. The reworked method uses MPI-3 shared memory to make mesh information available to all processors on a compute node. We perform a two-step, communication-free identification of all mesh elements relevant to a compute node. Furthermore, by making the mesh information accessible to all processors sharing local memory, we eliminate redundant calculations and reduce data duplication. We conclude by presenting examples of large-scale computations of particle-laden flows in complex turbomachinery systems and give an outlook on the next research challenges.
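To illustrate the core mechanism, the following is a minimal C sketch of how mesh data can be placed in an MPI-3 shared-memory window so that every rank on a compute node reads it without explicit communication. The variable names (nElems, ElemBary) and the fence-based synchronization are illustrative assumptions, not FLEXI's actual implementation.

```c
/* Minimal sketch (not FLEXI's actual code): exposing mesh data to all
 * ranks of one compute node through an MPI-3 shared-memory window. */
#include <mpi.h>
#include <stddef.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Split the world communicator into one communicator per
     * shared-memory domain (i.e. per compute node). */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* Only the node-local root allocates the segment; the other ranks
     * request zero bytes and query the root's base address instead. */
    const int nElems = 1000;  /* illustrative element count */
    MPI_Aint size = (node_rank == 0)
                  ? (MPI_Aint)nElems * 3 * sizeof(double) : 0;
    double *ElemBary = NULL;  /* 3 barycenter coordinates per element */
    MPI_Win win;
    MPI_Win_allocate_shared(size, sizeof(double), MPI_INFO_NULL,
                            node_comm, &ElemBary, &win);

    if (node_rank != 0) {
        /* Map rank 0's segment into this rank's address space. */
        MPI_Aint qsize;
        int disp;
        MPI_Win_shared_query(win, 0, &qsize, &disp, &ElemBary);
    }

    /* The root fills the segment once (e.g. read from the HDF5 mesh);
     * the fences make the data visible node-wide without any message. */
    MPI_Win_fence(0, win);
    if (node_rank == 0)
        for (int i = 0; i < 3 * nElems; ++i)
            ElemBary[i] = 0.0;
    MPI_Win_fence(0, win);

    /* ... every rank on the node may now read ElemBary directly ... */

    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```

Since only one copy of the mesh exists per node, memory consumption and redundant initialization work shrink with the number of ranks per node, which is the data-duplication saving described above.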
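The two-step, communication-free identification could, for instance, be realized with a bounding-box criterion: first grow the compute node's bounding box by a halo distance, then flag every element of the node-resident mesh that falls inside it. The sketch below assumes this simple barycenter-in-box test purely for illustration; the paper's actual identification criteria may differ.

```c
/* Hedged sketch of a communication-free halo-element search using an
 * assumed bounding-box criterion; not the paper's exact algorithm. */
#include <stdbool.h>
#include <stdio.h>

typedef struct { double lo[3], hi[3]; } BBox;

/* Step 1: extend the compute-node bounding box by the halo distance. */
static BBox grow_box(BBox b, double halo_dist)
{
    for (int d = 0; d < 3; ++d) {
        b.lo[d] -= halo_dist;
        b.hi[d] += halo_dist;
    }
    return b;
}

/* Step 2: flag every element whose barycenter lies inside the grown
 * box. Since the full element list sits in node-local shared memory,
 * this loop needs no communication at all. */
static void flag_halo_elems(const double *ElemBary, int nElems,
                            BBox box, bool *isHalo)
{
    for (int e = 0; e < nElems; ++e) {
        bool in = true;
        for (int d = 0; d < 3; ++d) {
            double x = ElemBary[3 * e + d];
            in = in && x >= box.lo[d] && x <= box.hi[d];
        }
        isHalo[e] = in;
    }
}

int main(void)
{
    /* Two toy elements: the first lies inside the grown node box,
     * the second far outside it. */
    double bary[6] = { 0.5, 0.5, 0.5,  5.0, 5.0, 5.0 };
    bool isHalo[2];
    BBox node_box = { {0.0, 0.0, 0.0}, {1.0, 1.0, 1.0} };

    flag_halo_elems(bary, 2, grow_box(node_box, 0.25), isHalo);
    printf("elem 0: %d, elem 1: %d\n", isHalo[0], isHalo[1]); /* 1 0 */
    return 0;
}
```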