
Latest publications: The International Conference on High Performance Computing in Asia-Pacific Region Companion

Advantages of Space-Time Finite Elements for Domains with Time Varying Topology
N. Hosters, Maximilian von Danwitz, Patrick Antony, M. Behr
ACM Reference Format: Norbert Hosters, Maximilian von Danwitz, Patrick Antony, and Marek Behr. 2021. Advantages of Space-Time Finite Elements for Domains with Time Varying Topology. In The International Conference on High Performance Computing in Asia-Pacific Region Companion (HPC Asia 2021 Companion), January 20–22, 2021, Virtual Event, Republic of Korea. ACM, New York, NY, USA, 2 pages. https://doi.org/10.1145/3440722.3440907
{"title":"Advantages of Space-Time Finite Elements for Domains with Time Varying Topology","authors":"N. Hosters, Maximilian von Danwitz, Patrick Antony, M. Behr","doi":"10.1145/3440722.3440907","DOIUrl":"https://doi.org/10.1145/3440722.3440907","url":null,"abstract":"ACM Reference Format: Norbert Hosters, Maximilian von Danwitz, Patrick Antony, and Marek Behr. 2021. Advantages of Space-Time Finite Elements for Domains with Time Varying Topology. In The International Conference on High Performance Computing in Asia-Pacific Region Companion (HPC Asia 2021 Companion), January 20–22, 2021, Virtual Event, Republic of Korea. ACM, New York, NY, USA, 2 pages. https://doi.org/10.1145/3440722.3440907","PeriodicalId":183674,"journal":{"name":"The International Conference on High Performance Computing in Asia-Pacific Region Companion","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126322838","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Single-Precision Calculation of Iterative Refinement of Eigenpairs of a Real Symmetric-Definite Generalized Eigenproblem by Using a Filter Composed of a Single Resolvent
H. Murakami
By using a filter, we calculate approximate eigenpairs of a real symmetric-definite generalized eigenproblem Av = λBv whose eigenvalues are in a specified interval. In the experiments in this paper, the IEEE-754 single-precision floating-point (binary32) number system is used for the calculations. In general, a filter is constructed from several resolvents with different shifts ρ. For a given vector x, the action of a resolvent is given by solving a system of linear equations C(ρ)y = Bx for y, where the coefficient matrix C(ρ) = A − ρB is symmetric. We assume that this system of linear equations is solved by a matrix factorization of C(ρ), for example by the modified Cholesky method (LDLT decomposition). When both matrices A and B are banded, C(ρ) is also banded, and the modified Cholesky method for banded systems can be used to solve the system of linear equations. The filter we use is either a polynomial of a resolvent with a real shift, or a polynomial of the imaginary part of a resolvent with an imaginary shift. We use only a single resolvent to construct the filter in order to reduce both the amount of computation needed to factor matrices and, especially, the storage needed to hold the factors. The main disadvantage of using only a single resolvent rather than many is that such a filter has poor properties, especially when the computation is carried out in single precision. Therefore, the required approximate eigenpairs are not obtained with good accuracy if they are extracted from the set of vectors produced by a single application of combined B-orthonormalization and filtering to a set of initial random vectors. However, experiments show that the required approximate eigenpairs are refined well if they are extracted from the set of vectors obtained by a few applications of combined B-orthonormalization and filtering to a set of initial random vectors.
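A minimal sketch of the filtering idea described above, with SciPy standing in for the banded modified Cholesky (LDLT) routine the paper relies on: the resolvent C(ρ)⁻¹B is applied through a reused sparse factorization, a short polynomial of that resolvent acts as the filter, and the filtered block is B-orthonormalized. The toy matrices, shift and polynomial coefficients are illustrative assumptions, not the paper's setup.

```python
# Minimal sketch (not the paper's code) of filtering with a single resolvent
# R(rho) = C(rho)^{-1} B, where C(rho) = A - rho*B. The paper factors the banded
# C(rho) with a modified Cholesky (LDL^T) routine and works in IEEE-754 binary32;
# here SciPy's sparse LU stands in for the factorization step.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

def resolvent_factor(A, B, rho):
    """Factor C(rho) = A - rho*B once; every filter application reuses it."""
    return splu((A - rho * B).tocsc())

def apply_filter(lu, B, X, coeffs):
    """Apply p(R) X, where R = C(rho)^{-1} B and p has the given coefficients."""
    Y = np.zeros_like(X)
    T = X.copy()
    for c in coeffs:             # one resolvent action per polynomial degree
        T = lu.solve(B @ T)
        Y += c * T
    return Y

def b_orthonormalize(X, B):
    """B-orthonormalize the columns of X via a Cholesky factor of X^T B X."""
    L = np.linalg.cholesky(X.T @ (B @ X))
    return X @ np.linalg.inv(L).T

# toy banded pencil: A tridiagonal, B the identity (illustrative only)
n = 200
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
B = sp.identity(n, format="csr")
lu = resolvent_factor(A, B, rho=0.5)
X = np.random.default_rng(0).standard_normal((n, 8))
X = b_orthonormalize(apply_filter(lu, B, X, coeffs=[0.1, 0.3, 0.6]), B)
```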
{"title":"Single-Precision Calculation of Iterative Refinement of Eigenpairs of a Real Symmetric-Definite Generalized Eigenproblem by Using a Filter Composed of a Single Resolvent","authors":"H. Murakami","doi":"10.1145/3440722.3440784","DOIUrl":"https://doi.org/10.1145/3440722.3440784","url":null,"abstract":"By using a filter, we calculate approximate eigenpairs of a real symmetric-definite generalized eigenproblem Av = λBv whose eigenvalues are in a specified interval. In our experiments in this paper, the IEEE-754 single-precision floating-point (binary 32bit) number system is used for calculations. In general, a filter is constructed by using some resolvents with different shifts ρ. For a given vector x, an action of a resolvent is given by solving a system of linear equations C(ρ)y = Bx for y, here the coefficient C(ρ) = A − ρB is symmetric. We assume to solve this system of linear equations by matrix factorization of C(ρ), for example by the modified Cholesky method (LDLT decomposition method). When both matrices A and B are banded, C(ρ) is also banded and the modified Cholesky method for banded system can be used to solve the system of linear equations. The filter we used is either a polynomial of a resolvent with a real shift, or a polynomial of an imaginary part of a resolvent with an imaginary shift. We use only a single resolvent to construct the filter in order to reduce both amounts of calculation to factor matrices and especially storage to hold factors of matrices. The most disadvantage when we use only a single resolvent rather than many is, such a filter have poor properties especially when compuation is made in single-precision. Therefore, approximate eigenpairs required are not obtained in good accuracy if they are extracted from the set of vectors made by an application of a combination of B-orthonormalization and filtering to a set of initial random vectors. However, experiments show approximate eigenpairs required are refined well if they are extracted from the set of vectors obtained by a few applications of a combination of B-orthonormalization and filtering to a set of initial random vectors.","PeriodicalId":183674,"journal":{"name":"The International Conference on High Performance Computing in Asia-Pacific Region Companion","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131374910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Molecular-Continuum Flow Simulation in the Exascale and Big Data Era
Philipp Neumann, Vahid Jafari, P. Jarmatz, F. Maurer, Helene Wittenberg, Niklas Wittmer
Figure 1: (a) Slice through a coupled 3D vortex street simulation using a Lattice Boltzmann solver (CFD), also illustrating the location of the embedded MD domain (red box). (b) Y-component of the flow velocity in the center of the MD domain over time: noisy MD result (red dots), CFD result (blue line), and filter results for a median filter (blue dots) and a Gaussian filter (green dots) from scipy.ndimage.
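The post-processing step shown in Figure 1(b) can be reproduced in a few lines; the sketch below applies a median filter and a Gaussian filter from scipy.ndimage to a synthetic noisy velocity signal. The signal, filter width and sigma are made-up stand-ins for the actual MD data.

```python
# Smoothing a noisy "MD" velocity signal with the two scipy.ndimage filters
# named in the caption. The signal and the filter parameters are illustrative.
import numpy as np
from scipy.ndimage import median_filter, gaussian_filter1d

rng = np.random.default_rng(42)
t = np.linspace(0.0, 10.0, 2000)
u_cfd = 0.1 * np.sin(0.8 * t)                        # smooth CFD-like reference
u_md = u_cfd + 0.05 * rng.standard_normal(t.size)    # noisy MD-like samples

u_median = median_filter(u_md, size=51, mode="nearest")
u_gauss = gaussian_filter1d(u_md, sigma=15.0, mode="nearest")
```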
{"title":"Molecular-Continuum Flow Simulation in the Exascale and Big Data Era","authors":"Philipp Neumann, Vahid Jafari, P. Jarmatz, F. Maurer, Helene Wittenberg, Niklas Wittmer","doi":"10.1145/3440722.3440903","DOIUrl":"https://doi.org/10.1145/3440722.3440903","url":null,"abstract":"(b) Figure 1: (a) Slice through a coupled 3D vortex street simulation using a Lattice Boltzmann solver (CFD), also illustrating the location of the embedded MD domain (red box). (b) Y-component of flow velocity in the center of the MD domain over time: noisy result of MD (red dots), CFD result (blue line), filter results for a Median Filter (blue dots) and a Gaussian Filter (green dots) from scipy.ndimage","PeriodicalId":183674,"journal":{"name":"The International Conference on High Performance Computing in Asia-Pacific Region Companion","volume":"245 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134086497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Multi-scale Modelling of Urban Air Pollution with Coupled Weather Forecast and Traffic Simulation on HPC Architecture
L. Kornyei, Z. Horváth, A. Ruopp, Á. Kovács, Bence Liszkai
Urban air pollution is one of the global challenges to which over 3 million deaths are attributable yearly. Traffic emits over 40% of several contaminants, such as NO2 [10]. Directive 2008/50/EC of the European Commission prescribes the assessment of air quality by accumulating exceedances of contaminant concentration limits over a one-year period using measurement stations, which may be supplemented by modeling techniques to provide adequate information on spatial distribution. Computational models predict that small-scale spatial fluctuations are to be expected at the street level: local air flow phenomena can cluster pollutants or carry them far away from the location of emission [2]. The spread of the SARS-CoV-2 virus also interacts with urban air quality. Regions in lockdown show greatly reduced air pollution due to the drop in traffic [4]. Moreover, the correlation between the fatality rate of a previous respiratory disease, SARS 2002, and the Air Pollution Index suggests that bad air quality may double the fatality rate [6]. Since street-level pollution dispersion depends strongly on the daily weather, a one-year simulation with a model resolving short time scales is needed. Additionally, resolving street-level phenomena requires cell sizes of 1 to 4 meters in these regions, which forces the CFD methods to use simulation domains of 1 to 100 million cells. Memory and computational requirements for these tasks are enormous, so an HPC architecture is needed to obtain reasonable results within a manageable time frame. To tackle this challenge, the Urban Air Pollution (UAP) workflow is developed as a pilot of the HiDALGO project [7], which is funded by the H2020 framework of the European Union. The pilot is designed in a modular way, with the intention of developing it into a digital twin model later. Its standardized interfaces enable multiple software packages to be used in a specific module. At its core, a traffic simulation implemented in SUMO is coupled with a CFD simulation. Currently, OpenFOAM (v1906, v1912 and v2006) and Ansys Fluent (v19.2) are supported. This presentation focuses on the OpenFOAM implementation, as it proved more feasible and scalable on most HPC architectures. The incompressible unsteady Reynolds-averaged Navier–Stokes equations are solved with the PIMPLE method, Courant-number-based adaptive time stepping and transient atmospheric boundary conditions. The single-component NOx-type pollutant is calculated independently as a scalar with transport equations along the flow field. Pollutant emission is treated as a per-cell volumetric source that changes in time. The initial condition is obtained from a steady-state solution at the initial time with the SIMPLE method, using identical but stationary boundary conditions and source fields. Custom modules are developed for proper boundary-condition and source-term handling. The UAP workflow supports automatic 3D air-flow geometry and traffic network generation from OpenStreetMap data. Ground and building information
The main benchmarks were run on the local cluster PLEXI (18 nodes with 2x6-core Intel X5650, 48 GB RAM, 40 Gb InfiniBand), with additional investigations on EAGLE (PSNC, Poznan; 1119 nodes with 2x14-core Intel E5-2697v3, 64 GB RAM, 56 Gb InfiniBand) and the HAWK test system (HLRS, Stuttgart; 5632 nodes with 2x64-core AMD EPYC 7742, 256 GB RAM, 200 Gb InfiniBand HDR200). Tuning the OpenFOAM setup with optimized I/O, multi-level decomposition and cell-index renumbering raised the speedup on PLEXI from 18 to 102 for the 1M-cell case and from 49 to 77 for the 9M-cell model on 216 cores. On the HAWK test system, the maximum speedup was 133 for 1M cells and 401 for 9M cells, both on 2048 cores. On EAGLE, the 1M-cell model reached a maximum speedup of 104 on 448 cores. A saturation effect at one node's core count indicates memory-bandwidth-bound computation. Full-day simulation runs were also carried out for regions of five cities (Győr, Madrid, Stuttgart, Herrenberg and Graz) with random traffic and different mesh sizes of about 1.8 m and about 3M cells. On PLEXI with 48 cores, the complete CFD module runs in 2.7 hours for the smaller and 20 hours for the larger cell count. This makes a one-year simulation feasible with the coarse mesh on PLEXI and with the fine mesh on more powerful HPC architectures. Because of the high core-to-memory-channel ratio of the AMD processors, single-node parallel efficiency is poor in memory-bandwidth-limited applications, which supports our present findings and is comparable to speedup results of other CFD software on the same hardware [9]. Nevertheless, the node-based speedup of certain OpenFOAM simulations can exhibit superlinear behavior [1]. In summary, the UAP workflow and the OpenFOAM implementation of the CFD module achieve the goal of simulating one year within a manageable time frame. A simulation time of a few hours for one day of pollution also makes forecasting feasible with the current version. Future work includes the use of proper orthogonal decomposition (POD) [8], a model-order-reduction method, to eventually improve computation time substantially at the cost of limited accuracy. We also plan to test and benchmark GPGPU-based solvers, implement pollutant reactions, and extend validation with new air-quality measurement stations.
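As a loose illustration of two ingredients named above (Courant-number-based adaptive time stepping and scalar pollutant transport with a per-cell volumetric source), the 1D sketch below advances a NOx-like concentration field with an explicit upwind scheme. It is not the pilot's OpenFOAM setup; the wind field, source strength and grid are invented for illustration.

```python
# 1D toy model: adaptive dt from a Courant limit, upwind advection-diffusion of
# a scalar pollutant, and a per-cell volumetric emission source. Illustrative
# values only; the real pilot solves the 3D URANS equations in OpenFOAM.
import numpy as np

def adaptive_dt(u, dx, co_max=0.8, dt_max=1.0):
    """Largest dt keeping the cell Courant number |u|*dt/dx below co_max."""
    return min(dt_max, co_max * dx / max(np.max(np.abs(u)), 1e-12))

def transport_step(c, u, source, dx, dt, kappa=1.0e-4):
    """One explicit first-order upwind advection-diffusion step with a source."""
    dcdx_back = (c - np.roll(c, 1)) / dx
    dcdx_fwd = (np.roll(c, -1) - c) / dx
    dcdx = np.where(u > 0.0, dcdx_back, dcdx_fwd)        # upwind derivative
    lap = (np.roll(c, -1) - 2.0 * c + np.roll(c, 1)) / dx**2
    return c + dt * (-u * dcdx + kappa * lap + source)

n, dx = 400, 2.0                                         # 2 m cells (street scale)
x = np.arange(n) * dx
u = 1.5 + 0.5 * np.sin(2.0 * np.pi * x / (n * dx))       # synthetic wind [m/s]
c = np.zeros(n)                                          # pollutant concentration
source = np.zeros(n)
source[180:220] = 1.0e-3                                 # emitting road segment

t, t_end = 0.0, 600.0
while t < t_end:
    dt = adaptive_dt(u, dx)
    c = transport_step(c, u, source, dx, dt)
    t += dt
```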
{"title":"Multi-scale Modelling of Urban Air Pollution with Coupled Weather Forecast and Traffic Simulation on HPC Architecture","authors":"L. Kornyei, Z. Horváth, A. Ruopp, Á. Kovács, Bence Liszkai","doi":"10.1145/3440722.3440917","DOIUrl":"https://doi.org/10.1145/3440722.3440917","url":null,"abstract":"Urban air pollution is one of the global challenges to which over 3 million deaths are attributable yearly. Traffic is emitting over 40% of several contaminants, like NO2 [10]. The directive 2008/50/EC of the European Commission prescribes the assessment air quality by accumulating exceedance of contamination concentration limits over a one-year period using measurement stations, which may be supplemented by modeling techniques to provide adequate information on spatial distribution. Computational models do predict that small scale spatial fluctuation is expected on the street level: local air flow phenomena can cluster up pollutants or carry them away far from the location of emission [2]. The spread of the SARS-CoV-2 virus also interacts with urban air quality. Regions in lock down have highly reduced air pollution strain due to the drop of traffic [4]. Also, correlation between the fatality rate of a previous respiratory disease, SARS 2002, and Air Pollution Index suggests that bad air quality may double fatality rate [6]. At street level pollution dispersion highly depends on the daily weather, a one-year simulation low time scale model is needed. Additionally, to resolve street-level phenomena a cell size of 1 to 4 meters are utilized in these regions that requires CFD methods to use a simulation domain of 1 to 100 million cells. Memory and computational requirements for these tasks are enormous, so HPC architecture is needed to have reasonable results within a manageable time frame. To tackle this challenge, the Urban Air Pollution (UAP) workflow is developed as a pilot of the HiDALGO project [7], which is funded by the H2020 framework of the European Union. The pilot is designed in a modular way with the mindset to be developed into a digital twin model later. Its standardized interfaces enable multiple software to be used in a specific module. At its core, a traffic simulation implemented in SUMO is coupled with a CFD simulation. Currently OpenFOAM (v1906, v1912 and v2006) and Ansys Fluent (v19.2) are supported. This presentation focuses on the OpenFOAM implementation, as it proved more feasible and scalable on most HPC architectures. The incompressible unsteady Reynolds-averaged Navier– Stokes equations are solved with the PIMPLE method, Courant-number based adaptive time stepping and transient atmospheric boundary conditions. The single component NOx-type pollution is calculated independently as a scalar with transport equations along the flow field. Pollution emission is treated as a per cell volumetric source that changes in time. The initial condition is obtained from a steady state solution at the initial time with the SIMPLE method, using the identical, but stationary boundary conditions and source fields. Custom modules are developed for proper boundary condition and source term handling. The UAP workflow supports automatic 3D air flow geometry and traffic network generation from OpenStreetMap data. 
Ground and building information","PeriodicalId":183674,"journal":{"name":"The International Conference on High Performance Computing in Asia-Pacific Region Companion","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114143442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Distributed MLPerf ResNet50 Training on Intel Xeon Architectures with TensorFlow
Wei Wang, N. Hasabnis
MLPerf benchmarks, which measure training and inference performance of ML hardware and software, have published three sets of ML training results so far. In all sets of results, ResNet50v1.5 was used as a standard benchmark to showcase the latest developments on image recognition tasks. The latest MLPerf training round (v0.7) featured Intel's submission with TensorFlow. In this paper, we describe the recent optimization work that enabled this submission. In particular, we enabled the BFloat16 data type in the ResNet50v1.5 model as well as in Intel-optimized TensorFlow to exploit the full potential of 3rd-generation Intel Xeon Scalable processors that have built-in BFloat16 support. We also describe the performance optimizations as well as the state-of-the-art accuracy/convergence results of the ResNet50v1.5 model, achieved with large-scale distributed training (with up to 256 MPI workers) using Horovod. These results lay a solid foundation for supporting future MLPerf training submissions with large-scale Intel Xeon clusters.
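The general recipe described in the abstract, BFloat16 mixed precision in TensorFlow combined with Horovod data parallelism over MPI workers, can be sketched as follows. This is not Intel's MLPerf submission code: the Keras ResNet50 here is the v1 variant rather than v1.5, and the optimizer settings and the elided input pipeline are placeholders.

```python
# Hedged sketch: bfloat16 mixed precision plus Horovod data parallelism.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()                                                    # one MPI worker per process
tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")  # BF16 compute, FP32 variables

model = tf.keras.applications.ResNet50(weights=None, classes=1000)
base_lr = 0.1 * hvd.size()                                    # linear LR scaling with workers
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(base_lr, momentum=0.9))
model.compile(loss="sparse_categorical_crossentropy", optimizer=opt,
              metrics=["accuracy"])

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]  # sync initial weights
# dataset = ...  # ImageNet input pipeline, sharded across hvd.rank(), goes here
# model.fit(dataset, epochs=90, callbacks=callbacks,
#           verbose=1 if hvd.rank() == 0 else 0)
```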
{"title":"Distributed MLPerf ResNet50 Training on Intel Xeon Architectures with TensorFlow","authors":"Wei Wang, N. Hasabnis","doi":"10.1145/3440722.3440880","DOIUrl":"https://doi.org/10.1145/3440722.3440880","url":null,"abstract":"MLPerf benchmarks, which measure training and inference performance of ML hardware and software, have published three sets of ML training results so far. In all sets of results, ResNet50v1.5 was used as a standard benchmark to showcase the latest developments on image recognition tasks. The latest MLPerf training round (v0.7) featured Intel’s submission with TensorFlow. In this paper, we describe the recent optimization work that enabled this submission. In particular, we enabled BFloat16 data type in ResNet50v1.5 model as well as in Intel-optimized TensorFlow to exploit full potential of 3rd generation Intel Xeon scalable processors that have built-in BFloat16 support. We also describe the performance optimizations as well as the state-of-the-art accuracy/convergence results of ResNet50v1.5 model, achieved with large-scale distributed training (with upto 256 MPI workers) with Horovod. These results lay great foundation to support future MLPerf training submissions with large scale Intel Xeon clusters.","PeriodicalId":183674,"journal":{"name":"The International Conference on High Performance Computing in Asia-Pacific Region Companion","volume":"105 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132579719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Node-level Performance Optimizations in CFD Codes
Peter Wauligmann, Jakob Dürrwächter, Philipp Offenhäuser, A. Schlottke, M. Bernreuther, B. Dick
We present examples of beneficial node-level performance optimizations in three computational fluid-dynamics applications. In particular, we not only quantify the speedup achieved but also try to assess flexibility, readability, (performance) portability and labor effort.
{"title":"Node-level Performance Optimizations in CFD Codes","authors":"Peter Wauligmann, Jakob Dürrwächter, Philipp Offenhäuser, A. Schlottke, M. Bernreuther, B. Dick","doi":"10.1145/3440722.3440914","DOIUrl":"https://doi.org/10.1145/3440722.3440914","url":null,"abstract":"We present examples of beneficial node-level performance optimizations in three computational fluid-dynamics applications. In particular, we not only quantify the speedup achieved but also try to assess flexibility, readability, (performance) portability and labor effort.","PeriodicalId":183674,"journal":{"name":"The International Conference on High Performance Computing in Asia-Pacific Region Companion","volume":"114 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115037306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
High Performance Simulations of Quantum Transport using Manycore Computing
Yosang Jeong, H. Ryu
The Non-Equilibrium Green's Function (NEGF) method has been widely utilized in the field of nanoscience and nanotechnology to predict carrier transport behaviors in electronic device channels whose sizes are in a quantum regime. This work explores how much performance improvement can be achieved for NEGF computations by exploiting the unique features of manycore computing, where the core numerical step of NEGF computations involves a recursive process of matrix-matrix multiplication. The major techniques adopted for the performance enhancement are data restructuring, matrix tiling, thread scheduling, and offload computing, and we present an in-depth discussion of why they are critical to fully exploit the power of manycore computing hardware, including Intel Xeon Phi Knights Landing systems and NVIDIA general-purpose graphics processing unit (GPU) devices. The performance of the optimized algorithm has been tested on a single computing node, where the host is a Xeon Phi 7210 equipped with two NVIDIA Quadro GV100 GPU devices. The target structure of the NEGF simulations is a [100] silicon nanowire that consists of 100K atoms, involving a 1000K × 1000K complex Hamiltonian matrix. Through rigorous benchmark tests, we show that, with optimization techniques whose details are elaborately explained, the workload can be accelerated by a factor of up to ∼20 compared to the unoptimized case.
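The "recursive process of matrix-matrix multiplication" at the core of NEGF is commonly realized as a recursive Green's function (RGF) sweep over a block-tridiagonal Hamiltonian; the dense NumPy sketch below shows that structure for a toy system. It is only an assumption-laden illustration of the numerical pattern, not the authors' tiled, offloaded manycore implementation; block count, block size, energy and broadening are invented.

```python
# Recursive Green's function sweep for a block-tridiagonal Hamiltonian: a
# forward pass builds left-connected blocks g_i, a backward pass assembles the
# diagonal blocks of G = (E - H)^{-1}. Every step is a chain of matrix-matrix
# multiplications plus one inversion. Toy sizes and values only.
import numpy as np

def rgf_diagonal_blocks(h_diag, h_coup, energy, eta=1e-6):
    """Diagonal blocks G_ii of (E - H)^{-1} for a block-tridiagonal H."""
    z = energy + 1j * eta
    n = len(h_diag)
    ident = np.eye(h_diag[0].shape[0], dtype=complex)

    g_left = [np.linalg.inv(z * ident - h_diag[0])]          # forward sweep
    for i in range(1, n):
        sigma = h_coup[i - 1].conj().T @ g_left[-1] @ h_coup[i - 1]
        g_left.append(np.linalg.inv(z * ident - h_diag[i] - sigma))

    G = [None] * n                                           # backward sweep
    G[-1] = g_left[-1]
    for i in range(n - 2, -1, -1):
        G[i] = g_left[i] + g_left[i] @ h_coup[i] @ G[i + 1] \
               @ h_coup[i].conj().T @ g_left[i]
    return G

rng = np.random.default_rng(1)
blocks, size = 8, 64
h_diag = []
for _ in range(blocks):
    a = rng.standard_normal((size, size)) + 1j * rng.standard_normal((size, size))
    h_diag.append(0.5 * (a + a.conj().T))                    # Hermitian blocks
h_coup = [0.1 * rng.standard_normal((size, size)) for _ in range(blocks - 1)]

G_diag = rgf_diagonal_blocks(h_diag, h_coup, energy=0.5)
print(G_diag[0].shape)   # (64, 64)
```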
{"title":"High Performance Simulations of Quantum Transport using Manycore Computing","authors":"Yosang Jeong, H. Ryu","doi":"10.1145/3440722.3440879","DOIUrl":"https://doi.org/10.1145/3440722.3440879","url":null,"abstract":"The Non-Equilibrium Green’s Function (NEGF) has been widely utilized in the field of nanoscience and nanotechnology to predict carrier transport behaviors in electronic device channels of sizes in a quantum regime. This work explores how much performance improvement can be driven for NEGF computations with unique features of manycore computing, where the core numerical step of NEGF computations involves a recursive process of matrix-matrix multiplication. The major techniques adopted for the performance enhancement are data-restructuring, matrix-tiling, thread-scheduling, and offload computing and we present in-depth discussion on why they are critical to fully exploit the power of manycore computing hardware including Intel Xeon Phi Knights Landing systems and NVIDIA general-purpose graphic processing unit (GPU) devices. Performance of the optimized algorithm has been tested in a single computing node, where the host is Xeon Phi 7210 that is equipped with two NVIDIA Quadro GV100 GPU devices. The target structure of NEGF simulations is a [100] silicon nanowire that consists of 100K atoms involving a 1000K × 1000K complex Hamiltonian matrix. Through rigorous benchmark tests, we show, with optimization techniques whose details are elaborately explained, the workload can be accelerated almost by a factor of up to ∼ 20 compared to the unoptimized case.","PeriodicalId":183674,"journal":{"name":"The International Conference on High Performance Computing in Asia-Pacific Region Companion","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132977226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A Comparison of Parallel Profiling Tools for Programs utilizing the FFT
B. Leu, S. Aseeri, B. Muite
Performance monitoring is an important component of code optimization. Performance monitoring is also important for the beginning user, but can be difficult to configure appropriately. The overhead of the performance monitoring tools Craypat, FPMP, mpiP, Scalasca and TAU is measured using the default configurations likely to be chosen by a novice user, and is shown to be small when profiling Fast Fourier Transform based solvers for the Klein–Gordon equation built on 2decomp&FFT and on FFTE. The performance measurements help explain why, despite FFTE having a more efficient parallel algorithm, it is not always faster than 2decomp&FFT: the compiled single-core FFT is not as fast as that in FFTW, which is used by 2decomp&FFT.
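The overhead measurement itself is conceptually simple: run the same FFT-heavy kernel with and without instrumentation and compare wall-clock times. The sketch below does this with Python's cProfile standing in for the HPC tools named above; the kernel, problem size and step count are illustrative, not the paper's benchmark.

```python
# Time an FFT-heavy kernel once without and once with a profiler attached, then
# report the relative overhead. cProfile is only a stand-in for CrayPat, mpiP,
# Scalasca or TAU; sizes are illustrative.
import cProfile
import time
import numpy as np

def fft_kernel(n=128, steps=20):
    """Toy spectral step: repeated forward/inverse 3D FFTs, loosely mimicking
    the per-timestep work of an FFT-based Klein-Gordon solver."""
    u = np.random.default_rng(0).standard_normal((n, n, n))
    for _ in range(steps):
        u = np.fft.irfftn(0.999 * np.fft.rfftn(u), s=u.shape)
    return u

t0 = time.perf_counter()
fft_kernel()
t_plain = time.perf_counter() - t0

profiler = cProfile.Profile()
t0 = time.perf_counter()
profiler.runcall(fft_kernel)
t_profiled = time.perf_counter() - t0

print(f"relative profiling overhead: {(t_profiled - t_plain) / t_plain:.1%}")
```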
{"title":"A Comparison of Parallel Profiling Tools for Programs utilizing the FFT","authors":"B. Leu, S. Aseeri, B. Muite","doi":"10.1145/3440722.3440881","DOIUrl":"https://doi.org/10.1145/3440722.3440881","url":null,"abstract":"Performance monitoring is an important component of code optimization. Performance monitoring is also important for the beginning user, but can be difficult to configure appropriately. The overhead of the performance monitoring tools Craypat, FPMP, mpiP, Scalasca and TAU, are measured using default configurations likely to be choosen by a novice user and shown to be small when profiling Fast Fourier Transform based solvers for the Klein Gordon equation based on 2decomp&FFT and on FFTE. Performance measurements help explain that despite FFTE having a more efficient parallel algorithm, it is not always faster than 2decomp&FFT because the complied single core FFT is not as fast as that in FFTW which is used in 2decomp&FFT.","PeriodicalId":183674,"journal":{"name":"The International Conference on High Performance Computing in Asia-Pacific Region Companion","volume":"113 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115591535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Efficient Parallel Multigrid Method on Intel Xeon Phi Clusters
K. Nakajima, Balazs Gerofi, Y. Ishikawa, Masashi Horikoshi
The parallel multigrid method is expected to play an important role in scientific computing on exascale supercomputer systems for solving large-scale linear equations with sparse matrices. Because solving sparse linear systems is a very memory-bound process, an efficient method for storing the coefficient matrices is a crucial issue. In previous works, the authors implemented the sliced ELL method in parallel conjugate gradient solvers with multigrid preconditioning (MGCG) for an application to 3D groundwater flow through heterogeneous porous media (pGW3D-FVM), and excellent performance was obtained on large-scale multicore/manycore clusters. In the present work, the authors introduce SELL-C-σ to the MGCG solver and evaluate the performance of the solver with various types of OpenMP/MPI hybrid parallel programming models on the Oakforest-PACS (OFP) system at JCAHPC, using up to 1,024 nodes of Intel Xeon Phi. Because SELL-C-σ is suitable for wide-SIMD architectures such as Xeon Phi, the performance improvement over sliced ELL was more than 20%. This is one of the first examples of SELL-C-σ applied to the forward/backward substitutions in the ILU-type smoother of a multigrid solver. Furthermore, the effect of IHK/McKernel has been investigated, and it achieved an 11% improvement on 1,024 nodes.
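For readers unfamiliar with the storage format, the sketch below builds SELL-C-σ from a CSR matrix and performs an SpMV with it: rows are sorted by length within windows of σ rows, packed into chunks of C rows, and padded to the longest row of each chunk so the multiply can vectorize across the C rows of a chunk. It is an educational NumPy illustration, not the authors' optimized kernels, and the chunk parameters are arbitrary.

```python
# SELL-C-sigma construction from CSR and a reference SpMV, checked against SciPy.
import numpy as np
import scipy.sparse as sp

def csr_to_sell_c_sigma(A, C=8, sigma=64):
    A = A.tocsr()
    n = A.shape[0]
    row_len = np.diff(A.indptr)

    # sort rows by descending length inside each sigma-sized window
    perm = np.arange(n)
    for start in range(0, n, sigma):
        stop = min(start + sigma, n)
        order = np.argsort(-row_len[start:stop], kind="stable")
        perm[start:stop] = start + order

    chunks = []
    for cstart in range(0, n, C):
        rows = perm[cstart:cstart + C]
        width = int(row_len[rows].max()) if len(rows) else 0
        cols = np.zeros((len(rows), width), dtype=np.int64)
        vals = np.zeros((len(rows), width))
        for r, row in enumerate(rows):
            lo, hi = A.indptr[row], A.indptr[row + 1]
            cols[r, : hi - lo] = A.indices[lo:hi]
            vals[r, : hi - lo] = A.data[lo:hi]   # padding stays zero
        chunks.append((rows, cols, vals))
    return chunks

def sell_spmv(chunks, x):
    y = np.zeros(len(x))
    for rows, cols, vals in chunks:
        # one multiply-add per padded column, vectorized over the C chunk rows
        y[rows] = (vals * x[cols]).sum(axis=1)
    return y

A = sp.random(1000, 1000, density=0.01, format="csr", random_state=0)
x = np.random.default_rng(0).standard_normal(1000)
chunks = csr_to_sell_c_sigma(A, C=8, sigma=64)
assert np.allclose(sell_spmv(chunks, x), A @ x)
```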
{"title":"Efficient Parallel Multigrid Method on Intel Xeon Phi Clusters","authors":"K. Nakajima, Balazs Gerofi, Y. Ishikawa, Masashi Horikoshi","doi":"10.1145/3440722.3440882","DOIUrl":"https://doi.org/10.1145/3440722.3440882","url":null,"abstract":"The parallel multigrid method is expected to play an important role in scientific computing on exa-scale supercomputer systems for solving large-scale linear equations with sparse matrices. Because solving sparse linear systems is a very memory-bound process, efficient method for storage of coefficient matrices is a crucial issue. In the previous works, authors implemented sliced ELL method to parallel conjugate gradient solvers with multigrid preconditioning (MGCG) for the application on 3D groundwater flow through heterogeneous porous media (pGW3D-FVM), and excellent performance has been obtained on large-scale multicore/manycore clusters. In the present work, authors introduced SELL-C-σ to the MGCG solver, and evaluated the performance of the solver with various types of OpenMP/MPI hybrid parallel programing models on the Oakforest-PACS (OFP) system at JCAHPC using up to 1,024 nodes of Intel Xeon Phi. Because SELL-C-σ is suitable for wide-SIMD architecture, such as Xeon Phi, improvement of the performance over the sliced ELL was more than 20%. This is one of the first examples of SELL-C-σ applied to forward/backward substitutions in ILU-type smoother of multigrid solver. Furthermore, effects of IHK/McKernel has been investigated, and it achieved 11% improvement on 1,024 nodes.","PeriodicalId":183674,"journal":{"name":"The International Conference on High Performance Computing in Asia-Pacific Region Companion","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122273865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
An efficient halo approach for Euler-Lagrange simulations based on MPI-3 shared memory
Patrick Kopper, M. Pfeiffer, S. Copplestone, A. Beck
Euler-Lagrange methods are a common approach for the simulation of dispersed particle-laden flow, e.g. in turbomachinery. In this approach, the fluid is treated as a continuous phase with an Eulerian field solver, whereas the Lagrangian movement of the dispersed phase is described through the equations of motion for each individual particle. In high-performance computing, the load of the fluid phase depends only on the degrees of freedom, and load-balancing steps can be taken a priori, thereby ensuring optimal scaling. However, the discrete phase introduces local load imbalances that cannot easily be predicted, as generally neither the spatial particle distribution nor the computational cost of advancing particles relative to the fluid integration is known a priori. Runtime load balancing alleviates this problem by adjusting the local load on each processor according to information gathered during the simulation [4]. Since the load balancing step becomes part of the simulation time, its performance and appropriate scaling on modern HPC systems are of crucial importance. In this talk, we first present the FLEXI framework for the Euler-Lagrange system, then introduce the previous approach and highlight its difficulties. FLEXI is a high-order accurate, massively parallel CFD framework based on the Discontinuous Galerkin Spectral Element Method (DGSEM). It has shown excellent scaling properties for the fluid phase and was recently extended with particle tracking capabilities [1], developed together with the PICLas framework [2]. In FLEXI, the mesh is saved in the HDF5 format, allowing for parallel access, with the elements presorted along a space-filling curve (SFC). This approach has shown its suitability for fluid simulations, as each processor requires and accesses only the local mesh information, thereby reducing I/O on the underlying file system [3]. However, the particle phase needs additional information around the fluid domain to retain high computational efficiency, since particles can cross the local domain boundary at any point during a time step. In previous implementations, this “halo region” information was communicated between individual processors, causing significant CPU and network load for an extended period of time during initialization and each load balancing step. Therefore, we propose a method developed from scratch that utilizes modern MPI calls and is able to overcome most of the challenges of the previous approach. This reworked method utilizes MPI-3 shared memory to make mesh information available to all processors on a compute node. We perform a two-step, communication-free identification of all relevant mesh elements for a compute node. Furthermore, by making the mesh information accessible to all processors sharing local memory, we eliminate redundant calculations and reduce data duplication. We conclude by presenting examples of large-scale computations of particle-laden flows in complex turbomachinery systems and an outlook on future research challenges.
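A hedged mpi4py sketch of the MPI-3 shared-memory idea: one copy of (a piece of) the mesh is allocated per compute node and every rank on that node reads it directly, instead of each rank holding and communicating its own halo copy. The array layout and sizes are placeholders, not FLEXI's data structures.

```python
# Node-local shared mesh array via an MPI-3 shared-memory window (mpi4py).
# Run with e.g.: mpirun -np 4 python shared_mesh.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
node_comm = comm.Split_type(MPI.COMM_TYPE_SHARED)   # ranks sharing a node

n_elems, floats_per_elem = 100_000, 8
itemsize = MPI.DOUBLE.Get_size()
# only node-rank 0 allocates; the others attach with zero-byte contributions
nbytes = n_elems * floats_per_elem * itemsize if node_comm.rank == 0 else 0
win = MPI.Win.Allocate_shared(nbytes, itemsize, comm=node_comm)

buf, _ = win.Shared_query(0)                        # pointer to rank 0's segment
mesh = np.ndarray((n_elems, floats_per_elem), dtype=np.float64, buffer=buf)

if node_comm.rank == 0:
    mesh[:] = 0.0                                   # stand-in for reading the HDF5 mesh
    mesh[:, 0] = np.arange(n_elems)                 # e.g. element IDs along the SFC
node_comm.Barrier()                                 # everyone waits for the fill

# every rank on the node can now read the shared mesh without communication
local_slice = mesh[node_comm.rank::node_comm.size]
win.Free()
```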
{"title":"An efficient halo approach for Euler-Lagrange simulations based on MPI-3 shared memory","authors":"Patrick Kopper, M. Pfeiffer, S. Copplestone, A. Beck","doi":"10.1145/3440722.3440904","DOIUrl":"https://doi.org/10.1145/3440722.3440904","url":null,"abstract":"Euler-Lagrange methods are a common approach for simulation of dispersed particle-laden flow, e.g. in turbomachinery. In this approach, the fluid is treated as continuous phase with an Eulerian field solver whereas the Lagrangian movement of the dispersed phase is described through the equations of motion for each individual particle. In high-performance computing, the load of the fluid phase is only dependent on the degrees of freedom and load-balancing steps can be taken a priori, thereby ensuring optimal scaling. However, the discrete phase introduces local load imbalances that cannot easily predicted as generally neither the spatial particle distribution nor the computational cost for advancing particles in relation to the fluid integration are know a priori. Runtime load balancing alleviates this problem by adjusting the local load on each processor according to information gathered during the simulation [4]. Since the load balancing step becomes part of the simulation time, its performance and appropriate scaling on modern HPC systems becomes of crucial importance. In this talk, we will first present the FLEXI framework for the Euler-Lagrange system, and follow by introducing the previous approach and highlight its difficulties. FLEXI is a high-order accurate, massively parallel CFD framework based on the Discontinuous Galerkin Spectral Element Method (DGSEM). It has shown excellent scaling properties for the fluid phase and was recently extended by particle tracking capabilities [1], developed together with the PICLas framework [2]. In FLEXI, the mesh is saved in the HDF5 format, allowing for parallel access, with the elements presorted along a space-filling curve (SFC). This approach has shown its suitability for fluid simulations as each processor requires and accesses only the local mesh information, thereby reducing I/O on the underlying file system [3]. However, the particle phase needs additional information around the fluid domain to retain high computational efficiency since particles can cross the local domain boundary at any point during a time step. In previous implementations, this “halo region” information was communicated between each individual processor, causing significant CPU and network load for an extended period of time during initialization and each load balancing step. Therefore, we propose an method developed from scratch utilizing modern MPI calls and able to overcome most of the challenges in the previous approach. This reworked method utilizes MPI-3 shared memory to make mesh information available to all processors on a compute-node. We perform a two-step, communication-free identification of all relevant mesh elements for a compute-node. Furthermore, by making the mesh information accessible to all processors sharing local memory, we eliminate redundant calculations and reduce data duplication. 
We conclude by presenting examples of large scale computations of particle-laden flows in complex turbomachinery system","PeriodicalId":183674,"journal":{"name":"The International Conference on High Performance Computing in Asia-Pacific Region Companion","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116826805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1