2019 IEEE High Performance Extreme Computing Conference (HPEC): Latest Papers

Fast Large-Scale Algorithm for Electromagnetic Wave Propagation in 3D Media
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916219
M. Harris, M. H. Langston, Pierre-David Létourneau, G. Papanicolaou, J. Ezick, R. Lethin
We present a fast, large-scale algorithm for the simulation of electromagnetic waves (Maxwell’s equations) in three-dimensional inhomogeneous media. The algorithm has a complexity of $O(N \log N)$ and runs in parallel. Numerical simulations show the rapid treatment of problems with tens of millions of unknowns on a small shared-memory cluster (≤ 16 cores).
Citations: 3
Using Container Migration for HPC Workloads Resilience
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916436
Mohamad Sindi, John R. Williams
We share experiences in implementing a container-based HPC environment that could help sustain running HPC workloads on clusters. By running workloads inside containers, we are able to migrate them from cluster nodes anticipating hardware problems to healthy nodes while the workloads are running. Migration is done using the CRIU tool with no application modification. No major interruption or overhead is introduced to the workload. Various real HPC applications are tested. Tests are done with different hardware node specs, network interconnects, and MPI implementations. We also benchmark the applications on containers and compare performance to native. Results demonstrate successful migration of HPC workloads inside containers with minimal interruption, while maintaining the integrity of the results produced. We provide several YouTube videos demonstrating the migration tests. Benchmarks also show that application performance on containers is close to native. We discuss some of the challenges faced during implementation and solutions adopted. To the best of our knowledge, this work is the first to demonstrate successful migration of real MPI-based HPC workloads using CRIU and containers.
Citations: 7
Combinatorial Multigrid: Advanced Preconditioners For Ill-Conditioned Linear Systems
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916446
M. H. Langston, M. Harris, Pierre-David Létourneau, R. Lethin, J. Ezick
The Combinatorial Multigrid (CMG) technique is a practical and adaptable solver and combinatorial preconditioner for solving certain classes of large, sparse systems of linear equations. CMG is similar to Algebraic Multigrid (AMG) but replaces large groupings of fine-level variables with a single coarse-level one, resulting in simple and fast interpolation schemes. These schemes further provide control over the refinement strategies at different levels of the solver hierarchy depending on the condition number of the system being solved [1]. While many pre-existing solvers may be able to solve large, sparse systems with relatively low complexity, inversion may require $O(n^2)$ space; whereas, if we know that a linear operator has $\tilde{n}=O(n)$ nonzero elements, we desire to use $O(n)$ space in order to reduce communication as much as possible. Being able to invert sparse linear systems of equations, asymptotically as fast as the values can be read from memory, has been identified by the Defense Advanced Research Projects Agency (DARPA) and the Department of Energy (DOE) as increasingly necessary for scalable solvers and energy-efficient algorithms [2], [3] in scientific computing. Further, as industry and government agencies move towards exascale, fast solvers and communication-avoidance will be more necessary [4], [5]. In this paper, we present an optimized implementation of the Combinatorial Multigrid in C using PETSc and analyze the solution of various systems using the CMG approach as a preconditioner on much larger problems than have been presented thus far. We compare the number of iterations, setup times and solution times against other popular preconditioners for such systems, including Incomplete Cholesky and a Multigrid approach in PETSc against common problems, further exhibiting superior performance by the CMG.
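The idea of replacing a group of fine-level variables with a single coarse-level one can be illustrated with a small sketch. This is not the authors' PETSc implementation: the function names `coarsen_by_aggregation` and `prolong` are hypothetical, the clustering is supplied by hand (the abstract does not say how CMG chooses its groupings), and a dense list-of-lists matrix is used only for brevity.

```python
def coarsen_by_aggregation(matrix, cluster_of):
    """Build the Galerkin coarse operator R A R^T for piecewise-constant aggregation.

    matrix     : dense square matrix as a list of rows (illustrative only)
    cluster_of : cluster_of[i] is the coarse variable that fine variable i maps to
    """
    num_clusters = max(cluster_of) + 1
    coarse = [[0.0] * num_clusters for _ in range(num_clusters)]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value != 0.0:
                # Every fine entry is summed into the entry of its cluster pair.
                coarse[cluster_of[i]][cluster_of[j]] += value
    return coarse


def prolong(coarse_vector, cluster_of):
    """Interpolate coarse values to the fine level by copying per cluster."""
    return [coarse_vector[c] for c in cluster_of]


# Laplacian of a 4-vertex path graph, grouped into two clusters of two vertices.
path_laplacian = [
    [1, -1, 0, 0],
    [-1, 2, -1, 0],
    [0, -1, 2, -1],
    [0, 0, -1, 1],
]
clusters = [0, 0, 1, 1]
coarse_laplacian = coarsen_by_aggregation(path_laplacian, clusters)
```

Because each fine variable belongs to exactly one cluster, interpolation reduces to a copy, which is why such schemes are cheap to apply. In this example the coarse operator is again a graph Laplacian, now on two vertices.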
Citations: 2
Scalable Lazy-update Multigrid Preconditioners
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916504
Majid Rasouli, Vidhi Zala, R. Kirby, H. Sundar
Multigrid is one of the most effective methods for solving elliptic PDEs. It is algorithmically optimal and is robust when combined with Krylov methods. Algebraic multigrid is especially attractive due to its black-box nature. This, however, comes at the price of a higher setup cost, which can be significant for systems whose matrix changes frequently, making the setup cost difficult to amortize. In this work, we investigate several strategies for performing lazy updates to the multigrid hierarchy corresponding to changes in the system matrix. These include delayed updates, value updates without changing structure, process local changes, and full updates. We demonstrate that in many cases, the overhead of building the AMG hierarchy can be mitigated for rapidly changing system matrices.
Citations: 0
Evaluation of the Imbalance Evolution in Parallel Reservoir Simulation
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916495
M. Rogowski, Suha N. Kayum
Load balancing is a crucial factor affecting the performance of parallel applications. Improper work distribution leads to underutilization of computing resources and an unnecessary increase in runtime. This paper identifies the imbalance sources in reservoir simulation and characterizes them as static or dynamic. Simulation model properties that change over time, such as well management actions, are registered and correlated with performance characteristics, hence identifying sources of imbalance. The results are exploratory and used to validate the current approach of static grid-to-process and well-to-process assignment widely used in commercial parallel reservoir simulators. Areas in which implementing dynamic load balancing would be worthwhile are identified.
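One common way to quantify the imbalance discussed above is the ratio of the slowest process time to the mean process time, tracked per simulation step. The sketch below is an assumption about how such a measurement could look, not the metric used in the paper; the function names and the sample timings are made up for illustration.

```python
def imbalance_ratio(process_times):
    """Return max / mean of per-process times; 1.0 means perfectly balanced."""
    mean_time = sum(process_times) / len(process_times)
    return max(process_times) / mean_time


def imbalance_over_steps(times_per_step):
    """Compute the imbalance ratio for every simulation step."""
    return [imbalance_ratio(times) for times in times_per_step]


# Hypothetical per-process compute times (seconds) for three time steps.
timings = [
    [2.0, 2.0, 2.0, 2.0],  # balanced step
    [1.0, 1.0, 1.0, 5.0],  # one process is overloaded
    [1.0, 3.0, 1.0, 3.0],  # two processes carry most of the work
]
ratios = imbalance_over_steps(timings)
```

Under this reading, a ratio that stays constant across steps would point to a static source of imbalance, while a ratio that moves with events in the model would point to a dynamic one.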
Citations: 0
Design and Implementation of Knowledge Base for Runtime Management of Software Defined Hardware
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916328
Hongkuan Zhou, Ajitesh Srivastava, R. Kannan, V. Prasanna
PageRank is a fundamental graph algorithm to evaluate the importance of vertices in a graph. In this paper, we present an efficient parallel PageRank design based on an edge-centric scatter-gather model. To overcome the poor locality of PageRank and optimize the memory performance, we develop a fast and efficient partitioning technique. We first partition all the vertices into non-overlapping vertex sets such that the data of each vertex set can fit in the cache; then we sort the outgoing edges of each vertex set based on the destination vertices to minimize random memory writes. The partitioning technique significantly reduces random accesses to main memory and improves the sustained memory bandwidth by 3×. It also enables efficient parallel execution on multicore platforms; we use distinct cores to execute the computations of distinct vertex sets in parallel to achieve speedup. We implement our design on a 16-core Intel Xeon processor and use various large-scale real-life and synthetic datasets for evaluation. Compared with the PageRank Pipeline Benchmark, our design achieves 12× to 19× speedup for all the datasets.
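The partitioning scheme in the abstract above (non-overlapping vertex sets, outgoing edges sorted by destination, then scatter and gather phases) can be sketched as follows. This is a sequential illustration under stated assumptions, not the authors' multicore design: the function name, the `partition_size` parameter, the damping factor of 0.85, and the uniform redistribution of rank from vertices without outgoing edges are all choices made here.

```python
def partitioned_pagerank(num_vertices, edges, partition_size,
                         damping=0.85, iterations=20):
    """Edge-centric scatter-gather PageRank over contiguous vertex partitions."""
    num_parts = (num_vertices + partition_size - 1) // partition_size
    out_degree = [0] * num_vertices
    part_edges = [[] for _ in range(num_parts)]
    for src, dst in edges:
        out_degree[src] += 1
        # Partition p owns vertices [p * partition_size, (p + 1) * partition_size).
        part_edges[src // partition_size].append((src, dst))
    # Sort each partition's outgoing edges by destination so that updates
    # are produced in destination order instead of at random addresses.
    for bucket in part_edges:
        bucket.sort(key=lambda edge: edge[1])

    rank = [1.0 / num_vertices] * num_vertices
    for _ in range(iterations):
        # Scatter: each source partition emits (dst, contribution) messages
        # into the bin of the partition that owns the destination.
        bins = [[] for _ in range(num_parts)]
        for bucket in part_edges:
            for src, dst in bucket:
                bins[dst // partition_size].append((dst, rank[src] / out_degree[src]))
        # Gather: each partition accumulates only the messages in its own bin.
        incoming = [0.0] * num_vertices
        for messages in bins:
            for dst, value in messages:
                incoming[dst] += value
        # Rank held by vertices without outgoing edges is spread uniformly.
        dangling = sum(rank[v] for v in range(num_vertices) if out_degree[v] == 0)
        rank = [
            (1.0 - damping) / num_vertices
            + damping * (incoming[v] + dangling / num_vertices)
            for v in range(num_vertices)
        ]
    return rank


# Small example: a 3-cycle plus one extra vertex that points into it.
example_rank = partitioned_pagerank(4, [(0, 1), (1, 2), (2, 0), (3, 0)], partition_size=2)
```

In a parallel version, each bin in the gather phase touches only one partition's vertices, so distinct partitions could be processed by distinct cores without write conflicts.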
Runtime-reconfigurable software coupled with reconfigurable hardware is highly desirable as a means of maximizing runtime efficiency without compromising programmability. Compilers for such software systems are extremely difficult to design because they must leverage different types of hardware at runtime. To address the need for static and dynamic compiler optimization of workflows matched to dynamically reconfigurable hardware, we propose a novel design of the central component of a dynamic software compiler for software defined hardware. Our comprehensive design focuses not only on static knowledge but also on semi-supervised extraction of knowledge from program execution and on developing its performance models. Specifically, our novel dynamic and extensible knowledge base 1) continuously gathers knowledge during the execution of workflows and 2) identifies the optimal implementation of a workflow on the best (available) hardware configuration. It plays a central role in storing information from, and providing information to, the other components of the compiler as well as the human analyst. Through a rich tripartite graph representation, the knowledge base captures and learns extensive information on decomposing and mapping code steps to kernels and on mapping kernels to available hardware configurations. The knowledge base is implemented using the C++ Boost library and is capable of quickly processing offline and online queries and updates. We show that our knowledge base can answer queries within 1 ms regardless of the number of workflows it stores. To the best of our knowledge, this is the first design of a dynamic and extensible knowledge base to support the compilation of high-level languages to leverage arbitrary reconfigurable platforms.
Citations: 18
A Parallel Simulation Approach to ACAS X Development
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916301
A. Gjersvik, Robert J. Moss
With a rapidly growing and evolving National Airspace System (NAS), ACAS X is intended to be the next-generation airborne collision avoidance system that can meet the demands its predecessor could not. The ACAS X algorithms are developed in the Julia programming language and are exercised in simulation environments tailored to test different characteristics of the system. Massive parallelization of these simulation environments has been implemented on the Lincoln Laboratory Supercomputing Center cluster in order to expedite the design and performance optimization of the system. This work outlines the approach to parallelization of one of our simulation tools and presents the resulting simulation speedups as well as a discussion on how it will enhance system characterization and design. Parallelization has made our simulation environment 33 times faster, which has greatly sped up the development process of ACAS X.
Citations: 0
Distributed Direction-Optimizing Label Propagation for Community Detection
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916215
Xu T. Liu, J. Firoz, Marcin Zalewski, M. Halappanavar, K. Barker, A. Lumsdaine, A. Gebremedhin
Designing a scalable algorithm for community detection is challenging due to the simultaneous need for both high performance and quality of solution. We propose a new distributed algorithm for community detection based on a novel Label Propagation algorithm. The algorithm is inspired by the direction optimization technique in graph traversal algorithms, relies on the use of frontiers, and alternates between abstractions called label push and label pull. This organization creates flexibility and affords us opportunities for balancing performance and quality of solution. We implement our algorithm in distributed memory with the active-message based asynchronous many-task runtime AM++. We experiment with two seeding strategies for the initial seeding stage, namely, random seeding and degree seeding. With the Graph Challenge dataset, our distributed implementation, in conjunction with the runtime support, detects the communities in graphs having 20 million vertices in less than one second while achieving reasonably high quality of solution.
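To make the roles of frontiers, push, and pull concrete, here is a minimal single-process sketch of frontier-based label propagation. It is not the authors' distributed AM++ algorithm. The assumptions made here are: a vertex "pulls" by adopting the most frequent label among its neighbours, ties go to the largest label, updates are applied in place in ascending vertex order, and a vertex whose label changed "pushes" by activating its neighbours for the next round.

```python
from collections import Counter


def label_propagation(adjacency, max_rounds=20):
    """Frontier-based label propagation on an undirected graph.

    adjacency: dict mapping each vertex to a list of its neighbours.
    Returns a dict mapping each vertex to its community label.
    """
    labels = {v: v for v in adjacency}  # every vertex starts in its own community
    active = set(adjacency)             # initial frontier: all vertices
    for _ in range(max_rounds):
        if not active:
            break
        changed = set()
        for v in sorted(active):
            if not adjacency[v]:
                continue
            # Pull: read the current labels of all neighbours.
            counts = Counter(labels[u] for u in adjacency[v])
            best = max(counts.values())
            new_label = max(label for label, c in counts.items() if c == best)
            if new_label != labels[v]:
                labels[v] = new_label
                changed.add(v)
        # Push: vertices that changed activate their neighbours.
        active = {u for v in changed for u in adjacency[v]}
    return labels


# Two triangles joined by the single edge 2-3.
graph = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
communities = label_propagation(graph)
```

The result depends on the tie-breaking rule and the visiting order, which is one reason seeding and scheduling choices affect solution quality in label propagation.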
Citations: 7
Breadth-First Search on Dynamic Graphs using Dynamic Parallelism on the GPU
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916476
Dominik Tödling, Martin Winter, M. Steinberger
Breadth-First Search is an important basis for many different graph-based algorithms with applications ranging from peer-to-peer networking to garbage collection. However, the performance of different approaches depends strongly on the type of graph. In this paper, we present an efficient algorithm that performs well on a variety of different graphs. As part of this, we look into utilizing dynamic parallelism in order to both reduce overhead from latency between the CPU and GPU, as well as speed up the algorithm itself. Lastly, we integrate the algorithm with the faimGraph framework for dynamic graphs and examine the relative performance to a Compressed-Sparse-Row data structure. We show that our algorithm can be well adapted to the dynamic setting and outperforms another competing dynamic graph framework on our test set.
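For reference, the Compressed-Sparse-Row baseline mentioned above stores all adjacency lists in one array indexed by per-vertex offsets. The sketch below shows a level-synchronous, frontier-based BFS over that layout in plain sequential Python. It does not reproduce the GPU algorithm or dynamic parallelism; the function name and array names are chosen here for illustration.

```python
def bfs_csr(row_offsets, col_indices, source):
    """Level-synchronous BFS on a graph stored in CSR form.

    row_offsets : length n + 1; neighbours of u are
                  col_indices[row_offsets[u]:row_offsets[u + 1]]
    col_indices : concatenated adjacency lists
    Returns a list of depths, with -1 for unreachable vertices.
    """
    num_vertices = len(row_offsets) - 1
    depth = [-1] * num_vertices
    depth[source] = 0
    frontier = [source]
    level = 0
    while frontier:
        level += 1
        next_frontier = []
        # On a GPU, each pass over the frontier would be one parallel step.
        for u in frontier:
            for k in range(row_offsets[u], row_offsets[u + 1]):
                v = col_indices[k]
                if depth[v] == -1:
                    depth[v] = level
                    next_frontier.append(v)
        frontier = next_frontier
    return depth


# Edges: 0->1, 0->2, 1->3, 2->3; vertex 4 is isolated.
offsets = [0, 2, 3, 4, 4, 4]
columns = [1, 2, 3, 3]
depths = bfs_csr(offsets, columns, source=0)
```

In a host-driven GPU implementation the loop condition `while frontier` is typically where the CPU has to synchronize with the device after each level, which is the latency the abstract refers to.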
Citations: 1
Write Quick, Run Fast: Sparse Deep Neural Network in 20 Minutes of Development Time via SuiteSparse:GraphBLAS
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916550
T. Davis, M. Aznaveh, Scott P. Kolodziej
SuiteSparse:GraphBLAS is a full implementation of the GraphBLAS standard, which provides a powerful and expressive framework for creating graph algorithms based on the elegant mathematics of sparse matrix operations on a semiring. Algorithms written in GraphBLAS achieve high performance with minimal development time. Using GraphBLAS, it took a mere 20 minutes to write a first-cut computational kernel that solves the Sparse Deep Neural Network Graph Challenge. Understanding the problem description and file format, writing code to read in the files that define the problem, and comparing our results with the reference solution took a full day. The kernel consists of a single for-loop around 4 lines of code, all of which are calls to GraphBLAS, and it worked perfectly the first time it was compiled. The sequential performance of the GraphBLAS solution is 3x to 5x faster than the MATLAB reference implementation. OpenMP parallelism gives an additional 10x to 15x speedup on a 20-core Intel processor, 17x on an IBM Power8 system, and 20x on a Power9 system, for the largest problems. Since SuiteSparse:GraphBLAS does not yet employ MPI, this was added at the application level, a development effort that took one week, primarily because of difficulties in resolving a load-balancing issue in the MPI-based parallel algorithm.
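The shape of such a kernel, a single loop over layers with a handful of sparse operations per layer, can be sketched in plain Python. This does not use GraphBLAS and is not the authors' code. It assumes the usual form of sparse DNN inference, where each layer computes a sparse product, adds a bias to the stored entries, keeps only positive values (ReLU), and clips them at an upper bound. The bound of 32, the scalar per-layer bias, and the dict-of-dicts matrix layout are assumptions made for this example.

```python
def sparse_matmul(left, right):
    """Multiply two sparse matrices stored as {row: {col: value}}."""
    product = {}
    for i, row in left.items():
        acc = {}
        for k, a in row.items():
            for j, b in right.get(k, {}).items():
                acc[j] = acc.get(j, 0.0) + a * b
        if acc:
            product[i] = acc
    return product


def sparse_dnn_inference(features, weights, biases, cap=32.0):
    """Run Y <- min(ReLU(Y W + b), cap) for every layer, keeping Y sparse."""
    y = features
    for w, bias in zip(weights, biases):
        z = sparse_matmul(y, w)               # sparse matrix-matrix multiply
        y = {}
        for i, row in z.items():
            kept = {j: min(v + bias, cap)     # add bias, then clip at cap
                    for j, v in row.items()
                    if v + bias > 0}          # ReLU drops non-positive entries
            if kept:
                y[i] = kept
    return y


# One input row with two features and two small hypothetical layers.
inputs = {0: {0: 1.0, 1: 2.0}}
layer1 = {0: {0: 1.0}, 1: {0: 1.0, 1: -1.0}}
layer2 = {0: {0: 20.0}}
after_one = sparse_dnn_inference(inputs, [layer1], [-0.5])
after_two = sparse_dnn_inference(inputs, [layer1, layer2], [-0.5, -0.5])
```

Each of the four steps inside the loop (multiply, add bias, drop non-positive entries, clip) corresponds to one whole-matrix operation, which is how a matrix-oriented API can express the kernel in about four calls.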
Citations: 25
Journal
2019 IEEE High Performance Extreme Computing Conference (HPEC)