2019 IEEE High Performance Extreme Computing Conference (HPEC): Latest Papers

Fast Large-Scale Algorithm for Electromagnetic Wave Propagation in 3D Media
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916219
M. Harris, M. H. Langston, Pierre-David Létourneau, G. Papanicolaou, J. Ezick, R. Lethin
We present a fast, large-scale algorithm for the simulation of electromagnetic waves (Maxwell’s equations) in three-dimensional inhomogeneous media. The algorithm has a complexity of $O(N \log N)$ and runs in parallel. Numerical simulations show the rapid treatment of problems with tens of millions of unknowns on a small shared-memory cluster (≤ 16 cores).
Citations: 3
Using Container Migration for HPC Workloads Resilience
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916436
Mohamad Sindi, John R. Williams
We share experiences in implementing a container-based HPC environment that could help sustain running HPC workloads on clusters. By running workloads inside containers, we are able to migrate them from cluster nodes anticipating hardware problems to healthy nodes while the workloads are running. Migration is done using the CRIU tool with no application modification. No major interruption or overhead is introduced to the workload. Various real HPC applications are tested. Tests are done with different hardware node specs, network interconnects, and MPI implementations. We also benchmark the applications on containers and compare performance to native. Results demonstrate successful migration of HPC workloads inside containers with minimal interruption, while maintaining the integrity of the results produced. We provide several YouTube videos demonstrating the migration tests. Benchmarks also show that application performance on containers is close to native. We discuss some of the challenges faced during implementation and the solutions adopted. To the best of our knowledge, this work is the first to demonstrate successful migration of real MPI-based HPC workloads using CRIU and containers.
Citations: 7
Combinatorial Multigrid: Advanced Preconditioners For Ill-Conditioned Linear Systems
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916446
M. H. Langston, M. Harris, Pierre-David Létourneau, R. Lethin, J. Ezick
The Combinatorial Multigrid (CMG) technique is a practical and adaptable solver and combinatorial preconditioner for solving certain classes of large, sparse systems of linear equations. CMG is similar to Algebraic Multigrid (AMG) but replaces large groupings of fine-level variables with a single coarse-level one, resulting in simple and fast interpolation schemes. These schemes further provide control over the refinement strategies at different levels of the solver hierarchy depending on the condition number of the system being solved [1]. While many pre-existing solvers may be able to solve large, sparse systems with relatively low complexity, inversion may require $O(n^2)$ space; whereas, if we know that a linear operator has $\tilde{n}=O(n)$ nonzero elements, we desire to use $O(n)$ space in order to reduce communication as much as possible. Being able to invert sparse linear systems of equations, asymptotically as fast as the values can be read from memory, has been identified by the Defense Advanced Research Projects Agency (DARPA) and the Department of Energy (DOE) as increasingly necessary for scalable solvers and energy-efficient algorithms [2], [3] in scientific computing. Further, as industry and government agencies move towards exascale, fast solvers and communication-avoidance will be more necessary [4], [5]. In this paper, we present an optimized implementation of the Combinatorial Multigrid in C using PETSc and analyze the solution of various systems using the CMG approach as a preconditioner on much larger problems than have been presented thus far. We compare the number of iterations, setup times and solution times against other popular preconditioners for such systems, including Incomplete Cholesky and a Multigrid approach in PETSc against common problems, further exhibiting superior performance by the CMG.
Citations: 2
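The coarsening idea in the CMG abstract above (many fine-level variables replaced by one coarse-level variable) can be illustrated with a minimal sketch. It assumes a piecewise-constant interpolation in which every fine variable belongs to exactly one cluster, so the coarse operator is the Galerkin product $P^T A P$ for a 0/1 membership matrix $P$. The function name `coarsen` and the cluster assignment are hypothetical, and this is not the authors' PETSc implementation.

```python
def coarsen(A, cluster_of):
    """Return the coarse operator P^T A P for a 0/1 membership matrix P.

    A is a dense square matrix given as a list of rows; cluster_of[i] is the
    coarse index assigned to fine variable i (an illustrative input, not a
    clustering produced by CMG itself).
    """
    n_coarse = max(cluster_of) + 1
    coarse = [[0.0] * n_coarse for _ in range(n_coarse)]
    for i, row in enumerate(A):
        for j, a_ij in enumerate(row):
            # Entry (i, j) is accumulated into the pair of clusters
            # that contain fine variables i and j.
            coarse[cluster_of[i]][cluster_of[j]] += a_ij
    return coarse


# Graph Laplacian of the path 0 - 1 - 2 - 3 with unit edge weights.
A = [
    [1.0, -1.0, 0.0, 0.0],
    [-1.0, 2.0, -1.0, 0.0],
    [0.0, -1.0, 2.0, -1.0],
    [0.0, 0.0, -1.0, 1.0],
]

# Merge {0, 1} into coarse variable 0 and {2, 3} into coarse variable 1.
A_coarse = coarsen(A, [0, 0, 1, 1])
```

With this clustering the coarse operator is again a graph Laplacian, here of a single edge joining the two clusters, which is why such schemes can be applied recursively.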
Scalable Lazy-update Multigrid Preconditioners
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916504
Majid Rasouli, Vidhi Zala, R. Kirby, H. Sundar
Multigrid is one of the most effective methods for solving elliptic PDEs. It is algorithmically optimal and is robust when combined with Krylov methods. Algebraic multigrid (AMG) is especially attractive due to its black-box nature. However, this comes with a higher setup cost, which can be significant for systems whose matrix changes frequently, making the setup cost difficult to amortize. In this work, we investigate several strategies for performing lazy updates to the multigrid hierarchy in response to changes in the system matrix. These include delayed updates, value updates without changing structure, process-local changes, and full updates. We demonstrate that in many cases, the overhead of building the AMG hierarchy can be mitigated for rapidly changing system matrices.
Citations: 0
Evaluation of the Imbalance Evolution in Parallel Reservoir Simulation
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916495
M. Rogowski, Suha N. Kayum
Load balancing is a crucial factor affecting the performance of parallel applications. Improper work distribution leads to underutilization of computing resources and an unnecessary increase in runtime. This paper identifies the imbalance sources in reservoir simulation and characterizes them as static or dynamic. Simulation model properties that change over time, such as well management actions, are registered and correlated with performance characteristics, thereby identifying sources of imbalance. The results are exploratory and are used to validate the current approach of static grid-to-process and well-to-process assignment widely used in commercial parallel reservoir simulators. Areas in which implementing dynamic load balancing would be worthwhile are identified.
Citations: 0
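The reservoir-simulation abstract above does not say how imbalance is measured. One common choice, shown here only as an assumption, is the ratio of the slowest process's time to the mean time across processes, tracked per time step so that static and dynamic imbalance can be told apart. The function name and the timing values are hypothetical.

```python
def imbalance_factor(times):
    """Ratio of the slowest process's time to the mean time.

    A value of 1.0 means perfectly balanced work; larger values mean the
    slowest process is holding the others back.
    """
    mean = sum(times) / len(times)
    return max(times) / mean


# Hypothetical per-process compute times (seconds) for two time steps.
per_step = [
    [2.0, 2.0, 2.0, 2.0],  # balanced step
    [2.0, 2.0, 2.0, 6.0],  # one process does extra work at this step
]
factors = [imbalance_factor(times) for times in per_step]
```

A factor that stays constant over the run points to a static source such as the initial grid-to-process assignment, while a factor that jumps at specific steps points to a dynamic source.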
[HPEC 2019 Copyright notice]
Pub Date : 2019-09-01 DOI: 10.1109/hpec.2019.8916557
Citations: 0
Breadth-First Search on Dynamic Graphs using Dynamic Parallelism on the GPU
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916476
Dominik Tödling, Martin Winter, M. Steinberger
Breadth-First Search is an important basis for many different graph-based algorithms with applications ranging from peer-to-peer networking to garbage collection. However, the performance of different approaches depends strongly on the type of graph. In this paper, we present an efficient algorithm that performs well on a variety of different graphs. As part of this, we look into utilizing dynamic parallelism in order to both reduce overhead from latency between the CPU and GPU and speed up the algorithm itself. Lastly, we integrate the algorithm with the faimGraph framework for dynamic graphs and examine its performance relative to a Compressed-Sparse-Row data structure. We show that our algorithm can be well adapted to the dynamic setting and outperforms another competing dynamic graph framework on our test set.
Citations: 1
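For readers unfamiliar with the baseline data structure named in the BFS abstract above, here is a minimal sequential sketch of level-synchronous breadth-first search over a Compressed-Sparse-Row graph. It is not the authors' GPU algorithm: a GPU version would expand each frontier in parallel, and the array names `row_offsets` and `col_indices` are assumptions chosen for illustration.

```python
def bfs_levels(row_offsets, col_indices, source):
    """Return the BFS level of every vertex, or -1 if it is unreachable.

    The neighbours of vertex u are col_indices[row_offsets[u]:row_offsets[u + 1]].
    """
    num_vertices = len(row_offsets) - 1
    level = [-1] * num_vertices
    level[source] = 0
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for u in frontier:
            for k in range(row_offsets[u], row_offsets[u + 1]):
                v = col_indices[k]
                if level[v] == -1:  # first visit fixes the level
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level


# Directed edges 0->1, 0->2, 1->3, 2->3; vertex 4 has no edges.
row_offsets = [0, 2, 3, 4, 4, 4]
col_indices = [1, 2, 3, 3]
levels = bfs_levels(row_offsets, col_indices, 0)
```

CSR stores all adjacency lists in one contiguous array, which makes traversal fast but edge insertion or deletion expensive; that trade-off is the motivation for comparing against a dynamic graph framework.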
Write Quick, Run Fast: Sparse Deep Neural Network in 20 Minutes of Development Time via SuiteSparse:GraphBLAS
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916550
T. Davis, M. Aznaveh, Scott P. Kolodziej
SuiteSparse:GraphBLAS is a full implementation of the GraphBLAS standard, which provides a powerful and expressive framework for creating graph algorithms based on the elegant mathematics of sparse matrix operations on a semiring. Algorithms written in GraphBLAS achieve high performance with minimal development time. Using GraphBLAS, it took a mere 20 minutes to write a first-cut computational kernel that solves the Sparse Deep Neural Network Graph Challenge. Understanding the problem description and file format, writing code to read in the files that define the problem, and comparing our results with the reference solution took a full day. The kernel consists of a single for-loop around 4 lines of code, all of which are calls to GraphBLAS, and it worked perfectly the first time it was compiled. The sequential performance of the GraphBLAS solution is 3x to 5x faster than the MATLAB reference implementation. OpenMP parallelism gives an additional 10x to 15x speedup on a 20-core Intel processor, 17x on an IBM Power8 system, and 20x on a Power9 system, for the largest problems. Since SuiteSparse:GraphBLAS does not yet employ MPI, this was added at the application level, a development effort that took one week, primarily because of difficulties in resolving a load-balancing issue in the MPI-based parallel algorithm.
Citations: 25
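The GraphBLAS abstract above describes the kernel as a single for-loop over the layers. The sketch below shows the shape of such a loop in plain Python rather than GraphBLAS, under the assumption that each layer computes ReLU(Y W + b) with a scalar bias added to every entry the sparse product produces. The dict-of-dicts storage, the function names, and the omission of any other rule from the challenge specification are all simplifications made for illustration.

```python
def sparse_layer(Y, W, bias):
    """One layer, ReLU(Y @ W + bias), on dict-of-dicts sparse matrices.

    Y[i][k] and W[k][j] hold only nonzero entries. The bias is added to each
    entry produced by the product, and non-positive results are dropped.
    """
    out = {}
    for i, row in Y.items():
        acc = {}
        for k, y_ik in row.items():
            for j, w_kj in W.get(k, {}).items():
                acc[j] = acc.get(j, 0.0) + y_ik * w_kj
        kept = {j: v + bias for j, v in acc.items() if v + bias > 0.0}
        if kept:
            out[i] = kept
    return out


def infer(Y, layers, bias):
    """Apply every layer in order; this is the single loop over layers."""
    for W in layers:
        Y = sparse_layer(Y, W, bias)
    return Y


# Tiny hypothetical network: one input row, two layers, bias of -1.
Y0 = {0: {0: 1.0, 1: 2.0}}
W1 = {0: {0: 1.0}, 1: {0: 1.0, 1: 0.25}}
W2 = {0: {1: 1.0}}
result = infer(Y0, [W1, W2], -1.0)
```

In GraphBLAS the body of the loop would be a handful of library calls on sparse matrices, which is what keeps the kernel short.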
A Novel Design of Adaptive and Hierarchical Convolutional Neural Networks using Partial Reconfiguration on FPGA
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916237
Mohammad Farhadi, Mehdi Ghasemi, Yezhou Yang
Nowadays most research in visual recognition using Convolutional Neural Networks (CNNs) follows the “deeper model with deeper confidence” belief to gain higher recognition accuracy. At the same time, a deeper model brings heavier computation. On the other hand, for a large share of recognition challenges, a system can classify images correctly using simple models, or so-called shallow networks. Moreover, the implementation of CNNs faces size, weight, and energy constraints on embedded devices. In this paper, we implement adaptive switching between shallow and deep networks to reach the highest throughput on a resource-constrained MPSoC with CPU and FPGA. To this end, we develop and present a novel architecture for CNNs in which a gate decides whether using the deeper model is beneficial. Due to resource limitations on the FPGA, partial reconfiguration is used to accommodate deep CNNs on the FPGA resources. We report experimental results on the CIFAR-10, CIFAR-100, and SVHN datasets to validate our approach. Using a confidence metric as the decision-making factor, only 69.8%, 71.8%, and 43.8% of the computation in the deepest network is done for CIFAR-10, CIFAR-100, and SVHN, respectively, while maintaining the desired accuracy with a throughput of around 400 images per second on the SVHN dataset. https://github.com/mfarhadi/AHCNN.
Citations: 22
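The gating idea in the adaptive CNN abstract above can be sketched as a two-stage cascade. The abstract only says that a confidence metric drives the decision; using the largest class probability from the shallow model as that metric, the threshold value, and the stand-in models below are all assumptions made for illustration, not details of the authors' design.

```python
def classify_adaptive(image, shallow, deep, threshold):
    """Return (predicted_class, model_used).

    The shallow model always runs. The deep model runs only when the shallow
    model's top class probability falls below the threshold.
    """
    probs = shallow(image)
    if max(probs) >= threshold:
        return probs.index(max(probs)), "shallow"
    probs = deep(image)
    return probs.index(max(probs)), "deep"


# Stand-in "models" that look up precomputed class probabilities.
def shallow_model(image):
    return image["shallow"]


def deep_model(image):
    return image["deep"]


easy = {"shallow": [0.9, 0.1], "deep": [0.95, 0.05]}
hard = {"shallow": [0.55, 0.45], "deep": [0.2, 0.8]}

easy_result = classify_adaptive(easy, shallow_model, deep_model, 0.8)
hard_result = classify_adaptive(hard, shallow_model, deep_model, 0.8)
```

The throughput gain comes from the fraction of inputs that stop at the shallow stage, so the threshold trades accuracy against how often the deep network is invoked.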
IdPrism: Rapid Analysis of Forensic DNA Samples Using MPS SNP Profiles
Pub Date : 2019-09-01 DOI: 10.1109/HPEC.2019.8916521
D. Ricke, James Watkins, Philip Fremont-Smith, Adam Michaleas
Massively parallel sequencing (MPS) of large single nucleotide polymorphism (SNP) panels enables identification, analysis of complex DNA mixture samples, and extended kinship predictions. Computational challenges related to SNP allele calling, probability of random man not excluded calculations, and both reference and complex mixture sample comparisons to tens of millions of reference profiles were encountered and resolved when scaling up from thousands to tens of thousands of SNP loci. An MPS SNP analysis pipeline is described for rapid analysis of forensic deoxyribonucleic acid (DNA) samples for thousands to tens of thousands of SNP loci against tens of millions of reference profiles. This pipeline is part of the MIT Lincoln Laboratory (MITLL) IdPrism advanced DNA forensic system.
Citations: 4