
Latest publications from SC15: International Conference for High Performance Computing, Networking, Storage and Analysis

Node variability in large-scale power measurements: perspectives from the Green500, Top500 and EEHPCWG
T. Scogland, Jonathan J. Azose, D. Rohr, Suzanne Rivoire, Natalie J. Bates, D. Hackenberg
The last decade has seen power consumption move from an afterthought to the foremost design constraint of new supercomputers. Measuring the power of a supercomputer can be a daunting proposition, and as a result, many published measurements are extrapolated. This paper explores the validity of these extrapolations in the context of inter-node power variability and power variations over time within a run. We characterize power variability across nodes in systems at eight supercomputer centers across the globe. This characterization shows that the current requirement for measurements submitted to the Green500 and others is insufficient, allowing variations of up to 20% due to measurement timing and a further 10--15% due to insufficient sample sizes. This paper proposes new power and energy measurement requirements for supercomputers, some of which have been accepted for use by the Green500 and Top500, to ensure consistent accuracy.
Citations: 20
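To make the sampling argument concrete, here is a minimal Monte Carlo sketch (not from the paper; the node count, per-node mean, and spread are illustrative assumptions) of how extrapolating total system power from a small node sample inflates the error:

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes = 10_000                                  # hypothetical system size
node_power = rng.normal(300.0, 30.0, n_nodes)     # W per node, ~10% spread (assumed)
total = node_power.sum()

for sample_size in (10, 100, 1000, n_nodes):
    errs = []
    for _ in range(1000):
        # Extrapolate: mean of a measured subset, scaled to the full system.
        sample = rng.choice(node_power, size=sample_size, replace=False)
        errs.append(abs(sample.mean() * n_nodes - total) / total)
    print(f"sample={sample_size:>6}  mean rel. error={np.mean(errs):.3%}  "
          f"95th pct={np.percentile(errs, 95):.3%}")
```

The 95th-percentile error shrinks roughly with the square root of the sample size, which is why a minimum-sample-size requirement matters for submitted measurements.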
Local recovery and failure masking for stencil-based applications at extreme scales
Marc Gamell, K. Teranishi, M. Heroux, J. Mayo, H. Kolla, Jacqueline H. Chen, M. Parashar
Application resilience is a key challenge that has to be addressed to realize the exascale vision. Online recovery, even when it involves all processes, can dramatically reduce the overhead of failures as compared to the more traditional approach where the job is terminated and restarted from the last checkpoint. In this paper we explore how local recovery can be used for certain classes of applications to further reduce overheads due to resilience. Specifically we develop programming support and scalable runtime mechanisms to enable online and transparent local recovery for stencil-based parallel applications on current leadership class systems. We also show how multiple independent failures can be masked to effectively reduce the impact on the total time to solution. We integrate these mechanisms with the S3D combustion simulation, and experimentally demonstrate (using the Titan Cray-XK7 system at ORNL) the ability to tolerate high failure rates (i.e., node failures every 5 seconds) with low overhead while sustaining performance, at scales up to 262144 cores.
Citations: 36
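As a rough illustration of the local-recovery idea (a drastic simplification of the paper's runtime; the failure model, block layout, and recovery path are all assumed), the sketch below redoes only the failed rank's block from the surviving previous-step state instead of rolling every rank back to a global checkpoint:

```python
import numpy as np

# Toy 1D Jacobi stencil over "rank" blocks with local recovery: when one rank
# fails mid-step, only its block is recomputed from the step-(t-1) state
# (which survives via that rank's local checkpoint); the other ranks' data
# is untouched. The real runtime overlaps this with continued progress.
n, ranks, steps = 64, 4, 50
u = np.sin(np.linspace(0.0, np.pi, n))
blocks = np.array_split(np.arange(1, n - 1), ranks)   # interior points per rank

rng = np.random.default_rng(1)
for step in range(steps):
    u_new = u.copy()
    for r in range(ranks):
        i = blocks[r]
        u_new[i] = 0.5 * (u[i - 1] + u[i + 1])        # each rank updates its block
    if rng.random() < 0.1:                            # inject a node failure
        victim = rng.integers(ranks)
        i = blocks[victim]
        u_new[i] = 0.5 * (u[i - 1] + u[i + 1])        # local recovery: redo one block
    u = u_new
print("final max:", u.max())
```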
Cost-effective diameter-two topologies: analysis and evaluation
G. Kathareios, C. Minkenberg, B. Prisacari, G. Rodríguez, T. Hoefler
HPC network topology design is currently shifting from high-performance, higher-cost Fat-Trees to more cost-effective architectures. Three diameter-two designs, the Slim Fly, Multi-Layer Full-Mesh, and Two-Level Orthogonal Fat-Tree excel in this, exhibiting a cost per endpoint of only 2 links and 3 router ports with lower end-to-end latency and higher scalability than traditional networks of the same total cost. However, other than for the Slim Fly, there is currently no clear understanding of the performance and routing of these emerging topologies. For each network, we discuss minimal, indirect random, and adaptive routing algorithms along with deadlock-avoidance mechanisms. Using these, we evaluate the performance of a series of representative workloads, from global uniform and worst-case traffic to the all-to-all and near-neighbor exchange patterns prevalent in HPC applications. We show that while all three topologies have similar performance, OFTs scale to twice as many endpoints at the same cost as the others.
Citations: 51
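The diameter-two property is easy to check mechanically. The sketch below builds a toy two-level leaf/spine graph (not one of the paper's three constructions; the sizes are arbitrary) and verifies by breadth-first search that every router pair is at most two hops apart:

```python
from collections import deque
from itertools import combinations

leaves, spines = 6, 3
adj = {f"L{i}": set() for i in range(leaves)}
adj.update({f"S{j}": set() for j in range(spines)})
for i in range(leaves):
    for j in range(spines):
        adj[f"L{i}"].add(f"S{j}")
        adj[f"S{j}"].add(f"L{i}")

def dist(a, b):
    # Breadth-first search distance between routers a and b.
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        v, d = queue.popleft()
        if v == b:
            return d
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                queue.append((w, d + 1))

diameter = max(dist(a, b) for a, b in combinations(adj, 2))
print("router-to-router diameter:", diameter)   # 2 by construction
```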
A parallel connectivity algorithm for de Bruijn graphs in metagenomic applications
P. Flick, Chirag Jain, Tony Pan, S. Aluru
Dramatic advances in DNA sequencing technology have made it possible to study microbial environments by direct sequencing of environmental DNA samples. Yet, due to the huge volume and high data complexity, current de novo assemblers cannot handle large metagenomic datasets or fail to perform assembly with acceptable quality. This paper presents the first parallel solution for decomposing the metagenomic assembly problem without compromising the post-assembly quality. We transform this problem into that of finding weakly connected components in the de Bruijn graph. We propose a novel distributed memory algorithm to identify the connected subgraphs, and present strategies to minimize the communication volume. We demonstrate the scalability of our algorithm on a soil metagenome dataset with 1.8 billion reads. Our approach achieves a runtime of 22 minutes using 1280 Intel Xeon cores for a 421 GB uncompressed FASTQ dataset. Moreover, our solution is generalizable to finding connected components in arbitrary undirected graphs.
Citations: 26
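A serial union-find sketch of the weakly-connected-components formulation (the paper's contribution is a distributed-memory algorithm; the reads and k here are toy values): consecutive k-mers of each read are linked by an edge, and components are read off from the union-find roots:

```python
from collections import defaultdict

def kmers(read, k):
    return [read[i:i + k] for i in range(len(read) - k + 1)]

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]      # path halving
        x = parent[x]
    return x

def union(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[ra] = rb

reads, k = ["ACGTAC", "GTACGG", "TTTTAA"], 4
for read in reads:
    ks = kmers(read, k)
    for a, b in zip(ks, ks[1:]):           # consecutive k-mers overlap by k-1
        union(a, b)

components = defaultdict(list)
for node in parent:
    components[find(node)].append(node)
print(len(components), "weakly connected components")   # 2 for these reads
```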
ELF: maximizing memory-level parallelism for GPUs with coordinated warp and fetch scheduling
Jason Jong Kyu Park, Yongjun Park, S. Mahlke
Graphics processing units (GPUs) are increasingly utilized as throughput engines in modern computer systems. GPUs rely on fast context switching between thousands of threads to hide long-latency operations; however, they still stall on memory operations. To minimize these stalls, memory operations should be overlapped with other operations as much as possible to maximize memory-level parallelism (MLP). In this paper, we propose Earliest Load First (ELF) warp scheduling, which maximizes MLP by giving higher priority to the warps that have the fewest instructions remaining before their next memory load. ELF utilizes the same warp priority for fetch scheduling so that both are coordinated. We also show that ELF reveals its full benefits when there are fewer memory conflicts and fetch stalls. Evaluations show that ELF can improve performance by 4.1%, and achieve a total improvement of 11.9% when used with other techniques, over the commonly used greedy-then-oldest scheduling.
Citations: 11
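A toy, cycle-by-cycle sketch of the ELF selection rule (the instruction streams, warp count, and single-issue model are invented for illustration and are far simpler than a real GPU pipeline):

```python
warps = {
    0: list("AAALAA"),   # 'A' = ALU op, 'L' = memory load
    1: list("ALAAAA"),
    2: list("AAAAAL"),
}
pc = {w: 0 for w in warps}

def dist_to_next_load(w):
    rest = warps[w][pc[w]:]
    return rest.index("L") if "L" in rest else len(rest)

trace = []
while any(pc[w] < len(warps[w]) for w in warps):
    ready = [w for w in warps if pc[w] < len(warps[w])]
    w = min(ready, key=dist_to_next_load)     # ELF: closest upcoming load wins
    trace.append((w, warps[w][pc[w]]))
    pc[w] += 1
print(trace)   # warp 1 reaches its load first, exposing its miss earliest
```

Issuing warp 1 first puts its load in flight at the earliest possible cycle, so its memory latency overlaps with the ALU work of warps 0 and 2; that overlap is the MLP the scheduler is chasing.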
An extreme-scale implicit solver for complex PDEs: highly heterogeneous flow in earth's mantle
J. Rudi, A. Malossi, T. Isaac, G. Stadler, M. Gurnis, P. Staar, Y. Ineichen, C. Bekas, A. Curioni, O. Ghattas
Mantle convection is the fundamental physical process within earth's interior responsible for the thermal and geological evolution of the planet, including plate tectonics. The mantle is modeled as a viscous, incompressible, non-Newtonian fluid. The wide range of spatial scales, extreme variability and anisotropy in material properties, and severely nonlinear rheology have made global mantle convection modeling with realistic parameters prohibitive. Here we present a new implicit solver that exhibits optimal algorithmic performance and is capable of extreme scaling for hard PDE problems, such as mantle convection. To maximize accuracy and minimize runtime, the solver incorporates a number of advances, including aggressive multi-octree adaptivity, mixed continuous-discontinuous discretization, arbitrarily-high-order accuracy, hybrid spectral/geometric/algebraic multigrid, and novel Schur-complement preconditioning. These features present enormous challenges for extreme scalability. We demonstrate that---contrary to conventional wisdom---algorithmically optimal implicit solvers can be designed that scale out to 1.5 million cores for severely nonlinear, ill-conditioned, heterogeneous, and anisotropic PDEs.
Citations: 147
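The Schur-complement idea at the heart of such saddle-point solvers fits in a few lines. This dense numpy sketch eliminates the velocity block of a Stokes-like system [[A, B^T], [B, 0]]; the paper's solver instead applies multigrid-approximated inverses matrix-free at extreme scale:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 3
A = rng.random((n, n))
A = A @ A.T + n * np.eye(n)        # SPD "viscous" block
B = rng.random((m, n))             # divergence-like constraint block
f, g = rng.random(n), rng.random(m)

# Solve [[A, B^T], [B, 0]] [u; p] = [f; g] by block elimination.
S = B @ np.linalg.solve(A, B.T)                          # Schur complement
p = np.linalg.solve(S, B @ np.linalg.solve(A, f) - g)    # pressure solve
u = np.linalg.solve(A, f - B.T @ p)                      # velocity back-solve

print(np.linalg.norm(A @ u + B.T @ p - f), np.linalg.norm(B @ u - g))
```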
STS-k: a multilevel sparse triangular solution scheme for NUMA multicores
H. Kabir, J. Booth, G. Aupy, A. Benoit, Y. Robert, P. Raghavan
We consider techniques to improve the performance of parallel sparse triangular solution on non-uniform memory architecture (NUMA) multicores by extending earlier coloring and level-set schemes for single-core multiprocessors. We develop STS-k, where k represents a small number of transformations for latency reduction from increased spatial and temporal locality of data accesses. We propose a graph model of data reuse to inform the development of STS-k and to prove that computing an optimal cost schedule is NP-complete. We observe significant speed-ups with STS-3 on 32-core Intel Westmere-EX and 24-core AMD 'Magny-Cours' processors. Incremental gains solely from the 3-level transformations in STS-3 for a fixed ordering correspond to reductions in execution times by factors of 1.4 (Intel) and 1.5 (AMD) for level sets, and 2 (Intel) and 2.2 (AMD) for coloring. On average, execution times are reduced by a factor of 6 (Intel) and 4 (AMD) for STS-3 with coloring compared to a reference implementation using level sets.
Citations: 19
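For reference, the classic level-set scheme that STS-k extends can be sketched directly: rows of the triangular factor are grouped into levels with no mutual dependencies, and each level could be solved in parallel (the matrix below is a toy example, not from the paper):

```python
import numpy as np

# Level-set scheduling for a sparse lower-triangular solve Lx = b.
L = np.array([[2., 0, 0, 0],
              [1., 2, 0, 0],
              [0., 0, 2, 0],
              [1., 0, 1, 2]])
b = np.array([2., 4, 2, 8])

n = len(b)
level = np.zeros(n, dtype=int)
for i in range(n):
    deps = [j for j in range(i) if L[i, j] != 0]
    level[i] = 1 + max((level[j] for j in deps), default=-1)

x = np.zeros(n)
for lvl in range(level.max() + 1):
    rows = np.flatnonzero(level == lvl)   # independent rows: parallel in practice
    for i in rows:
        x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
print("levels:", level, "x:", x, "check:", np.allclose(L @ x, b))
```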
Massively parallel models of the human circulatory system
A. Randles, E. Draeger, T. Oppelstrup, L. Krauss, John A. Gunnels
The potential impact of blood flow simulations on the diagnosis and treatment of patients suffering from vascular disease is tremendous. Empowering models of the full arterial tree can provide insight into diseases such as arterial hypertension and enables the study of the influence of local factors on global hemodynamics. We present a new, highly scalable implementation of the lattice Boltzmann method which addresses key challenges such as multiscale coupling, limited memory capacity and bandwidth, and robust load balancing in complex geometries. We demonstrate the strong scaling of a three-dimensional, high-resolution simulation of hemodynamics in the systemic arterial tree on 1,572,864 cores of Blue Gene/Q. Faster calculation of flow in full arterial networks enables unprecedented risk stratification on a perpatient basis. In pursuit of this goal, we have introduced computational advances that significantly reduce time-to-solution for biofluidic simulations.
Citations: 56
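For a flavor of the underlying numerical method, here is a minimal single-step D2Q9 lattice Boltzmann sketch (BGK collision plus periodic streaming on a tiny grid; the grid size, relaxation time, and perturbation are arbitrary, and this omits everything that makes the paper's arterial solver hard):

```python
import numpy as np

nx = ny = 8
c = np.array([[0,0],[1,0],[0,1],[-1,0],[0,-1],[1,1],[-1,1],[-1,-1],[1,-1]])
w = np.array([4/9] + [1/9]*4 + [1/36]*4)
f = np.ones((9, nx, ny)) * w[:, None, None]      # start at rest
f[1] += 0.01                                     # small perturbation
tau = 0.6                                        # relaxation time

rho = f.sum(axis=0)                              # density
u = np.einsum('qi,qxy->ixy', c, f) / rho         # macroscopic velocity
cu = np.einsum('qi,ixy->qxy', c, u)
usq = (u**2).sum(axis=0)
feq = rho * w[:, None, None] * (1 + 3*cu + 4.5*cu**2 - 1.5*usq)
f += -(f - feq) / tau                            # BGK collision
for q in range(9):                               # periodic streaming
    f[q] = np.roll(np.roll(f[q], c[q, 0], axis=0), c[q, 1], axis=1)
print("mass conserved:", np.isclose(f.sum(), rho.sum()))
```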
High-performance algebraic multigrid solver optimized for multi-core based distributed parallel systems
Jongsoo Park, M. Smelyanskiy, U. Yang, Dheevatsa Mudigere, P. Dubey
Algebraic Multigrid (AMG) is a linear solver, well known for its linear computational complexity and excellent parallelization scalability. As a result, AMG is expected to be a solver of choice for emerging extreme-scale systems capable of delivering hundreds of Pflops and beyond. While node-level performance of AMG is generally limited by memory bandwidth, achieving high bandwidth efficiency is challenging due to highly sparse irregular computation, such as triple sparse matrix products, sparse-matrix dense-vector multiplications, independent-set coarsening algorithms, and smoothers such as Gauss-Seidel. We develop and analyze a highly optimized AMG implementation based on the well-known HYPRE library. Compared to the HYPRE baseline implementation, our optimized implementation achieves a 2.0x speedup on a recent Intel® Xeon® Haswell processor. Combined with our other multi-node optimizations, this translates into similarly high speedups when weak-scaled to multiple nodes. In addition, our implementation achieves a 1.3x speedup compared to AmgX, NVIDIA's high-performance implementation of AMG, running on a K40c.
Citations: 24
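The two-grid correction cycle that AMG generalizes is compact enough to sketch directly. The code below applies weighted-Jacobi smoothing plus a Galerkin coarse-grid correction to a 1D Poisson system (the geometric interpolation stencil and model problem are assumed for illustration; HYPRE's actual setup builds coarse grids algebraically):

```python
import numpy as np

n = 63                                        # fine grid (1D Poisson, Dirichlet)
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)

nc = n // 2                                   # coarse grid
P = np.zeros((n, nc))                         # linear-interpolation prolongation
for j in range(nc):
    i = 2 * j + 1
    P[i - 1, j], P[i, j], P[i + 1, j] = 0.5, 1.0, 0.5
R = 0.5 * P.T                                 # full-weighting restriction
Ac = R @ A @ P                                # Galerkin coarse operator

def smooth(x, sweeps=3, omega=2 / 3):         # weighted Jacobi
    d = np.diag(A)
    for _ in range(sweeps):
        x = x + omega * (b - A @ x) / d
    return x

x = np.zeros(n)
for _ in range(10):
    x = smooth(x)                                         # pre-smooth
    x = x + P @ np.linalg.solve(Ac, R @ (b - A @ x))      # coarse correction
    x = smooth(x)                                         # post-smooth
print("residual after 10 cycles:", np.linalg.norm(b - A @ x))
```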
Engineering inhibitory proteins with InSiPS: the in-silico protein synthesizer
Andrew Schoenrock, Daniel J. Burnside, H. Moteshareie, A. Wong, A. Golshani, F. Dehne
Engineered proteins are synthetic novel proteins (not found in nature) that are designed to fulfill a predetermined biological function. Such proteins can be used as molecular markers, inhibitory agents, or drugs. For example, a synthetic protein could bind to a critical protein of a pathogen, thereby inhibiting the function of the target protein and potentially reducing the impact of the pathogen. In this paper we present the In-Silico Protein Synthesizer (InSiPS), a massively parallel computational tool for the IBM Blue Gene/Q that is aimed at designing inhibitory proteins. More precisely, InSiPS designs proteins that are predicted to interact with a given target protein (and may inhibit the target's cellular functions) while leaving non-target proteins unaffected (to minimize side effects). As proof of concept, two InSiPS-designed proteins have been synthesized in the lab and their inhibitory properties have been verified through wet-lab experimentation.
Citations: 5
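A heavily simplified sketch of this kind of design loop: a hill-climbing search over peptide sequences that rewards a stand-in "binding" score against the target and penalizes the best score against the non-targets. The scoring function (shared 3-mer counts) and all sequences are placeholders, not InSiPS's actual predictor or search:

```python
import random

AA = "ACDEFGHIKLMNPQRSTVWY"
target      = "MKLVFFAEDVGSNKGAIIGLM"
non_targets = ["MSTNPKPQRKTKRNTNRRPQD", "MAHHHHHHVGTGSNKGAIIGL"]

def score(seq, prot):
    # Placeholder "interaction" score: count of 3-mers of seq found in prot.
    return sum(seq[i:i + 3] in prot for i in range(len(seq) - 2))

def fitness(seq):
    # Favor binding the target, penalize binding any non-target.
    return score(seq, target) - max(score(seq, p) for p in non_targets)

random.seed(0)
best = "".join(random.choice(AA) for _ in range(12))
for _ in range(20000):
    cand = list(best)
    cand[random.randrange(len(cand))] = random.choice(AA)   # point mutation
    cand = "".join(cand)
    if fitness(cand) >= fitness(best):
        best = cand
print(best, fitness(best))
```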