
Latest publications from SC15: International Conference for High Performance Computing, Networking, Storage and Analysis

Optimal scheduling of in-situ analysis for large-scale scientific simulations
Preeti Malakar, V. Vishwanath, T. Munson, Christopher Knight, M. Hereld, S. Leyffer, M. Papka
Today's leadership computing facilities have enabled the execution of transformative simulations at unprecedented scales. However, analyzing the huge amount of output from these simulations remains a challenge. Most analyses of this output are performed in post-processing mode at the end of the simulation. The time to read the output for analysis can be significant due to poor I/O bandwidth, which increases the end-to-end simulation-analysis time. Simulation-time analysis can reduce this end-to-end time. In this work, we present the scheduling of in-situ analysis as a numerical optimization problem to maximize the number of online analyses subject to resource constraints such as I/O bandwidth, network bandwidth, rate of computation and available memory. We demonstrate the effectiveness of our approach through two application case studies on the IBM Blue Gene/Q system.
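The optimization problem described in the abstract can be sketched compactly. The variable names and constraint forms below are illustrative assumptions, not the authors' exact model: binary variables x_t mark the simulation steps at which an online analysis runs, the objective counts them, and each constraint bounds what an analysis may add to a step.

```latex
\begin{align*}
\max_{x \in \{0,1\}^{T}} \quad & \sum_{t=1}^{T} x_t \\
\text{s.t.} \quad & c_{\text{sim}} + x_t\, c_{\text{ana}} \le C && \text{(compute time budget per step)} \\
 & x_t\, m_{\text{ana}} \le M_{\text{free}} && \text{(memory available for analysis)} \\
 & x_t\, d_{\text{out}} \le B_{\text{io}}\, \tau && \text{(I/O bandwidth per step)} \\
 & x_t\, d_{\text{net}} \le B_{\text{net}}\, \tau && \text{(network bandwidth per step)}
\end{align*}
```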
{"title":"Optimal scheduling of in-situ analysis for large-scale scientific simulations","authors":"Preeti Malakar, V. Vishwanath, T. Munson, Christopher Knight, M. Hereld, S. Leyffer, M. Papka","doi":"10.1145/2807591.2807656","DOIUrl":"https://doi.org/10.1145/2807591.2807656","url":null,"abstract":"Today's leadership computing facilities have enabled the execution of transformative simulations at unprecedented scales. However, analyzing the huge amount of output from these simulations remains a challenge. Most analyses of this output is performed in post-processing mode at the end of the simulation. The time to read the output for the analysis can be significantly high due to poor I/O bandwidth, which increases the end-to-end simulation-analysis time. Simulation-time analysis can reduce this end-to-end time. In this work, we present the scheduling of in-situ analysis as a numerical optimization problem to maximize the number of online analyses subject to resource constraints such as I/O bandwidth, network bandwidth, rate of computation and available memory. We demonstrate the effectiveness of our approach through two application case studies on the IBM Blue Gene/Q system.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"324 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114010915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 38
Parallel distributed memory construction of suffix and longest common prefix arrays
P. Flick, S. Aluru
Suffix arrays and trees are fundamental string data structures of importance to many applications in computational biology. Consequently, their parallel construction is an actively studied problem. To date, algorithms with best practical performance lack efficient worst-case run-time guarantees, and vice versa. In addition, much of the recent work targeted low core count, shared memory parallelization. In this paper, we present parallel algorithms for distributed memory construction of suffix arrays and longest common prefix (LCP) arrays that simultaneously achieve good worst-case run-time bounds and superior practical performance. Our algorithms run in O(Tsort(n, p) · log n) worst-case time where Tsort(n, p) is the run-time of parallel sorting. We present several algorithm engineering techniques that improve performance in practice. We demonstrate the construction of suffix and LCP arrays of the human genome in less than 8 seconds on 1,024 Intel Xeon cores, reaching speedups of over 110X compared to the best sequential suffix array construction implementation divsufsort.
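The stated O(Tsort(n, p) · log n) bound has the shape of prefix doubling: sort suffixes by their first 2^k characters using the ranks from the previous round, doubling k each time. Below is a sequential Python sketch of that pattern, plus Kasai's LCP construction, offered as an illustration of the general technique rather than the authors' distributed algorithm.

```python
def suffix_array(s: str) -> list[int]:
    """Prefix-doubling suffix array construction: O(n log^2 n) sequentially."""
    n = len(s)
    sa = list(range(n))
    rank = [ord(c) for c in s]          # initial rank: first character
    k = 1
    while k < n:
        # Sort suffixes by (rank of first k chars, rank of the next k chars).
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        # Re-rank: suffixes with equal keys share a rank.
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:       # all ranks distinct: fully sorted
            break
        k *= 2
    return sa

def lcp_array(s: str, sa: list[int]) -> list[int]:
    """Kasai's algorithm: lcp[r] = LCP of sa[r] with sa[r-1]; lcp[0] = 0."""
    n = len(s)
    inv = [0] * n
    for r, suf in enumerate(sa):
        inv[suf] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if inv[i] == 0:                 # no predecessor in suffix-array order
            h = 0
            continue
        j = sa[inv[i] - 1]
        while i + h < n and j + h < n and s[i + h] == s[j + h]:
            h += 1
        lcp[inv[i]] = h
        h = max(h - 1, 0)
    return lcp

# print(suffix_array("banana"))                       # [5, 3, 1, 0, 4, 2]
# print(lcp_array("banana", suffix_array("banana")))  # [0, 1, 3, 0, 0, 2]
```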
{"title":"Parallel distributed memory construction of suffix and longest common prefix arrays","authors":"P. Flick, S. Aluru","doi":"10.1145/2807591.2807609","DOIUrl":"https://doi.org/10.1145/2807591.2807609","url":null,"abstract":"Suffix arrays and trees are fundamental string data structures of importance to many applications in computational biology. Consequently, their parallel construction is an actively studied problem. To date, algorithms with best practical performance lack efficient worst-case run-time guarantees, and vice versa. In addition, much of the recent work targeted low core count, shared memory parallelization. In this paper, we present parallel algorithms for distributed memory construction of suffix arrays and longest common prefix (LCP) arrays that simultaneously achieve good worst-case run-time bounds and superior practical performance. Our algorithms run in O(Tsort(n, p) · log n) worst-case time where Tsort(n, p) is the run-time of parallel sorting. We present several algorithm engineering techniques that improve performance in practice. We demonstrate the construction of suffix and LCP arrays of the human genome in less than 8 seconds on 1,024 Intel Xeon cores, reaching speedups of over 110X compared to the best sequential suffix array construction implementation divsufsort.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"105 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122391513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 33
Scaling iterative graph computations with GraphMap
Kisung Lee, Ling Liu, K. Schwan, C. Pu, Qi Zhang, Yang Zhou, Emre Yigitoglu, Pingpeng Yuan
In recent years, systems researchers have devoted considerable effort to the study of large-scale graph processing. Existing distributed graph processing systems such as Pregel, based solely on distributed memory for their computations, fail to provide seamless scalability when the graph data and their intermediate computational results no longer fit into the memory; and most distributed approaches for iterative graph computations do not consider utilizing secondary storage a viable solution. This paper presents GraphMap, a distributed iterative graph computation framework that maximizes access locality and speeds up distributed iterative graph computations by effectively utilizing secondary storage. GraphMap has three salient features: (1) It distinguishes data states that are mutable during iterative computations from those that are read-only in all iterations to maximize sequential access and minimize random access. (2) It entails a two-level graph partitioning algorithm that enables balanced workloads and locality-optimized data placement. (3) It contains a proposed suite of locality-based optimizations that improve computational efficiency. Extensive experiments on several real-world graphs show that GraphMap outperforms existing distributed memory-based systems for various iterative graph algorithms.
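A minimal single-node sketch of the core idea: the read-only graph structure is written once and re-streamed sequentially from secondary storage every iteration, while only the small mutable vertex state stays in memory. The binary layout and the PageRank-style update below are illustrative assumptions, not GraphMap's actual format or algorithm.

```python
import struct

def write_partition(path, adj):
    """Store the read-only adjacency lists once: (src, degree, dst...) records."""
    with open(path, "wb") as f:
        for src, dsts in adj.items():
            f.write(struct.pack("<II", src, len(dsts)))
            f.write(struct.pack(f"<{len(dsts)}I", *dsts))

def pagerank(path, num_vertices, iters=10, d=0.85):
    """Mutable state (rank vectors) stays in memory; edges are re-streamed each pass."""
    rank = [1.0 / num_vertices] * num_vertices
    for _ in range(iters):
        nxt = [(1.0 - d) / num_vertices] * num_vertices
        with open(path, "rb") as f:                  # sequential scan, no random I/O
            while header := f.read(8):
                src, deg = struct.unpack("<II", header)
                dsts = struct.unpack(f"<{deg}I", f.read(4 * deg))
                share = d * rank[src] / max(deg, 1)
                for dst in dsts:
                    nxt[dst] += share
        rank = nxt
    return rank

adj = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
write_partition("/tmp/part0.bin", adj)
print(pagerank("/tmp/part0.bin", num_vertices=4))
```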
{"title":"Scaling iterative graph computations with GraphMap","authors":"Kisung Lee, Ling Liu, K. Schwan, C. Pu, Qi Zhang, Yang Zhou, Emre Yigitoglu, Pingpeng Yuan","doi":"10.1145/2807591.2807604","DOIUrl":"https://doi.org/10.1145/2807591.2807604","url":null,"abstract":"In recent years, systems researchers have devoted considerable effort to the study of large-scale graph processing. Existing distributed graph processing systems such as Pregel, based solely on distributed memory for their computations, fail to provide seamless scalability when the graph data and their intermediate computational results no longer fit into the memory; and most distributed approaches for iterative graph computations do not consider utilizing secondary storage a viable solution. This paper presents GraphMap, a distributed iterative graph computation framework that maximizes access locality and speeds up distributed iterative graph computations by effectively utilizing secondary storage. GraphMap has three salient features: (1) It distinguishes data states that are mutable during iterative computations from those that are read-only in all iterations to maximize sequential access and minimize random access. (2) It entails a two-level graph partitioning algorithm that enables balanced workloads and locality-optimized data placement. (3) It contains a proposed suite of locality-based optimizations that improve computational efficiency. Extensive experiments on several real-world graphs show that GraphMap outperforms existing distributed memory-based systems for various iterative graph algorithms.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131863946","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 32
Regent: a high-productivity programming language for HPC with logical regions
Elliott Slaughter, Wonchan Lee, Sean Treichler, Michael A. Bauer, A. Aiken
We present Regent, a high-productivity programming language for high performance computing with logical regions. Regent users compose programs with tasks (functions eligible for parallel execution) and logical regions (hierarchical collections of structured objects). Regent programs appear to execute sequentially, require no explicit synchronization, and are trivially deadlock-free. Regent's type system catches many common classes of mistakes and guarantees that a program with correct serial execution produces identical results on parallel and distributed machines. We present an optimizing compiler for Regent that translates Regent programs into efficient implementations for Legion, an asynchronous task-based model. Regent employs several novel compiler optimizations to minimize the dynamic overhead of the runtime system and enable efficient operation. We evaluate Regent on three benchmark applications and demonstrate that Regent achieves performance comparable to hand-tuned Legion.
{"title":"Regent: a high-productivity programming language for HPC with logical regions","authors":"Elliott Slaughter, Wonchan Lee, Sean Treichler, Michael A. Bauer, A. Aiken","doi":"10.1145/2807591.2807629","DOIUrl":"https://doi.org/10.1145/2807591.2807629","url":null,"abstract":"We present Regent, a high-productivity programming language for high performance computing with logical regions. Regent users compose programs with tasks (functions eligible for parallel execution) and logical regions (hierarchical collections of structured objects). Regent programs appear to execute sequentially, require no explicit synchronization, and are trivially deadlock-free. Regent's type system catches many common classes of mistakes and guarantees that a program with correct serial execution produces identical results on parallel and distributed machines. We present an optimizing compiler for Regent that translates Regent programs into efficient implementations for Legion, an asynchronous task-based model. Regent employs several novel compiler optimizations to minimize the dynamic overhead of the runtime system and enable efficient operation. We evaluate Regent on three benchmark applications and demonstrate that Regent achieves performance comparable to hand-tuned Legion.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134069777","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 107
Dynamic power sharing for higher job throughput
D. Ellsworth, A. Malony, B. Rountree, M. Schulz
Current trends for high-performance systems are leading towards hardware overprovisioning where it is no longer possible to run all components at peak power without exceeding a system- or facility-wide power bound. The standard practice of static power scheduling is likely to lead to inefficiencies with over- and under-provisioning of power to components at runtime. In this paper we investigate the performance and scalability of an application agnostic runtime power scheduler (POWsched) that is capable of enforcing a system-wide power limit. Our experimental results show POWsched is robust, has negligible overhead, and can take advantage of opportunities to shift wasted power to more power-intensive applications, improving overall workload runtime by as much as 14% without job scheduler integration or application specific profiling. In addition, we conduct scalability studies to determine POWsched's overhead for large node counts. Lastly, we contribute a model and simulator (POWsim) for investigating dynamic power scheduling behavior and enforcement at scale.
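The abstract's central move is shifting unused power from nodes drawing well below their cap to nodes pinned at theirs, while keeping the total under a system-wide bound. A minimal sketch of one such reallocation pass follows; the measurement fields, thresholds, and the place where a hardware cap (e.g. a RAPL-style limit) would be written are all placeholders, not POWsched's actual policy.

```python
def rebalance(nodes, system_bound, floor, step):
    """One pass of an application-oblivious power scheduler (illustrative only).

    nodes: list of dicts with 'cap' (current limit, W) and 'draw' (measured W).
    Under-consuming nodes donate headroom; throttled nodes receive it, subject
    to the system-wide bound and a per-node floor.
    """
    pool = 0.0
    for n in nodes:
        unused = n["cap"] - n["draw"]
        if unused > step and n["cap"] - step >= floor:
            n["cap"] -= step                    # reclaim headroom from under-consumers
            pool += step
    pool += max(system_bound - sum(n["cap"] for n in nodes), 0.0)
    for n in nodes:
        if pool <= 0.0:
            break
        if n["draw"] >= n["cap"] - 1.0:         # throttled: wants more power
            grant = min(step, pool)
            n["cap"] += grant                   # in practice: write the hardware cap here
            pool -= grant
    return nodes

nodes = [{"cap": 100.0, "draw": 60.0}, {"cap": 100.0, "draw": 99.5}]
print(rebalance(nodes, system_bound=210.0, floor=50.0, step=10.0))
```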
{"title":"Dynamic power sharing for higher job throughput","authors":"D. Ellsworth, A. Malony, B. Rountree, M. Schulz","doi":"10.1145/2807591.2807643","DOIUrl":"https://doi.org/10.1145/2807591.2807643","url":null,"abstract":"Current trends for high-performance systems are leading towards hardware overprovisioning where it is no longer possible to run all components at peak power without exceeding a system- or facility-wide power bound. The standard practice of static power scheduling is likely to lead to inefficiencies with over- and under-provisioning of power to components at runtime. In this paper we investigate the performance and scalability of an application agnostic runtime power scheduler (POWsched) that is capable of enforcing a system-wide power limit. Our experimental results show POWsched is robust, has negligible overhead, and can take advantage of opportunities to shift wasted power to more power-intensive applications, improving overall workload runtime by as much as 14% without job scheduler integration or application specific profiling. In addition, we conduct scalability studies to determine POWsched's overhead for large node counts. Lastly, we contribute a model and simulator (POWsim) for investigating dynamic power scheduling behavior and enforcement at scale.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121619947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 57
Implicit nonlinear wave simulation with 1.08T DOF and 0.270T unstructured finite elements to enhance comprehensive earthquake simulation
T. Ichimura, K. Fujita, P. E. Quinay, Lalith Maddegedara, M. Hori, Seizo Tanaka, Y. Shizawa, Hiroshi Kobayashi, K. Minami
This paper presents a new heroic computing method for unstructured, low-order, finite-element, implicit nonlinear wave simulation: 1.97 PFLOPS (18.6% of peak) was attained on the full K computer when solving a 1.08T degrees-of-freedom (DOF) and 0.270T-element problem. This is 40.1 times more DOF and elements, a 2.68-fold improvement in peak performance, and 3.67 times faster in time-to-solution compared to the SC14 Gordon Bell finalist's state-of-the-art simulation. The method scales up to the full K computer with 663,552 CPU cores with 96.6% sizeup efficiency, enabling solving of a 1.08T DOF problem in 29.7 s per time step. Using such heroic computing, we solved a practical problem involving an area 23.7 times larger than the state-of-the-art, and conducted a comprehensive earthquake simulation by combining earthquake wave propagation analysis and evacuation analysis. Application at such scale is a groundbreaking accomplishment and is expected to change the quality of earthquake disaster estimation and contribute to society.
{"title":"Implicit nonlinear wave simulation with 1.08T DOF and 0.270T unstructured finite elements to enhance comprehensive earthquake simulation","authors":"T. Ichimura, K. Fujita, P. E. Quinay, Lalith Maddegedara, M. Hori, Seizo Tanaka, Y. Shizawa, Hiroshi Kobayashi, K. Minami","doi":"10.1145/2807591.2807674","DOIUrl":"https://doi.org/10.1145/2807591.2807674","url":null,"abstract":"This paper presents a new heroic computing method for unstructured, low-order, finite-element, implicit nonlinear wave simulation: 1.97 PFLOPS (18.6% of peak) was attained on the full K computer when solving a 1.08T degrees-of-freedom (DOF) and 0.270T-element problem. This is 40.1 times more DOF and elements, a 2.68-fold improvement in peak performance, and 3.67 times faster in time-to-solution compared to the SC14 Gordon Bell finalist's state-of-the-art simulation. The method scales up to the full K computer with 663,552 CPU cores with 96.6% sizeup efficiency, enabling solving of a 1.08T DOF problem in 29.7 s per time step. Using such heroic computing, we solved a practical problem involving an area 23.7 times larger than the state-of-the-art, and conducted a comprehensive earthquake simulation by combining earthquake wave propagation analysis and evacuation analysis. Application at such scale is a groundbreaking accomplishment and is expected to change the quality of earthquake disaster estimation and contribute to society.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121966518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 61
The Spack package manager: bringing order to HPC software chaos
T. Gamblin, M. LeGendre, M. Collette, Gregory L. Lee, A. Moody, B. Supinski, S. Futral
Large HPC centers spend considerable time supporting software for thousands of users, but the complexity of HPC software is quickly outpacing the capabilities of existing software management tools. Scientific applications require specific versions of compilers, MPI, and other dependency libraries, so using a single, standard software stack is infeasible. However, managing many configurations is difficult because the configuration space is combinatorial in size. We introduce Spack, a tool used at Lawrence Livermore National Laboratory to manage this complexity. Spack provides a novel, recursive specification syntax to invoke parametric builds of packages and dependencies. It allows any number of builds to coexist on the same system, and it ensures that installed packages can find their dependencies, regardless of the environment. We show through real-world use cases that Spack supports diverse and demanding applications, bringing order to HPC software chaos.
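The "recursive specification syntax" mentioned above composes a package, its version, compiler, variants, and constraints on transitive dependencies in a single spec string; the package names and versions below are illustrative choices, not prescriptions.

```
spack install mpileaks@2.3 %gcc@4.7.3 +threads ^mpich@3.0.4
```

Here @ pins a version, % selects the compiler, + enables a build variant, and ^ constrains a dependency; specs nest recursively, so the ^mpich clause could itself carry its own compiler or variants.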
{"title":"The Spack package manager: bringing order to HPC software chaos","authors":"T. Gamblin, M. LeGendre, M. Collette, Gregory L. Lee, A. Moody, B. Supinski, S. Futral","doi":"10.1145/2807591.2807623","DOIUrl":"https://doi.org/10.1145/2807591.2807623","url":null,"abstract":"Large HPC centers spend considerable time supporting software for thousands of users, but the complexity of HPC software is quickly outpacing the capabilities of existing software management tools. Scientific applications require specific versions of compilers, MPI, and other dependency libraries, so using a single, standard software stack is infeasible. However, managing many configurations is difficult because the configuration space is combinatorial in size. We introduce Spack, a tool used at Lawrence Livermore National Laboratory to manage this complexity. Spack provides a novel, recursive specification syntax to invoke parametric builds of packages and dependencies. It allows any number of builds to coexist on the same system, and it ensures that installed packages can find their dependencies, regardless of the environment. We show through real-world use cases that Spack supports diverse and demanding applications, bringing order to HPC software chaos.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128499290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 222
A case for application-oblivious energy-efficient MPI runtime
Akshay Venkatesh, Abhinav Vishnu, Khaled Hamidouche, Nathan R. Tallent, D. Panda, D. Kerbyson, A. Hoisie
Power has become a major impediment in designing large scale high-end systems. Message Passing Interface (MPI) is the de facto communication interface used as the back-end for designing applications, programming models and runtime for these systems. Slack --- the time spent by an MPI process in a single MPI call --- provides a potential for energy and power savings, if an appropriate power reduction technique such as core-idling/Dynamic Voltage and Frequency Scaling (DVFS) can be applied without affecting the application's performance. Existing techniques that exploit slack for power savings assume that application behavior repeats across iterations/executions. However, an increasing use of adaptive and data-dependent workloads combined with system factors (OS noise, congestion) negates this assumption. This paper proposes and implements Energy Aware MPI (EAM) --- an application-oblivious energy-efficient MPI runtime. EAM uses a combination of communication models for common MPI primitives (point-to-point, collective, progress, blocking/non-blocking) and an online observation of slack to maximize energy efficiency and to honor performance degradation limits. Each power lever incurs time overhead, which must be amortized over slack to minimize degradation. When predicted communication time exceeds a lever overhead, the lever is used as soon as possible --- to maximize energy efficiency. When a misprediction occurs, the lever(s) are used automatically at specific intervals for amortization. We implement EAM using MVAPICH2 and evaluate it on ten applications using up to 4,096 processes. Our performance evaluation on an InfiniBand cluster indicates that EAM can reduce energy consumption by 5-41% in comparison to the default approach, which prioritizes performance alone, with negligible (less than 4% in all cases) performance loss.
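The core decision in the abstract is simple: apply a power lever during an MPI call only when the predicted slack exceeds the lever's own transition overhead, so that overhead is amortized. A minimal sketch of that guard follows; the function names, the request object, and the power-state calls are placeholders, not the EAM implementation.

```python
LEVER_OVERHEAD_S = 0.001   # cost of switching a power state (illustrative value)

def eam_wait(request, predict_comm_time, set_low_power, set_high_power):
    """Wrap a blocking MPI-style wait with a slack-aware power lever."""
    predicted = predict_comm_time(request)      # model of expected slack for this call
    lowered = False
    if predicted > LEVER_OVERHEAD_S:            # overhead can be amortized over slack
        set_low_power()                         # e.g. lower core frequency or idle cores
        lowered = True
    request.wait()                              # the actual blocking communication
    if lowered:
        set_high_power()                        # restore before computation resumes
```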
{"title":"A case for application-oblivious energy-efficient MPI runtime","authors":"Akshay Venkatesh, Abhinav Vishnu, Khaled Hamidouche, Nathan R. Tallent, D. Panda, D. Kerbyson, A. Hoisie","doi":"10.1145/2807591.2807658","DOIUrl":"https://doi.org/10.1145/2807591.2807658","url":null,"abstract":"Power has become a major impediment in designing large scale high-end systems. Message Passing Interface (MPI) is the de facto communication interface used as the back-end for designing applications, programming models and runtime for these systems. Slack --- the time spent by an MPI process in a single MPI call---provides a potential for energy and power savings, if an appropriate power reduction technique such as core-idling/Dynamic Voltage and Frequency Scaling (DVFS) can be applied without affecting the application's performance. Existing techniques that exploit slack for power savings assume that application behavior repeats across iterations/executions. However, an increasing use of adaptive and data-dependent workloads combined with system factors (OS noise, congestion) negates this assumption. This paper proposes and implements Energy Aware MPI (EAM) --- an application-oblivious energy-efficient MPI runtime. EAM uses a combination of communication models for common MPI primitives (point-to-point, collective, progress, blocking/non-blocking) and an online observation of slack to maximize energy efficiency and to honor performance degradation limits. Each power lever incurs time overhead, which must be amortized over slack to minimize degradation. When predicted communication time exceeds a lever overhead, the lever is used as soon as possible --- to maximize energy efficiency. When a misprediction occurs, the lever(s) are used automatically at specific intervals for amortization. We implement EAM using MVAPICH2 and evaluate it on ten applications using up to 4,096 processes. Our performance evaluation on an InfiniBand cluster indicates that EAM can reduce energy consumption by 5-41% in comparison to the default approach, which prioritizes performance alone, with negligible (less than 4% in all cases) performance loss.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125069935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 32
Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge leadership computing facility
Devesh Tiwari, Saurabh Gupta, George Gallarno, James H. Rogers, Don E. Maxwell
The high computational capability of graphics processing units (GPUs) is enabling and driving the scientific discovery process at large-scale. The world's second fastest supercomputer for open science, Titan, has more than 18,000 GPUs that computational scientists use to perform scientific simulations and data analysis. Understanding of GPU reliability characteristics, however, is still in its nascent stage since GPUs have only recently been deployed at large-scale. This paper presents a detailed study of GPU errors and their impact on system operations and applications, describing experiences with the 18,688 GPUs on the Titan supercomputer as well as lessons learned in the process of efficient operation of GPUs at scale. These experiences are helpful to HPC sites which already have large-scale GPU clusters or plan to deploy GPUs in the future.
{"title":"Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge leadership computing facility","authors":"Devesh Tiwari, Saurabh Gupta, George Gallarno, James H. Rogers, Don E. Maxwell","doi":"10.1145/2807591.2807666","DOIUrl":"https://doi.org/10.1145/2807591.2807666","url":null,"abstract":"The high computational capability of graphics processing units (GPUs) is enabling and driving the scientific discovery process at large-scale. The world's second fastest supercomputer for open science, Titan, has more than 18,000 GPUs that computational scientists use to perform scientific simulations and data analysis. Understanding of GPU reliability characteristics, however, is still in its nascent stage since GPUs have only recently been deployed at large-scale. This paper presents a detailed study of GPU errors and their impact on system operations and applications, describing experiences with the 18,688 GPUs on the Titan supercomputer as well as lessons learned in the process of efficient operation of GPUs at scale. These experiences are helpful to HPC sites which already have large-scale GPU clusters or plan to deploy GPUs in the future.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125631660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 67
Mantle: a programmable metadata load balancer for the ceph file system
Michael Sevilla, Noah Watkins, C. Maltzahn, I. Nassi, S. Brandt, S. Weil, Greg Farnum, S. Fineberg
Migrating resources is a useful tool for balancing load in a distributed system, but it is difficult to determine when to move resources, where to move resources, and how much of them to move. We look at resource migration for file system metadata and show how CephFS's dynamic subtree partitioning approach can exploit varying degrees of locality and balance because it can partition the namespace into variable sized units. Unfortunately, the current metadata balancer is complicated and difficult to control because it struggles to address many of the general resource migration challenges inherent to the metadata management problem. To help decouple policy from mechanism, we introduce a programmable storage system that lets the designer inject custom balancing logic. We show the flexibility and transparency of this approach by replicating the strategy of a state-of-the-art metadata balancer and conclude by comparing this strategy to other custom balancers on the same system.
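The abstract's key point is decoupling policy from mechanism: the operator injects the logic that decides when to migrate metadata load, where to send it, and how much to move. The hook names and the threshold policy in this Python sketch are illustrative assumptions, not Mantle's actual API (Mantle's policies are injected as code into the metadata balancer).

```python
class ProgrammableBalancer:
    """Skeleton of a metadata balancer whose policy is injected by the operator."""

    def __init__(self, when, where, howmuch):
        self.when, self.where, self.howmuch = when, where, howmuch

    def tick(self, loads):
        """loads: metadata-server id -> current load metric; returns planned migrations."""
        migrations = []
        for src, load in loads.items():
            if not self.when(src, load, loads):       # policy: should src shed load?
                continue
            dst = self.where(src, loads)              # policy: which server receives it?
            amount = self.howmuch(src, dst, loads)    # policy: how much to move?
            if dst is not None and amount > 0:
                migrations.append((src, dst, amount))
        return migrations

# An illustrative policy: shed load above the cluster mean to the least-loaded server.
mean = lambda loads: sum(loads.values()) / len(loads)
balancer = ProgrammableBalancer(
    when=lambda src, load, loads: load > 1.2 * mean(loads),
    where=lambda src, loads: min(loads, key=loads.get),
    howmuch=lambda src, dst, loads: (loads[src] - mean(loads)) / 2,
)
print(balancer.tick({"mds0": 100.0, "mds1": 20.0, "mds2": 30.0}))
```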
{"title":"Mantle: a programmable metadata load balancer for the ceph file system","authors":"Michael Sevilla, Noah Watkins, C. Maltzahn, I. Nassi, S. Brandt, S. Weil, Greg Farnum, S. Fineberg","doi":"10.1145/2807591.2807607","DOIUrl":"https://doi.org/10.1145/2807591.2807607","url":null,"abstract":"Migrating resources is a useful tool for balancing load in a distributed system, but it is difficult to determine when to move resources, where to move resources, and how much of them to move. We look at resource migration for file system metadata and show how CephFS's dynamic subtree partitioning approach can exploit varying degrees of locality and balance because it can partition the namespace into variable sized units. Unfortunately, the current metadata balancer is complicated and difficult to control because it struggles to address many of the general resource migration challenges inherent to the metadata management problem. To help decouple policy from mechanism, we introduce a programmable storage system that lets the designer inject custom balancing logic. We show the flexibility and transparency of this approach by replicating the strategy of a state-of-the-art metadata balancer and conclude by comparing this strategy to other custom balancers on the same system.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127047971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 37