
SC15: International Conference for High Performance Computing, Networking, Storage and Analysis (latest publications)

Optimal scheduling of in-situ analysis for large-scale scientific simulations
Preeti Malakar, V. Vishwanath, T. Munson, Christopher Knight, M. Hereld, S. Leyffer, M. Papka
Today's leadership computing facilities have enabled the execution of transformative simulations at unprecedented scales. However, analyzing the huge amount of output from these simulations remains a challenge. Most analysis of this output is performed in post-processing mode at the end of the simulation. The time to read the output back for analysis can be substantial due to poor I/O bandwidth, which increases the end-to-end simulation-analysis time. Simulation-time analysis can reduce this end-to-end time. In this work, we formulate the scheduling of in-situ analysis as a numerical optimization problem that maximizes the number of online analyses subject to resource constraints such as I/O bandwidth, network bandwidth, rate of computation, and available memory. We demonstrate the effectiveness of our approach through two application case studies on the IBM Blue Gene/Q system.
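To make the formulation concrete, here is a minimal sketch of the kind of 0/1 scheduling decision the abstract describes: pick the largest set of online analyses whose combined resource use fits the available budgets. The analysis names, cost numbers, budgets, and the brute-force solver are all illustrative assumptions; the paper formulates and solves a real numerical optimization model.

```python
from itertools import combinations

# Hypothetical per-analysis costs: (memory GB, compute seconds, I/O GB)
analyses = {
    "histogram":   (2.0, 0.5, 0.10),
    "isosurface":  (6.0, 3.0, 0.80),
    "halo_finder": (8.0, 5.0, 1.50),
    "statistics":  (1.0, 0.2, 0.05),
}

MEM_BUDGET, TIME_BUDGET, IO_BUDGET = 10.0, 6.0, 2.0  # assumed budgets per step

def feasible(subset):
    """Check the chosen analyses against every resource constraint."""
    mem, cpu, io = (sum(analyses[a][k] for a in subset) for k in range(3))
    return mem <= MEM_BUDGET and cpu <= TIME_BUDGET and io <= IO_BUDGET

# Maximize the number of in-situ analyses; brute force suffices for a
# handful of candidates, whereas the paper solves a real optimization model.
best = max(
    (s for r in range(len(analyses) + 1)
       for s in combinations(analyses, r) if feasible(s)),
    key=len,
)
print("schedule in-situ:", best)
```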
{"title":"Optimal scheduling of in-situ analysis for large-scale scientific simulations","authors":"Preeti Malakar, V. Vishwanath, T. Munson, Christopher Knight, M. Hereld, S. Leyffer, M. Papka","doi":"10.1145/2807591.2807656","DOIUrl":"https://doi.org/10.1145/2807591.2807656","url":null,"abstract":"Today's leadership computing facilities have enabled the execution of transformative simulations at unprecedented scales. However, analyzing the huge amount of output from these simulations remains a challenge. Most analyses of this output is performed in post-processing mode at the end of the simulation. The time to read the output for the analysis can be significantly high due to poor I/O bandwidth, which increases the end-to-end simulation-analysis time. Simulation-time analysis can reduce this end-to-end time. In this work, we present the scheduling of in-situ analysis as a numerical optimization problem to maximize the number of online analyses subject to resource constraints such as I/O bandwidth, network bandwidth, rate of computation and available memory. We demonstrate the effectiveness of our approach through two application case studies on the IBM Blue Gene/Q system.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"324 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114010915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 38
Parallel distributed memory construction of suffix and longest common prefix arrays
P. Flick, S. Aluru
Suffix arrays and trees are fundamental string data structures of importance to many applications in computational biology. Consequently, their parallel construction is an actively studied problem. To date, algorithms with the best practical performance lack efficient worst-case run-time guarantees, and vice versa. In addition, much of the recent work has targeted low-core-count, shared-memory parallelization. In this paper, we present parallel algorithms for distributed memory construction of suffix arrays and longest common prefix (LCP) arrays that simultaneously achieve good worst-case run-time bounds and superior practical performance. Our algorithms run in O(Tsort(n, p) · log n) worst-case time, where Tsort(n, p) is the run-time of parallel sorting. We present several algorithm engineering techniques that improve performance in practice. We demonstrate the construction of suffix and LCP arrays of the human genome in less than 8 seconds on 1,024 Intel Xeon cores, reaching speedups of over 110X compared to divsufsort, the best sequential suffix array construction implementation.
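For context, the sort-based pattern the paper parallelizes is prefix doubling: repeatedly sort suffixes by their first 2^k characters, doubling k each round, which is where the O(Tsort(n, p) · log n) bound comes from once the local sort is replaced by a parallel one. Below is a plain sequential sketch of that textbook idea (suffix array only, LCP omitted), not the authors' distributed implementation.

```python
def suffix_array(s: str) -> list[int]:
    """Textbook prefix-doubling suffix array construction, O(n log^2 n)."""
    n = len(s)
    sa = list(range(n))
    rank = [ord(c) for c in s]
    k = 1
    while k < n:
        # Sort suffixes by the pair (rank of first k chars, rank of next k).
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        # Re-rank: suffixes with equal pairs share a rank.
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all suffixes distinguished: done early
            break
        k *= 2
    return sa

print(suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
```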
{"title":"Parallel distributed memory construction of suffix and longest common prefix arrays","authors":"P. Flick, S. Aluru","doi":"10.1145/2807591.2807609","DOIUrl":"https://doi.org/10.1145/2807591.2807609","url":null,"abstract":"Suffix arrays and trees are fundamental string data structures of importance to many applications in computational biology. Consequently, their parallel construction is an actively studied problem. To date, algorithms with best practical performance lack efficient worst-case run-time guarantees, and vice versa. In addition, much of the recent work targeted low core count, shared memory parallelization. In this paper, we present parallel algorithms for distributed memory construction of suffix arrays and longest common prefix (LCP) arrays that simultaneously achieve good worst-case run-time bounds and superior practical performance. Our algorithms run in O(Tsort(n, p) · log n) worst-case time where Tsort(n, p) is the run-time of parallel sorting. We present several algorithm engineering techniques that improve performance in practice. We demonstrate the construction of suffix and LCP arrays of the human genome in less than 8 seconds on 1,024 Intel Xeon cores, reaching speedups of over 110X compared to the best sequential suffix array construction implementation divsufsort.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"105 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122391513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 33
Scaling iterative graph computations with GraphMap
Kisung Lee, Ling Liu, K. Schwan, C. Pu, Qi Zhang, Yang Zhou, Emre Yigitoglu, Pingpeng Yuan
In recent years, systems researchers have devoted considerable effort to the study of large-scale graph processing. Existing distributed graph processing systems such as Pregel, which rely solely on distributed memory for their computations, fail to provide seamless scalability when the graph data and their intermediate computational results no longer fit into memory, and most distributed approaches for iterative graph computations do not consider utilizing secondary storage to be a viable solution. This paper presents GraphMap, a distributed iterative graph computation framework that maximizes access locality and speeds up distributed iterative graph computations by effectively utilizing secondary storage. GraphMap has three salient features: (1) It distinguishes data states that are mutable during iterative computations from those that are read-only in all iterations, to maximize sequential access and minimize random access. (2) It entails a two-level graph partitioning algorithm that enables balanced workloads and locality-optimized data placement. (3) It contains a proposed suite of locality-based optimizations that improve computational efficiency. Extensive experiments on several real-world graphs show that GraphMap outperforms existing distributed memory-based systems for various iterative graph algorithms.
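A minimal sketch of the first design point, assuming an invented binary edge-file format and PageRank as the iterative computation: the small mutable per-vertex state stays in memory, while the large read-only edge structure is streamed sequentially from secondary storage on every iteration, so disk access stays sequential rather than random.

```python
import struct

def write_edges(path, edges):
    """Read-only structure: written once, then only ever scanned."""
    with open(path, "wb") as f:
        for u, v in edges:
            f.write(struct.pack("<ii", u, v))

def pagerank(path, num_vertices, iters=20, d=0.85):
    rank = [1.0 / num_vertices] * num_vertices  # mutable state, kept in memory
    out_deg = [0] * num_vertices
    with open(path, "rb") as f:                 # one sequential pass for degrees
        while chunk := f.read(8):
            u, _ = struct.unpack("<ii", chunk)
            out_deg[u] += 1
    for _ in range(iters):
        nxt = [(1.0 - d) / num_vertices] * num_vertices
        with open(path, "rb") as f:             # sequential, not random, access
            while chunk := f.read(8):
                u, v = struct.unpack("<ii", chunk)
                nxt[v] += d * rank[u] / out_deg[u]
        rank = nxt                              # only the mutable part changes
    return rank

write_edges("graph.bin", [(0, 1), (1, 2), (2, 0), (0, 2)])
print(pagerank("graph.bin", 3))
```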
{"title":"Scaling iterative graph computations with GraphMap","authors":"Kisung Lee, Ling Liu, K. Schwan, C. Pu, Qi Zhang, Yang Zhou, Emre Yigitoglu, Pingpeng Yuan","doi":"10.1145/2807591.2807604","DOIUrl":"https://doi.org/10.1145/2807591.2807604","url":null,"abstract":"In recent years, systems researchers have devoted considerable effort to the study of large-scale graph processing. Existing distributed graph processing systems such as Pregel, based solely on distributed memory for their computations, fail to provide seamless scalability when the graph data and their intermediate computational results no longer fit into the memory; and most distributed approaches for iterative graph computations do not consider utilizing secondary storage a viable solution. This paper presents GraphMap, a distributed iterative graph computation framework that maximizes access locality and speeds up distributed iterative graph computations by effectively utilizing secondary storage. GraphMap has three salient features: (1) It distinguishes data states that are mutable during iterative computations from those that are read-only in all iterations to maximize sequential access and minimize random access. (2) It entails a two-level graph partitioning algorithm that enables balanced workloads and locality-optimized data placement. (3) It contains a proposed suite of locality-based optimizations that improve computational efficiency. Extensive experiments on several real-world graphs show that GraphMap outperforms existing distributed memory-based systems for various iterative graph algorithms.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131863946","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 32
A case for application-oblivious energy-efficient MPI runtime
Akshay Venkatesh, Abhinav Vishnu, Khaled Hamidouche, Nathan R. Tallent, D. Panda, D. Kerbyson, A. Hoisie
Power has become a major impediment in designing large-scale high-end systems. The Message Passing Interface (MPI) is the de facto communication interface used as the back-end for designing applications, programming models, and runtimes for these systems. Slack, the time spent by an MPI process in a single MPI call, provides a potential for energy and power savings if an appropriate power reduction technique such as core-idling or Dynamic Voltage and Frequency Scaling (DVFS) can be applied without affecting the application's performance. Existing techniques that exploit slack for power savings assume that application behavior repeats across iterations/executions. However, the increasing use of adaptive and data-dependent workloads, combined with system factors (OS noise, congestion), negates this assumption. This paper proposes and implements Energy Aware MPI (EAM), an application-oblivious energy-efficient MPI runtime. EAM uses a combination of communication models for common MPI primitives (point-to-point, collective, progress, blocking/non-blocking) and an online observation of slack to maximize energy efficiency and to honor performance degradation limits. Each power lever incurs a time overhead, which must be amortized over slack to minimize degradation. When the predicted communication time exceeds a lever's overhead, the lever is used as soon as possible to maximize energy efficiency. When a misprediction occurs, the levers are applied automatically at specific intervals for amortization. We implement EAM using MVAPICH2 and evaluate it on ten applications using up to 4,096 processes. Our performance evaluation on an InfiniBand cluster indicates that EAM can reduce energy consumption by 5-41% in comparison to the default approach, which prioritizes performance alone, with negligible (less than 4% in all cases) performance loss.
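A hedged sketch of the lever-amortization rule described above, with invented function names and thresholds: a power lever is applied around a blocking call only when the predicted slack exceeds the lever's transition overhead, and each observed wait feeds a simple online predictor. The real EAM runtime implements this inside MVAPICH2, not in application code.

```python
import time

LEVER_OVERHEAD_S = 0.002   # assumed cost of switching frequency down and up

def set_low_power():  pass  # placeholder for a DVFS or core-idling call
def set_full_power(): pass

def energy_aware_wait(wait_fn, predicted_slack_s, history):
    """Wrap a blocking wait; use the lever only if predicted slack pays for it."""
    use_lever = predicted_slack_s > LEVER_OVERHEAD_S
    if use_lever:
        set_low_power()
    t0 = time.perf_counter()
    wait_fn()                       # e.g., a blocking receive or MPI_Wait
    observed = time.perf_counter() - t0
    if use_lever:
        set_full_power()
    history.append(observed)        # online slack observation feeds the model
    # Naive predictor for the next call: mean of recent observations.
    return sum(history[-8:]) / len(history[-8:])

history, pred = [], 0.0
for _ in range(5):
    pred = energy_aware_wait(lambda: time.sleep(0.01), pred, history)
print(f"next predicted slack: {pred:.4f}s")
```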
{"title":"A case for application-oblivious energy-efficient MPI runtime","authors":"Akshay Venkatesh, Abhinav Vishnu, Khaled Hamidouche, Nathan R. Tallent, D. Panda, D. Kerbyson, A. Hoisie","doi":"10.1145/2807591.2807658","DOIUrl":"https://doi.org/10.1145/2807591.2807658","url":null,"abstract":"Power has become a major impediment in designing large scale high-end systems. Message Passing Interface (MPI) is the de facto communication interface used as the back-end for designing applications, programming models and runtime for these systems. Slack --- the time spent by an MPI process in a single MPI call---provides a potential for energy and power savings, if an appropriate power reduction technique such as core-idling/Dynamic Voltage and Frequency Scaling (DVFS) can be applied without affecting the application's performance. Existing techniques that exploit slack for power savings assume that application behavior repeats across iterations/executions. However, an increasing use of adaptive and data-dependent workloads combined with system factors (OS noise, congestion) negates this assumption. This paper proposes and implements Energy Aware MPI (EAM) --- an application-oblivious energy-efficient MPI runtime. EAM uses a combination of communication models for common MPI primitives (point-to-point, collective, progress, blocking/non-blocking) and an online observation of slack to maximize energy efficiency and to honor performance degradation limits. Each power lever incurs time overhead, which must be amortized over slack to minimize degradation. When predicted communication time exceeds a lever overhead, the lever is used as soon as possible --- to maximize energy efficiency. When a misprediction occurs, the lever(s) are used automatically at specific intervals for amortization. We implement EAM using MVAPICH2 and evaluate it on ten applications using up to 4,096 processes. Our performance evaluation on an InfiniBand cluster indicates that EAM can reduce energy consumption by 5-41% in comparison to the default approach, which prioritizes performance alone, with negligible (less than 4% in all cases) performance loss.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125069935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 32
The Spack package manager: bringing order to HPC software chaos
T. Gamblin, M. LeGendre, M. Collette, Gregory L. Lee, A. Moody, B. Supinski, S. Futral
Large HPC centers spend considerable time supporting software for thousands of users, but the complexity of HPC software is quickly outpacing the capabilities of existing software management tools. Scientific applications require specific versions of compilers, MPI, and other dependency libraries, so using a single, standard software stack is infeasible. However, managing many configurations is difficult because the configuration space is combinatorial in size. We introduce Spack, a tool used at Lawrence Livermore National Laboratory to manage this complexity. Spack provides a novel, recursive specification syntax to invoke parametric builds of packages and dependencies. It allows any number of builds to coexist on the same system, and it ensures that installed packages can find their dependencies, regardless of the environment. We show through real-world use cases that Spack supports diverse and demanding applications, bringing order to HPC software chaos.
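To illustrate why the configuration space is combinatorial, the sketch below enumerates builds of one package across compilers, MPI implementations, and a variant flag, naming each point with Spack's documented spec syntax (package@version, %compiler, +/~variant, ^dependency). The package and version choices are illustrative.

```python
from itertools import product

compilers = ["gcc@4.7.3", "gcc@5.4.0", "intel@16.0"]
mpis      = ["mpich@3.0.4", "openmpi@1.10", "mvapich2@2.1"]
variants  = ["+debug", "~debug"]

# Each spec names one concrete build configuration:
#   package@version %compiler variant ^dependency
specs = [
    f"mpileaks@2.3 %{c} {v} ^{m}"
    for c, m, v in product(compilers, mpis, variants)
]
print(len(specs), "distinct builds of a single package, e.g.:")
print(specs[0])   # mpileaks@2.3 %gcc@4.7.3 +debug ^mpich@3.0.4
```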
{"title":"The Spack package manager: bringing order to HPC software chaos","authors":"T. Gamblin, M. LeGendre, M. Collette, Gregory L. Lee, A. Moody, B. Supinski, S. Futral","doi":"10.1145/2807591.2807623","DOIUrl":"https://doi.org/10.1145/2807591.2807623","url":null,"abstract":"Large HPC centers spend considerable time supporting software for thousands of users, but the complexity of HPC software is quickly outpacing the capabilities of existing software management tools. Scientific applications require specific versions of compilers, MPI, and other dependency libraries, so using a single, standard software stack is infeasible. However, managing many configurations is difficult because the configuration space is combinatorial in size. We introduce Spack, a tool used at Lawrence Livermore National Laboratory to manage this complexity. Spack provides a novel, recursive specification syntax to invoke parametric builds of packages and dependencies. It allows any number of builds to coexist on the same system, and it ensures that installed packages can find their dependencies, regardless of the environment. We show through real-world use cases that Spack supports diverse and demanding applications, bringing order to HPC software chaos.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128499290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 222
Large-scale compute-intensive analysis via a combined in-situ and co-scheduling workflow approach
Christopher M. Sewell, K. Heitmann, H. Finkel, G. Zagaris, S. Parete-Koon, P. Fasel, A. Pope, N. Frontiere, Li-Ta Lo, O. E. Messer, S. Habib, J. Ahrens
Large-scale simulations can produce hundreds of terabytes to petabytes of data, complicating and limiting the efficiency of workflows. Traditionally, outputs are stored on the file system and analyzed in post-processing. With the rapidly increasing size and complexity of simulations, this approach faces an uncertain future. Emerging techniques perform the analysis in-situ, utilizing the same resources as the simulation, and/or off-load subsets of the data to a compute-intensive analysis system. We introduce an analysis framework developed for HACC, a cosmological N-body code, that uses both in-situ and co-scheduling approaches for handling petabyte-scale outputs. We compare different analysis set-ups ranging from purely off-line, to purely in-situ, to in-situ/co-scheduling. The analysis routines are implemented using the PISTON/VTK-m framework, allowing a single implementation of an algorithm to simultaneously target a variety of GPU, multi-core, and many-core architectures.
{"title":"Large-scale compute-intensive analysis via a combined in-situ and co-scheduling workflow approach","authors":"Christopher M. Sewell, K. Heitmann, H. Finkel, G. Zagaris, S. Parete-Koon, P. Fasel, A. Pope, N. Frontiere, Li-Ta Lo, O. E. Messer, S. Habib, J. Ahrens","doi":"10.1145/2807591.2807663","DOIUrl":"https://doi.org/10.1145/2807591.2807663","url":null,"abstract":"Large-scale simulations can produce hundreds of terabytes to petabytes of data, complicating and limiting the efficiency of workflows. Traditionally, outputs are stored on the file system and analyzed in post-processing. With the rapidly increasing size and complexity of simulations, this approach faces an uncertain future. Trending techniques consist of performing the analysis in-situ, utilizing the same resources as the simulation, and/or off-loading subsets of the data to a compute-intensive analysis system. We introduce an analysis framework developed for HACC, a cosmological N-body code, that uses both in-situ and co-scheduling approaches for handling petabyte-scale outputs. We compare different analysis set-ups ranging from purely off-line, to purely in-situ to in-situ/co-scheduling. The analysis routines are implemented using the PISTON/VTK-m framework, allowing a single implementation of an algorithm that simultaneously targets a variety of GPU, multi-core, and many-core architectures.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130224667","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 29
Dynamic power sharing for higher job throughput
D. Ellsworth, A. Malony, B. Rountree, M. Schulz
Current trends for high-performance systems are leading towards hardware overprovisioning, where it is no longer possible to run all components at peak power without exceeding a system- or facility-wide power bound. The standard practice of static power scheduling is likely to lead to inefficiencies, with over- and under-provisioning of power to components at runtime. In this paper we investigate the performance and scalability of an application-agnostic runtime power scheduler (POWsched) that is capable of enforcing a system-wide power limit. Our experimental results show that POWsched is robust, has negligible overhead, and can take advantage of opportunities to shift wasted power to more power-intensive applications, improving overall workload runtime by as much as 14% without job scheduler integration or application-specific profiling. In addition, we conduct scalability studies to determine POWsched's overhead for large node counts. Lastly, we contribute a model and simulator (POWsim) for investigating dynamic power scheduling behavior and enforcement at scale.
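A minimal sketch of one plausible rebalancing step in the spirit of POWsched, under invented thresholds and a flat redistribution policy: reclaim watts from nodes drawing well under their caps and grant them to nodes running at their caps, keeping the total within the system-wide bound. The paper's actual algorithm may differ.

```python
SYSTEM_BOUND_W = 400.0              # facility-wide power bound (assumed)
MIN_CAP_W, HEADROOM_W = 50.0, 5.0   # assumed floor and slack threshold

def rebalance(caps, draws):
    """caps/draws map node -> watts; returns new caps within the bound."""
    pool, new_caps, starved = 0.0, {}, []
    for node, cap in caps.items():
        slack = cap - draws[node]
        if slack > HEADROOM_W:                 # under-consuming: reclaim watts
            give = max(0.0, min(slack - HEADROOM_W, cap - MIN_CAP_W))
            new_caps[node] = cap - give
            pool += give
        else:                                  # running at its cap: starved
            new_caps[node] = cap
            starved.append(node)
    for node in starved:                       # grant reclaimed watts evenly
        new_caps[node] += pool / len(starved)
    return new_caps

caps  = {"n0": 100.0, "n1": 100.0, "n2": 100.0, "n3": 100.0}
draws = {"n0": 98.0,  "n1": 97.0,  "n2": 40.0,  "n3": 55.0}
print(rebalance(caps, draws))   # watts flow from n2/n3 toward n0/n1
```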
{"title":"Dynamic power sharing for higher job throughput","authors":"D. Ellsworth, A. Malony, B. Rountree, M. Schulz","doi":"10.1145/2807591.2807643","DOIUrl":"https://doi.org/10.1145/2807591.2807643","url":null,"abstract":"Current trends for high-performance systems are leading towards hardware overprovisioning where it is no longer possible to run all components at peak power without exceeding a system- or facility-wide power bound. The standard practice of static power scheduling is likely to lead to inefficiencies with over- and under-provisioning of power to components at runtime. In this paper we investigate the performance and scalability of an application agnostic runtime power scheduler (POWsched) that is capable of enforcing a system-wide power limit. Our experimental results show POWsched is robust, has negligible overhead, and can take advantage of opportunities to shift wasted power to more power-intensive applications, improving overall workload runtime by as much as 14% without job scheduler integration or application specific profiling. In addition, we conduct scalability studies to determine POWsched's overhead for large node counts. Lastly, we contribute a model and simulator (POWsim) for investigating dynamic power scheduling behavior and enforcement at scale.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121619947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 57
Implicit nonlinear wave simulation with 1.08T DOF and 0.270T unstructured finite elements to enhance comprehensive earthquake simulation
T. Ichimura, K. Fujita, P. E. Quinay, Lalith Maddegedara, M. Hori, Seizo Tanaka, Y. Shizawa, Hiroshi Kobayashi, K. Minami
This paper presents a new heroic computing method for unstructured, low-order, finite-element, implicit nonlinear wave simulation: 1.97 PFLOPS (18.6% of peak) was attained on the full K computer when solving a 1.08T degrees-of-freedom (DOF), 0.270T-element problem. This represents 40.1 times more DOF and elements, a 2.68-fold improvement in peak performance, and 3.67 times faster time-to-solution compared to the SC14 Gordon Bell finalist's state-of-the-art simulation. The method scales up to the full K computer with 663,552 CPU cores at 96.6% sizeup efficiency, enabling the solution of a 1.08T DOF problem in 29.7 s per time step. Using such heroic computing, we solved a practical problem involving an area 23.7 times larger than the state-of-the-art, and conducted a comprehensive earthquake simulation by combining earthquake wave propagation analysis and evacuation analysis. Application at such scale is a groundbreaking accomplishment and is expected to change the quality of earthquake disaster estimation and contribute to society.
{"title":"Implicit nonlinear wave simulation with 1.08T DOF and 0.270T unstructured finite elements to enhance comprehensive earthquake simulation","authors":"T. Ichimura, K. Fujita, P. E. Quinay, Lalith Maddegedara, M. Hori, Seizo Tanaka, Y. Shizawa, Hiroshi Kobayashi, K. Minami","doi":"10.1145/2807591.2807674","DOIUrl":"https://doi.org/10.1145/2807591.2807674","url":null,"abstract":"This paper presents a new heroic computing method for unstructured, low-order, finite-element, implicit nonlinear wave simulation: 1.97 PFLOPS (18.6% of peak) was attained on the full K computer when solving a 1.08T degrees-of-freedom (DOF) and 0.270T-element problem. This is 40.1 times more DOF and elements, a 2.68-fold improvement in peak performance, and 3.67 times faster in time-to-solution compared to the SC14 Gordon Bell finalist's state-of-the-art simulation. The method scales up to the full K computer with 663,552 CPU cores with 96.6% sizeup efficiency, enabling solving of a 1.08T DOF problem in 29.7 s per time step. Using such heroic computing, we solved a practical problem involving an area 23.7 times larger than the state-of-the-art, and conducted a comprehensive earthquake simulation by combining earthquake wave propagation analysis and evacuation analysis. Application at such scale is a groundbreaking accomplishment and is expected to change the quality of earthquake disaster estimation and contribute to society.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121966518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 61
Mantle: a programmable metadata load balancer for the ceph file system
Michael Sevilla, Noah Watkins, C. Maltzahn, I. Nassi, S. Brandt, S. Weil, Greg Farnum, S. Fineberg
Migrating resources is a useful tool for balancing load in a distributed system, but it is difficult to determine when to move resources, where to move them, and how much of them to move. We look at resource migration for file system metadata and show how CephFS's dynamic subtree partitioning approach can exploit varying degrees of locality and balance because it can partition the namespace into variable-sized units. Unfortunately, the current metadata balancer is complicated and difficult to control because it struggles to address many of the general resource migration challenges inherent to the metadata management problem. To help decouple policy from mechanism, we introduce a programmable storage system that lets the designer inject custom balancing logic. We show the flexibility and transparency of this approach by replicating the strategy of a state-of-the-art metadata balancer, and conclude by comparing this strategy to other custom balancers on the same system.
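A sketch of what an injected balancing policy might look like. Mantle's real policies are small Lua callbacks evaluated inside the CephFS metadata servers; the Python mock below, with invented metric names, only mirrors the when-to-migrate and how-much-to-send structure of such a policy.

```python
def when(metrics):
    """Migrate only if this server is clearly hotter than the cluster average."""
    return metrics["my_load"] > 1.5 * metrics["avg_load"]

def howmuch(metrics):
    """Send the excess over the cluster average, never more than half our load."""
    excess = metrics["my_load"] - metrics["avg_load"]
    return min(excess, 0.5 * metrics["my_load"])

# Hypothetical metrics the mechanism would hand to the policy each tick:
metrics = {"my_load": 120.0, "avg_load": 60.0}
if when(metrics):
    print(f"shed {howmuch(metrics):.1f} units of metadata load")
```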
{"title":"Mantle: a programmable metadata load balancer for the ceph file system","authors":"Michael Sevilla, Noah Watkins, C. Maltzahn, I. Nassi, S. Brandt, S. Weil, Greg Farnum, S. Fineberg","doi":"10.1145/2807591.2807607","DOIUrl":"https://doi.org/10.1145/2807591.2807607","url":null,"abstract":"Migrating resources is a useful tool for balancing load in a distributed system, but it is difficult to determine when to move resources, where to move resources, and how much of them to move. We look at resource migration for file system metadata and show how CephFS's dynamic subtree partitioning approach can exploit varying degrees of locality and balance because it can partition the namespace into variable sized units. Unfortunately, the current metadata balancer is complicated and difficult to control because it struggles to address many of the general resource migration challenges inherent to the metadata management problem. To help decouple policy from mechanism, we introduce a programmable storage system that lets the designer inject custom balancing logic. We show the flexibility and transparency of this approach by replicating the strategy of a state-of-the-art metadata balancer and conclude by comparing this strategy to other custom balancers on the same system.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127047971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 37
Elastic job bundling: an adaptive resource request strategy for large-scale parallel applications
Feng Liu, J. Weissman
In today's batch-queue HPC cluster systems, the user submits a job requesting a fixed number of processors. The system will not start the job until all of the requested resources become available simultaneously. When cluster workload is high, large jobs experience long waiting times under this policy. In this paper, we propose a new approach that dynamically decomposes a large job into smaller ones to reduce waiting time, and lets the application expand across multiple subjobs while continuously making progress. This approach has three benefits: (i) application turnaround time is reduced, (ii) system fragmentation is diminished, and (iii) fairness is promoted. Our approach does not depend on job queue time prediction but exploits available backfill opportunities. Simulation results show that our approach can reduce application mean turnaround time by up to 48%.
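A minimal sketch of the decomposition idea, assuming the scheduler exposes its currently open backfill holes and using a greedy largest-hole-first split; both are illustrative simplifications of the paper's strategy.

```python
def bundle(total_procs, backfill_windows, min_subjob=32):
    """Greedily carve a P-processor request into subjobs that fit open holes."""
    subjobs, remaining = [], total_procs
    for hole in sorted(backfill_windows, reverse=True):
        if remaining <= 0:
            break
        size = min(hole, remaining)
        if size >= min_subjob:      # avoid fragments too small to be useful
            subjobs.append(size)
            remaining -= size
    if remaining > 0:
        subjobs.append(remaining)   # leftover waits in the queue as usual
    return subjobs

# A 1024-process job, with three backfill holes currently available:
print(bundle(1024, backfill_windows=[256, 128, 512]))  # [512, 256, 128, 128]
```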
{"title":"Elastic job bundling: an adaptive resource request strategy for large-scale parallel applications","authors":"Feng Liu, J. Weissman","doi":"10.1145/2807591.2807610","DOIUrl":"https://doi.org/10.1145/2807591.2807610","url":null,"abstract":"In today's batch queue HPC cluster systems, the user submits a job requesting a fixed number of processors. The system will not start the job until all of the requested resources become available simultaneously. When cluster workload is high, large sized jobs will experience long waiting time due to this policy. In this paper, we propose a new approach that dynamically decomposes a large job into smaller ones to reduce waiting time, and lets the application expand across multiple subjobs while continuously achieving progress. This approach has three benefits: (i) application turnaround time is reduced, (ii) system fragmentation is diminished, and (iii) fairness is promoted. Our approach does not depend on job queue time prediction but exploits available backfill opportunities. Simulation results have shown that our approach can reduce application mean turnaround time by up to 48%.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126805096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 21