"Optimal scheduling of in-situ analysis for large-scale scientific simulations"
Preeti Malakar, V. Vishwanath, T. Munson, Christopher Knight, M. Hereld, S. Leyffer, M. Papka
Today's leadership computing facilities have enabled the execution of transformative simulations at unprecedented scales. However, analyzing the huge volume of output from these simulations remains a challenge. Most analysis of this output is performed in post-processing mode after the simulation ends. The time to read the output for analysis can be significant due to poor I/O bandwidth, which increases the end-to-end simulation-analysis time. Simulation-time analysis can reduce this end-to-end time. In this work, we formulate the scheduling of in-situ analysis as a numerical optimization problem that maximizes the number of online analyses subject to resource constraints such as I/O bandwidth, network bandwidth, computation rate, and available memory. We demonstrate the effectiveness of our approach through two application case studies on the IBM Blue Gene/Q system.
{"title":"Optimal scheduling of in-situ analysis for large-scale scientific simulations","authors":"Preeti Malakar, V. Vishwanath, T. Munson, Christopher Knight, M. Hereld, S. Leyffer, M. Papka","doi":"10.1145/2807591.2807656","DOIUrl":"https://doi.org/10.1145/2807591.2807656","url":null,"abstract":"Today's leadership computing facilities have enabled the execution of transformative simulations at unprecedented scales. However, analyzing the huge amount of output from these simulations remains a challenge. Most analyses of this output is performed in post-processing mode at the end of the simulation. The time to read the output for the analysis can be significantly high due to poor I/O bandwidth, which increases the end-to-end simulation-analysis time. Simulation-time analysis can reduce this end-to-end time. In this work, we present the scheduling of in-situ analysis as a numerical optimization problem to maximize the number of online analyses subject to resource constraints such as I/O bandwidth, network bandwidth, rate of computation and available memory. We demonstrate the effectiveness of our approach through two application case studies on the IBM Blue Gene/Q system.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"324 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114010915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Parallel distributed memory construction of suffix and longest common prefix arrays"
P. Flick, S. Aluru
Suffix arrays and trees are fundamental string data structures of importance to many applications in computational biology. Consequently, their parallel construction is an actively studied problem. To date, algorithms with the best practical performance lack efficient worst-case run-time guarantees, and vice versa. In addition, much of the recent work has targeted low-core-count, shared-memory parallelization. In this paper, we present parallel algorithms for distributed-memory construction of suffix arrays and longest common prefix (LCP) arrays that simultaneously achieve good worst-case run-time bounds and superior practical performance. Our algorithms run in O(Tsort(n, p) · log n) worst-case time, where Tsort(n, p) is the run-time of parallel sorting. We present several algorithm engineering techniques that improve performance in practice. We demonstrate the construction of suffix and LCP arrays of the human genome in less than 8 seconds on 1,024 Intel Xeon cores, reaching speedups of over 110X compared to divsufsort, the best sequential suffix array construction implementation.
{"title":"Parallel distributed memory construction of suffix and longest common prefix arrays","authors":"P. Flick, S. Aluru","doi":"10.1145/2807591.2807609","DOIUrl":"https://doi.org/10.1145/2807591.2807609","url":null,"abstract":"Suffix arrays and trees are fundamental string data structures of importance to many applications in computational biology. Consequently, their parallel construction is an actively studied problem. To date, algorithms with best practical performance lack efficient worst-case run-time guarantees, and vice versa. In addition, much of the recent work targeted low core count, shared memory parallelization. In this paper, we present parallel algorithms for distributed memory construction of suffix arrays and longest common prefix (LCP) arrays that simultaneously achieve good worst-case run-time bounds and superior practical performance. Our algorithms run in O(Tsort(n, p) · log n) worst-case time where Tsort(n, p) is the run-time of parallel sorting. We present several algorithm engineering techniques that improve performance in practice. We demonstrate the construction of suffix and LCP arrays of the human genome in less than 8 seconds on 1,024 Intel Xeon cores, reaching speedups of over 110X compared to the best sequential suffix array construction implementation divsufsort.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"105 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122391513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Scaling iterative graph computations with GraphMap"
Kisung Lee, Ling Liu, K. Schwan, C. Pu, Qi Zhang, Yang Zhou, Emre Yigitoglu, Pingpeng Yuan
In recent years, systems researchers have devoted considerable effort to the study of large-scale graph processing. Existing distributed graph processing systems such as Pregel, which rely solely on distributed memory for their computations, fail to scale seamlessly when the graph data and intermediate computational results no longer fit in memory, and most distributed approaches to iterative graph computation do not treat secondary storage as a viable option. This paper presents GraphMap, a distributed iterative graph computation framework that maximizes access locality and speeds up distributed iterative graph computations by effectively utilizing secondary storage. GraphMap has three salient features: (1) it distinguishes data states that are mutable during iterative computations from those that are read-only in all iterations, to maximize sequential access and minimize random access; (2) it employs a two-level graph partitioning algorithm that enables balanced workloads and locality-optimized data placement; and (3) it provides a suite of locality-based optimizations that improve computational efficiency. Extensive experiments on several real-world graphs show that GraphMap outperforms existing distributed memory-based systems for various iterative graph algorithms.
{"title":"Scaling iterative graph computations with GraphMap","authors":"Kisung Lee, Ling Liu, K. Schwan, C. Pu, Qi Zhang, Yang Zhou, Emre Yigitoglu, Pingpeng Yuan","doi":"10.1145/2807591.2807604","DOIUrl":"https://doi.org/10.1145/2807591.2807604","url":null,"abstract":"In recent years, systems researchers have devoted considerable effort to the study of large-scale graph processing. Existing distributed graph processing systems such as Pregel, based solely on distributed memory for their computations, fail to provide seamless scalability when the graph data and their intermediate computational results no longer fit into the memory; and most distributed approaches for iterative graph computations do not consider utilizing secondary storage a viable solution. This paper presents GraphMap, a distributed iterative graph computation framework that maximizes access locality and speeds up distributed iterative graph computations by effectively utilizing secondary storage. GraphMap has three salient features: (1) It distinguishes data states that are mutable during iterative computations from those that are read-only in all iterations to maximize sequential access and minimize random access. (2) It entails a two-level graph partitioning algorithm that enables balanced workloads and locality-optimized data placement. (3) It contains a proposed suite of locality-based optimizations that improve computational efficiency. Extensive experiments on several real-world graphs show that GraphMap outperforms existing distributed memory-based systems for various iterative graph algorithms.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131863946","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"A case for application-oblivious energy-efficient MPI runtime"
Akshay Venkatesh, Abhinav Vishnu, Khaled Hamidouche, Nathan R. Tallent, D. Panda, D. Kerbyson, A. Hoisie
Power has become a major impediment in designing large-scale high-end systems. The Message Passing Interface (MPI) is the de facto communication interface used as the back-end for designing applications, programming models, and runtimes for these systems. Slack, the time spent by an MPI process in a single MPI call, provides an opportunity for energy and power savings if an appropriate power reduction technique, such as core idling or Dynamic Voltage and Frequency Scaling (DVFS), can be applied without affecting the application's performance. Existing techniques that exploit slack for power savings assume that application behavior repeats across iterations and executions. However, the increasing use of adaptive and data-dependent workloads, combined with system factors (OS noise, congestion), negates this assumption. This paper proposes and implements Energy Aware MPI (EAM), an application-oblivious energy-efficient MPI runtime. EAM uses a combination of communication models for common MPI primitives (point-to-point, collective, progress, blocking/non-blocking) and online observation of slack to maximize energy efficiency while honoring performance degradation limits. Each power lever incurs a time overhead, which must be amortized over slack to minimize degradation. When the predicted communication time exceeds a lever's overhead, the lever is applied as soon as possible to maximize energy efficiency. When a misprediction occurs, the levers are applied automatically at specific intervals to amortize their overhead. We implement EAM using MVAPICH2 and evaluate it on ten applications using up to 4,096 processes. Our performance evaluation on an InfiniBand cluster indicates that EAM can reduce energy consumption by 5-41% compared to the default approach, which prioritizes performance alone, with negligible (less than 4% in all cases) performance loss.
{"title":"A case for application-oblivious energy-efficient MPI runtime","authors":"Akshay Venkatesh, Abhinav Vishnu, Khaled Hamidouche, Nathan R. Tallent, D. Panda, D. Kerbyson, A. Hoisie","doi":"10.1145/2807591.2807658","DOIUrl":"https://doi.org/10.1145/2807591.2807658","url":null,"abstract":"Power has become a major impediment in designing large scale high-end systems. Message Passing Interface (MPI) is the de facto communication interface used as the back-end for designing applications, programming models and runtime for these systems. Slack --- the time spent by an MPI process in a single MPI call---provides a potential for energy and power savings, if an appropriate power reduction technique such as core-idling/Dynamic Voltage and Frequency Scaling (DVFS) can be applied without affecting the application's performance. Existing techniques that exploit slack for power savings assume that application behavior repeats across iterations/executions. However, an increasing use of adaptive and data-dependent workloads combined with system factors (OS noise, congestion) negates this assumption. This paper proposes and implements Energy Aware MPI (EAM) --- an application-oblivious energy-efficient MPI runtime. EAM uses a combination of communication models for common MPI primitives (point-to-point, collective, progress, blocking/non-blocking) and an online observation of slack to maximize energy efficiency and to honor performance degradation limits. Each power lever incurs time overhead, which must be amortized over slack to minimize degradation. When predicted communication time exceeds a lever overhead, the lever is used as soon as possible --- to maximize energy efficiency. When a misprediction occurs, the lever(s) are used automatically at specific intervals for amortization. We implement EAM using MVAPICH2 and evaluate it on ten applications using up to 4,096 processes. Our performance evaluation on an InfiniBand cluster indicates that EAM can reduce energy consumption by 5-41% in comparison to the default approach, which prioritizes performance alone, with negligible (less than 4% in all cases) performance loss.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125069935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"The Spack package manager: bringing order to HPC software chaos"
T. Gamblin, M. LeGendre, M. Collette, Gregory L. Lee, A. Moody, B. Supinski, S. Futral
Large HPC centers spend considerable time supporting software for thousands of users, but the complexity of HPC software is quickly outpacing the capabilities of existing software management tools. Scientific applications require specific versions of compilers, MPI, and other dependency libraries, so using a single, standard software stack is infeasible. However, managing many configurations is difficult because the configuration space is combinatorial in size. We introduce Spack, a tool used at Lawrence Livermore National Laboratory to manage this complexity. Spack provides a novel, recursive specification syntax to invoke parametric builds of packages and dependencies. It allows any number of builds to coexist on the same system, and it ensures that installed packages can find their dependencies, regardless of the environment. We show through real-world use cases that Spack supports diverse and demanding applications, bringing order to HPC software chaos.
{"title":"The Spack package manager: bringing order to HPC software chaos","authors":"T. Gamblin, M. LeGendre, M. Collette, Gregory L. Lee, A. Moody, B. Supinski, S. Futral","doi":"10.1145/2807591.2807623","DOIUrl":"https://doi.org/10.1145/2807591.2807623","url":null,"abstract":"Large HPC centers spend considerable time supporting software for thousands of users, but the complexity of HPC software is quickly outpacing the capabilities of existing software management tools. Scientific applications require specific versions of compilers, MPI, and other dependency libraries, so using a single, standard software stack is infeasible. However, managing many configurations is difficult because the configuration space is combinatorial in size. We introduce Spack, a tool used at Lawrence Livermore National Laboratory to manage this complexity. Spack provides a novel, recursive specification syntax to invoke parametric builds of packages and dependencies. It allows any number of builds to coexist on the same system, and it ensures that installed packages can find their dependencies, regardless of the environment. We show through real-world use cases that Spack supports diverse and demanding applications, bringing order to HPC software chaos.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128499290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Large-scale compute-intensive analysis via a combined in-situ and co-scheduling workflow approach"
Christopher M. Sewell, K. Heitmann, H. Finkel, G. Zagaris, S. Parete-Koon, P. Fasel, A. Pope, N. Frontiere, Li-Ta Lo, O. E. Messer, S. Habib, J. Ahrens
Large-scale simulations can produce hundreds of terabytes to petabytes of data, complicating and limiting the efficiency of workflows. Traditionally, outputs are stored on the file system and analyzed in post-processing. With the rapidly increasing size and complexity of simulations, this approach faces an uncertain future. Emerging techniques perform the analysis in situ, utilizing the same resources as the simulation, and/or off-load subsets of the data to a compute-intensive analysis system. We introduce an analysis framework developed for HACC, a cosmological N-body code, that uses both in-situ and co-scheduling approaches for handling petabyte-scale outputs. We compare different analysis setups ranging from purely off-line, to purely in-situ, to combined in-situ/co-scheduling. The analysis routines are implemented using the PISTON/VTK-m framework, allowing a single implementation of an algorithm to simultaneously target a variety of GPU, multi-core, and many-core architectures.
{"title":"Large-scale compute-intensive analysis via a combined in-situ and co-scheduling workflow approach","authors":"Christopher M. Sewell, K. Heitmann, H. Finkel, G. Zagaris, S. Parete-Koon, P. Fasel, A. Pope, N. Frontiere, Li-Ta Lo, O. E. Messer, S. Habib, J. Ahrens","doi":"10.1145/2807591.2807663","DOIUrl":"https://doi.org/10.1145/2807591.2807663","url":null,"abstract":"Large-scale simulations can produce hundreds of terabytes to petabytes of data, complicating and limiting the efficiency of workflows. Traditionally, outputs are stored on the file system and analyzed in post-processing. With the rapidly increasing size and complexity of simulations, this approach faces an uncertain future. Trending techniques consist of performing the analysis in-situ, utilizing the same resources as the simulation, and/or off-loading subsets of the data to a compute-intensive analysis system. We introduce an analysis framework developed for HACC, a cosmological N-body code, that uses both in-situ and co-scheduling approaches for handling petabyte-scale outputs. We compare different analysis set-ups ranging from purely off-line, to purely in-situ to in-situ/co-scheduling. The analysis routines are implemented using the PISTON/VTK-m framework, allowing a single implementation of an algorithm that simultaneously targets a variety of GPU, multi-core, and many-core architectures.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130224667","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Dynamic power sharing for higher job throughput"
D. Ellsworth, A. Malony, B. Rountree, M. Schulz
Current trends in high-performance systems are leading toward hardware overprovisioning, where it is no longer possible to run all components at peak power without exceeding a system- or facility-wide power bound. The standard practice of static power scheduling is likely to lead to inefficiencies, with over- and under-provisioning of power to components at runtime. In this paper we investigate the performance and scalability of an application-agnostic runtime power scheduler (POWsched) that is capable of enforcing a system-wide power limit. Our experimental results show that POWsched is robust, has negligible overhead, and can take advantage of opportunities to shift wasted power to more power-intensive applications, improving overall workload runtime by as much as 14% without job scheduler integration or application-specific profiling. In addition, we conduct scalability studies to determine POWsched's overhead for large node counts. Lastly, we contribute a model and simulator (POWsim) for investigating dynamic power scheduling behavior and enforcement at scale.
{"title":"Dynamic power sharing for higher job throughput","authors":"D. Ellsworth, A. Malony, B. Rountree, M. Schulz","doi":"10.1145/2807591.2807643","DOIUrl":"https://doi.org/10.1145/2807591.2807643","url":null,"abstract":"Current trends for high-performance systems are leading towards hardware overprovisioning where it is no longer possible to run all components at peak power without exceeding a system- or facility-wide power bound. The standard practice of static power scheduling is likely to lead to inefficiencies with over- and under-provisioning of power to components at runtime. In this paper we investigate the performance and scalability of an application agnostic runtime power scheduler (POWsched) that is capable of enforcing a system-wide power limit. Our experimental results show POWsched is robust, has negligible overhead, and can take advantage of opportunities to shift wasted power to more power-intensive applications, improving overall workload runtime by as much as 14% without job scheduler integration or application specific profiling. In addition, we conduct scalability studies to determine POWsched's overhead for large node counts. Lastly, we contribute a model and simulator (POWsim) for investigating dynamic power scheduling behavior and enforcement at scale.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121619947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Implicit nonlinear wave simulation with 1.08T DOF and 0.270T unstructured finite elements to enhance comprehensive earthquake simulation"
T. Ichimura, K. Fujita, P. E. Quinay, Lalith Maddegedara, M. Hori, Seizo Tanaka, Y. Shizawa, Hiroshi Kobayashi, K. Minami
This paper presents a new heroic computing method for unstructured, low-order, finite-element, implicit nonlinear wave simulation: 1.97 PFLOPS (18.6% of peak) was attained on the full K computer when solving a 1.08T degrees-of-freedom (DOF), 0.270T-element problem. This represents 40.1 times more DOF and elements, a 2.68-fold improvement in peak performance, and a 3.67-fold speedup in time-to-solution compared to the SC14 Gordon Bell finalist's state-of-the-art simulation. The method scales up to the full K computer with 663,552 CPU cores with 96.6% sizeup efficiency, enabling the solution of a 1.08T DOF problem in 29.7 s per time step. Using such heroic computing, we solved a practical problem involving an area 23.7 times larger than the state of the art, and conducted a comprehensive earthquake simulation combining earthquake wave propagation analysis and evacuation analysis. Application at such scale is a groundbreaking accomplishment and is expected to change the quality of earthquake disaster estimation and contribute to society.
{"title":"Implicit nonlinear wave simulation with 1.08T DOF and 0.270T unstructured finite elements to enhance comprehensive earthquake simulation","authors":"T. Ichimura, K. Fujita, P. E. Quinay, Lalith Maddegedara, M. Hori, Seizo Tanaka, Y. Shizawa, Hiroshi Kobayashi, K. Minami","doi":"10.1145/2807591.2807674","DOIUrl":"https://doi.org/10.1145/2807591.2807674","url":null,"abstract":"This paper presents a new heroic computing method for unstructured, low-order, finite-element, implicit nonlinear wave simulation: 1.97 PFLOPS (18.6% of peak) was attained on the full K computer when solving a 1.08T degrees-of-freedom (DOF) and 0.270T-element problem. This is 40.1 times more DOF and elements, a 2.68-fold improvement in peak performance, and 3.67 times faster in time-to-solution compared to the SC14 Gordon Bell finalist's state-of-the-art simulation. The method scales up to the full K computer with 663,552 CPU cores with 96.6% sizeup efficiency, enabling solving of a 1.08T DOF problem in 29.7 s per time step. Using such heroic computing, we solved a practical problem involving an area 23.7 times larger than the state-of-the-art, and conducted a comprehensive earthquake simulation by combining earthquake wave propagation analysis and evacuation analysis. Application at such scale is a groundbreaking accomplishment and is expected to change the quality of earthquake disaster estimation and contribute to society.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121966518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Mantle: a programmable metadata load balancer for the Ceph file system"
Michael Sevilla, Noah Watkins, C. Maltzahn, I. Nassi, S. Brandt, S. Weil, Greg Farnum, S. Fineberg
Migrating resources is a useful tool for balancing load in a distributed system, but it is difficult to determine when to move resources, where to move them, and how much to move. We look at resource migration for file system metadata and show how CephFS's dynamic subtree partitioning approach can exploit varying degrees of locality and balance, because it can partition the namespace into variable-sized units. Unfortunately, the current metadata balancer is complicated and difficult to control because it struggles to address many of the general resource migration challenges inherent to the metadata management problem. To help decouple policy from mechanism, we introduce a programmable storage system that lets the designer inject custom balancing logic. We show the flexibility and transparency of this approach by replicating the strategy of a state-of-the-art metadata balancer, and conclude by comparing this strategy to other custom balancers on the same system.
{"title":"Mantle: a programmable metadata load balancer for the ceph file system","authors":"Michael Sevilla, Noah Watkins, C. Maltzahn, I. Nassi, S. Brandt, S. Weil, Greg Farnum, S. Fineberg","doi":"10.1145/2807591.2807607","DOIUrl":"https://doi.org/10.1145/2807591.2807607","url":null,"abstract":"Migrating resources is a useful tool for balancing load in a distributed system, but it is difficult to determine when to move resources, where to move resources, and how much of them to move. We look at resource migration for file system metadata and show how CephFS's dynamic subtree partitioning approach can exploit varying degrees of locality and balance because it can partition the namespace into variable sized units. Unfortunately, the current metadata balancer is complicated and difficult to control because it struggles to address many of the general resource migration challenges inherent to the metadata management problem. To help decouple policy from mechanism, we introduce a programmable storage system that lets the designer inject custom balancing logic. We show the flexibility and transparency of this approach by replicating the strategy of a state-of-the-art metadata balancer and conclude by comparing this strategy to other custom balancers on the same system.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127047971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Elastic job bundling: an adaptive resource request strategy for large-scale parallel applications"
Feng Liu, J. Weissman
In today's batch-queue HPC cluster systems, the user submits a job requesting a fixed number of processors, and the system will not start the job until all of the requested resources become available simultaneously. When cluster workload is high, large jobs experience long waiting times due to this policy. In this paper, we propose a new approach that dynamically decomposes a large job into smaller ones to reduce waiting time and lets the application expand across multiple subjobs while continuously making progress. This approach has three benefits: (i) application turnaround time is reduced, (ii) system fragmentation is diminished, and (iii) fairness is promoted. Our approach does not depend on job queue time prediction but instead exploits available backfill opportunities. Simulation results show that our approach can reduce mean application turnaround time by up to 48%.
{"title":"Elastic job bundling: an adaptive resource request strategy for large-scale parallel applications","authors":"Feng Liu, J. Weissman","doi":"10.1145/2807591.2807610","DOIUrl":"https://doi.org/10.1145/2807591.2807610","url":null,"abstract":"In today's batch queue HPC cluster systems, the user submits a job requesting a fixed number of processors. The system will not start the job until all of the requested resources become available simultaneously. When cluster workload is high, large sized jobs will experience long waiting time due to this policy. In this paper, we propose a new approach that dynamically decomposes a large job into smaller ones to reduce waiting time, and lets the application expand across multiple subjobs while continuously achieving progress. This approach has three benefits: (i) application turnaround time is reduced, (ii) system fragmentation is diminished, and (iii) fairness is promoted. Our approach does not depend on job queue time prediction but exploits available backfill opportunities. Simulation results have shown that our approach can reduce application mean turnaround time by up to 48%.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126805096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}