K. Bowers, B. Albright, B. Bergen, L. Yin, K. Barker, D. Kerbyson
We demonstrate the outstanding performance and scalability of the VPIC kinetic plasma modeling code on the heterogeneous IBM Roadrunner supercomputer at Los Alamos National Laboratory. VPIC is a three-dimensional, relativistic, electromagnetic, particle-in-cell (PIC) code that self-consistently evolves a kinetic plasma. VPIC simulations of laser plasma interaction were conducted at unprecedented fidelity and scale (up to 1.0 × 10^12 particles on as many as 136 × 10^6 voxels) to model accurately the particle trapping physics occurring within a laser-driven hohlraum in an inertial confinement fusion experiment. During a parameter study of laser reflectivity as a function of laser intensity under experimentally realizable hohlraum conditions, we measured sustained performance exceeding 0.374 Pflop/s (single precision), with the inner loop itself achieving 0.488 Pflop/s (s.p.). Given the increasing importance of data motion limitations, it is notable that this was measured in a PIC calculation, a technique that typically requires more data motion per computation than other techniques often used to demonstrate supercomputer performance (such as dense matrix calculations, molecular dynamics N-body calculations, and Monte Carlo calculations). This capability opens up the exciting possibility of using VPIC to model, from first principles, an issue critical to the success of the multi-billion-dollar DOE/NNSA National Ignition Facility.
{"title":"0.374 Pflop/s trillion-particle kinetic modeling of laser plasma interaction on roadrunner","authors":"K. Bowers, B. Albright, B. Bergen, L. Yin, K. Barker, D. Kerbyson","doi":"10.1109/SC.2008.5222734","DOIUrl":"https://doi.org/10.1109/SC.2008.5222734","url":null,"abstract":"We demonstrate the outstanding performance and scalability of the VPIC kinetic plasma modeling code on the heterogeneous IBM Roadrunner supercomputer at Los Alamos National Laboratory. VPIC is a three-dimensional, relativistic, electromagnetic, particle-in-cell (PIC) code that self-consistently evolves a kinetic plasma. VPIC simulations of laser plasma interaction were conducted at unprecedented fidelity and scale-up to 1.0 times 1012 particles on as many as 136 times 106 voxels-to model accurately the particle trapping physics occurring within a laser-driven hohlraum in an inertial confinement fusion experiment. During a parameter study of laser reflectivity as a function of laser intensity under experimentally realizable hohlraum conditions, we measured sustained performance exceeding 0.374 Pflop/s (s.p.) with the inner loop itself achieving 0.488 Pflop/s (s.p.). Given the increasing importance of data motion limitations, it is notable that this was measured in a PIC calculation-a technique that typically requires more data motion per computation than other techniques (such as dense matrix calculations, molecular dynamics N-body calculations and Monte-Carlo calculations) often used to demonstrate supercomputer performance. This capability opens up the exciting possibility of using VPIC to model, from first-principles, an issue critical to the success of the multi-billion dollar DOE/NNSA National Ignition Facility.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123369913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Grids promote new modes of scientific collaboration and discovery by connecting distributed instruments, data, and computing facilities. Because many resources are shared, application performance can vary widely and unexpectedly. We describe a novel performance analysis framework that reasons temporally and qualitatively about performance data from multiple monitoring levels and sources. The framework periodically analyzes application performance states by generating and interpreting signatures containing structural and temporal features from time-series data. Signatures are compared to expected behaviors and, in case of mismatch, the framework hints at causes of degraded performance, based on unexpected behavior characteristics previously learned by exposing the application to known performance stress factors. Experiments with two scientific applications reveal signatures that have distinct characteristics during well-performing versus poor-performing executions. The ability to automatically and compactly generate signatures capturing fundamental differences between good and poor application performance states is essential to improving the quality of service for Grid applications.
{"title":"Analysis of application heartbeats: Learning structural and temporal features in time series data for identification of performance problems","authors":"Emma S. Buneci, D. Reed","doi":"10.1109/SC.2008.5219753","DOIUrl":"https://doi.org/10.1109/SC.2008.5219753","url":null,"abstract":"Grids promote new modes of scientific collaboration and discovery by connecting distributed instruments, data and computing facilities. Because many resources are shared, application performance can vary widely and unexpectedly. We describe a novel performance analysis framework that reasons temporally and qualitatively about performance data from multiple monitoring levels and sources. The framework periodically analyzes application performance states by generating and interpreting signatures containing structural and temporal features from time-series data. Signatures are compared to expected behaviors and in case of mismatches, the framework hints at causes of degraded performance, based on unexpected behavior characteristics previously learned by application exposure to known performance stress factors. Experiments with two scientific applications reveal signatures that have distinct characteristics during well-performing versus poor-performing executions. The ability to automatically and compactly generate signatures capturing fundamental differences between good and poor application performance states is essential to improving the quality of service for Grid applications.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125160752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Much of the GPU performance "hype" has focused on tightly coupled applications with small memory bandwidth requirements, e.g., N-body, but GPUs are also commodity vector machines sporting substantial memory bandwidth; effective programming methodologies for exploiting it, however, have been poorly studied. Our new 3-D FFT kernel, written in NVIDIA CUDA, achieves nearly 80 GFLOPS on a top-end GPU, more than three times faster than any existing FFT implementation on GPUs, including CUFFT. Careful programming techniques are employed to fully exploit modern GPU hardware characteristics while overcoming their limitations, including on-chip shared memory utilization, optimizing the number of threads and registers through appropriate localization, and avoiding low-speed strided memory accesses. Applied to real applications, our kernel achieves orders-of-magnitude improvements in power- and cost-versus-performance metrics. The off-card bandwidth limitation remains an issue; it can be alleviated somewhat by confining application kernels to the card, though the ideal solution is faster GPU interfaces.
{"title":"Bandwidth intensive 3-D FFT kernel for GPUs using CUDA","authors":"Akira Nukada, Y. Ogata, Toshio Endo, S. Matsuoka","doi":"10.1145/1413370.1413376","DOIUrl":"https://doi.org/10.1145/1413370.1413376","url":null,"abstract":"Most GPU performance ldquohypesrdquo have focused around tightly-coupled applications with small memory bandwidth requirements e.g., N-body, but GPUs are also commodity vector machines sporting substantial memory bandwidth; however, effective programming methodologies thereof have been poorly studied. Our new 3-D FFT kernel, written in NVIDIA CUDA, achieves nearly 80 GFLOPS on a top-end GPU, being more than three times faster than any existing FFT implementations on GPUs including CUFFT. Careful programming techniques are employed to fully exploit modern GPU hardware characteristics while overcoming their limitations, including on-chip shared memory utilization, optimizing the number of threads and registers through appropriate localization, and avoiding low-speed stride memory accesses. Our kernel applied to real applications achieves orders of magnitude boost in power&cost vs. performance metrics. The off-card bandwidth limitation is still an issue, which could be alleviated somewhat with application kernels confinement within the card, while ideal solution being facilitation of faster GPU interfaces.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114787768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gregory L. Lee, D. Ahn, D. Arnold, B. Supinski, M. LeGendre, B. Miller, M. Schulz, B. Liblit
Petascale systems will present several new challenges to performance and correctness tools. Such machines may contain millions of cores, requiring that tools use scalable data structures and analysis algorithms to collect and to process application data. In addition, at such scales, each tool itself will become a large parallel application - already, debugging the full Blue Gene/L (BG/L) installation at the Lawrence Livermore National Laboratory requires employing 1664 tool daemons. To reach such sizes and beyond, tools must use a scalable communication infrastructure and manage their own tool processes efficiently. Some system resources, such as the file system, may also become tool bottlenecks. In this paper, we present challenges to petascale tool development, using the Stack Trace Analysis Tool (STAT) as a case study. STAT is a lightweight tool that gathers and merges stack traces from a parallel application to identify process equivalence classes. We use results gathered at thousands of tasks on an InfiniBand cluster and results at up to 208K processes on BG/L to identify current scalability issues as well as challenges that will be faced at the petascale. We then present implemented solutions to these challenges and show the resulting performance improvements. We also discuss future plans to meet the debugging demands of petascale machines.
{"title":"Lessons learned at 208K: Towards debugging millions of cores","authors":"Gregory L. Lee, D. Ahn, D. Arnold, B. Supinski, M. LeGendre, B. Miller, M. Schulz, B. Liblit","doi":"10.1109/SC.2008.5218557","DOIUrl":"https://doi.org/10.1109/SC.2008.5218557","url":null,"abstract":"Petascale systems will present several new challenges to performance and correctness tools. Such machines may contain millions of cores, requiring that tools use scalable data structures and analysis algorithms to collect and to process application data. In addition, at such scales, each tool itself will become a large parallel application - already, debugging the full Blue-Gene/L (BG/L) installation at the Lawrence Livermore National Laboratory requires employing 1664 tool daemons. To reach such sizes and beyond, tools must use a scalable communication infrastructure and manage their own tool processes efficiently. Some system resources, such as the file system, may also become tool bottlenecks. In this paper, we present challenges to petascale tool development, using the stack trace analysis tool (STAT) as a case study. STAT is a lightweight tool that gathers and merges stack traces from a parallel application to identify process equivalence classes. We use results gathered at thousands of tasks on an Infiniband cluster and results up to 208 K processes on BG/L to identify current scalability issues as well as challenges that will be faced at the petascale. We then present implemented solutions to these challenges and show the resulting performance improvements. We also discuss future plans to meet the debugging demands of petascale machines.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129044204","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R. Sampath, Santi S. Adavani, H. Sundar, I. Lashuk, G. Biros
In this article, we present Dendro, a suite of parallel algorithms for the discretization and solution of partial differential equations (PDEs) involving second-order elliptic operators. Dendro uses trilinear finite element discretizations constructed using octrees. Dendro comprises four main modules: a bottom-up octree generation and 2:1 balancing module, a meshing module, a geometric multiplicative multigrid module, and a module for adaptive mesh refinement (AMR). Here, we focus on the multigrid and AMR modules. The key features of Dendro are coarsening/refinement, inter-octree transfers of scalar and vector fields, and parallel partitioning of multilevel octree forests. We describe a bottom-up algorithm for constructing the coarser multigrid levels. The input is an arbitrary 2:1 balanced octree-based mesh representing the fine-level mesh. The output is a set of octrees and meshes that are used in the multigrid sweeps. Also, we describe matrix-free implementations for the discretized PDE operators and the intergrid transfer operations. We present results on up to 4096 CPUs on the Cray XT3 ("BigBen"), the Intel 64 system ("Abe"), and the Sun Constellation Linux cluster ("Ranger").
{"title":"Dendro: Parallel algorithms for multigrid and AMR methods on 2:1 balanced octrees","authors":"R. Sampath, Santi S. Adavani, H. Sundar, I. Lashuk, G. Biros","doi":"10.1109/SC.2008.5218558","DOIUrl":"https://doi.org/10.1109/SC.2008.5218558","url":null,"abstract":"In this article, we present Dendro, a suite of parallel algorithms for the discretization and solution of partial differential equations (PDEs) involving second-order elliptic operators. Dendro uses trilinear finite element discretizations constructed using octrees. Dendro, comprises four main modules: a bottom-up octree generation and 2:1 balancing module, a meshing module, a geometric multiplicative multigrid module, and a module for adaptive mesh refinement (AMR). Here, we focus on the multigrid and AMR modules. The key features of Dendro are coarsening/refinement, inter-octree transfers of scalar and vector fields, and parallel partition of multilevel octree forests. We describe a bottom-up algorithm for constructing the coarser multigrid levels. The input is an arbitrary 2:1 balanced octree-based mesh, representing the fine level mesh. The output is a set of octrees and meshes that are used in the multigrid sweeps. Also, we describe matrix-free implementations for the discretized PDE operators and the intergrid transfer operations. We present results on up to 4096 CPUs on the Cray XT3 (ldquoBigBenrdquo), the Intel 64 system (ldquoAberdquo), and the Sun Constellation Linux cluster (ldquoRangerrdquo).","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131675062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Schlosser, Michael P. Ryan, Ricardo Taborda-Rios, J. C. López-Hernández, D. O'Hallaron, J. Bielak
Large-scale earthquake simulation requires source datasets which describe the highly heterogeneous physical characteristics of the earth in the region under simulation. Physical characteristic datasets are the first stage in a simulation pipeline which includes mesh generation, partitioning, solving, and visualization. In practice, the data is produced in an ad hoc fashion for each set of experiments, which has several significant shortcomings, including lower performance, decreased repeatability and comparability, and a longer time to science, an increasingly important metric. As a solution to these problems, we propose a new approach for providing scientific data to ground motion simulations, in which ground model datasets are fully materialized into octrees stored on disk, which can be queried more efficiently (by up to two orders of magnitude) than the underlying community velocity model programs. While octrees have long been used to store spatial datasets, they have not yet been used at the scale we propose. We further propose that these datasets can be provided as a service, either over the Internet or, more likely, in a data center or supercomputing center in which the simulations take place. Since constructing these octrees is itself a challenge, we present three data-parallel techniques for efficiently building them, which can significantly decrease the build time from days or weeks to hours using commodity clusters. This approach typifies a broader shift toward "science as a service" techniques in which scientific computation and storage services become more tightly intertwined.
{"title":"Materialized community ground models for large-scale earthquake simulation","authors":"S. Schlosser, Michael P. Ryan, Ricardo Taborda-Rios, J. C. López-Hernández, D. O'Hallaron, J. Bielak","doi":"10.1109/SC.2008.5215657","DOIUrl":"https://doi.org/10.1109/SC.2008.5215657","url":null,"abstract":"Large-scale earthquake simulation requires source datasets which describe the highly heterogeneous physical characteristics of the earth in the region under simulation. Physical characteristic datasets are the first stage in a simulation pipeline which includes mesh generation, partitioning, solving, and visualization. In practice, the data is produced in an ad-hoc fashion for each set of experiments, which has several significant shortcomings including lower performance, decreased repeatability and comparability, and a longer time to science, an increasingly important metric. As a solution to these problems, we propose a new approach for providing scientific data to ground motion simulations, in which ground model datasets are fully materialized into octress stored on disk, which can be more efficiently queried (by up to two orders of magnitude) than the underlying community velocity model programs. While octrees have long been used to store spatial datasets, they have not yet been used at the scale we propose. We further propose that these datasets can be provided as a service, either over the Internet or, more likely, in a data center or supercomputing center in which the simulations take place. Since constructing these octrees is itself a challenge, we present three data-parallel techniques for efficiently building them, which can significantly decrease the build time from days or weeks to hours using commodity clusters. This approach typifies a broader shift toward science as a service techniques in which scientific computation and storage services become more tightly intertwined.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116722868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Marek Wieczorek, Stefan Podlipnig, R. Prodan, T. Fahringer
Grid economy models have long been considered a promising alternative to classical Grid resource management, due to their dynamic and decentralized nature, and because the financial valuation of resources and services is inherent in any such model. In particular, auction models are widely used in existing Grid research, as they are easy to implement and have been shown to successfully manage resource allocation on the Grid market. The focus of the current work is on workflow scheduling in a Grid resource allocation model based on Continuous Double Auctions (CDAs). We analyze different scheduling strategies that can be applied by the user to execute workflows in such an environment, and try to identify general behavioral patterns that can lead to fast and cheap workflow execution. In the experimental study, we show that under certain circumstances some benefit can be gained by applying an "aggressive" scheduling strategy.
{"title":"Applying double auctions for scheduling of workflows on the Grid","authors":"Marek Wieczorek, Stefan Podlipnig, R. Prodan, T. Fahringer","doi":"10.1109/SC.2008.5218071","DOIUrl":"https://doi.org/10.1109/SC.2008.5218071","url":null,"abstract":"Grid economy models have long been considered as a promising alternative for the classical Grid resource management, due to their dynamic and decentralized nature, and because the financial valuation of resources and services is inherent in any such model. In particular, auction models are widely used in the existing Grid research, as they are easy to implement and are shown to successfully manage resource allocation on the Grid market. The focus on the current work is on workflow scheduling in the Grid resource allocation model based on Continuous Double Auctions (CDA). We analyze different scheduling strategies that can be applied by the user to execute workflows in such an environment, and try to identify the general behavioral patterns that can lead to a fast and cheap workflow execution. In the experimental study, we show that under certain circumstances some benefit can be gained by applying an ldquoaggressiverdquo scheduling strategy.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128211641","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
For wide-area high-performance applications, lightpaths provide 10 Gbps connectivity, and multi-core hosts with PCI-Express can drive such data rates. However, sustaining such end-to-end application throughputs across connections of thousands of miles remains challenging, and current performance studies of such solutions are very limited. We present an experimental study of two solutions for achieving such throughputs, based on: (a) 10 Gbps Ethernet with TCP/IP transport protocols, and (b) InfiniBand and its wide-area extensions. For both, we generate performance profiles over 10 Gbps connections of lengths up to 8600 miles, and discuss the components, complexity, and limitations of sustaining such throughputs using different connections and host configurations. Our results indicate that the InfiniBand solution is better suited to applications with a single large flow, while the 10GigE solution is better for those with multiple competing flows.
{"title":"Wide-area performance profiling of 10GigE and InfiniBand technologies","authors":"N. Rao, Weikuan Yu, W. Wing, S. Poole, J. Vetter","doi":"10.1109/SC.2008.5214435","DOIUrl":"https://doi.org/10.1109/SC.2008.5214435","url":null,"abstract":"For wide-area high-performance applications, light-paths provide 10Gbps connectivity, and multi-core hosts with PCI-Express can drive such data rates. However, sustaining such end-to-end application throughputs across connections of thousands of miles remains challenging, and the current performance studies of such solutions are very limited. We present an experimental study of two solutions to achieve such throughputs based on: (a) 10Gbps Ethernet with TCP/IP transport protocols, and (b) InfiniBand and its wide-area extensions. For both, we generate performance profiles over 10Gbps connections of lengths up to 8600 miles, and discuss the components, complexity, and limitations of sustaining such throughputs, using different connections and host configurations. Our results indicate that IB solution is better suited for applications with a single large flow, and 10GigE solution is better for those with multiple competing flows.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"103 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121528468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The trend in parallel computing toward clusters running thousands of cooperating processes per application has led to an I/O bottleneck that has only gotten more severe as the CPU density of clusters has increased. Current parallel file systems provide large amounts of aggregate I/O bandwidth; however, they do not achieve the high degrees of metadata scalability required to manage files distributed across hundreds or thousands of storage nodes. In this paper we examine the use of collective communication between the storage servers to improve the scalability of file metadata operations. In particular, we apply server-to-server communication to simplify consistency checking and improve the performance of file creation, file removal, and file stat. Our results indicate that collective communication is an effective scheme for simplifying consistency checks and significantly improving the performance for several real metadata intensive workloads.
{"title":"Using server-to-server communication in parallel file systems to simplify consistency and improve performance","authors":"P. Carns, B. Settlemyer, W. Ligon","doi":"10.1145/1413370.1413377","DOIUrl":"https://doi.org/10.1145/1413370.1413377","url":null,"abstract":"The trend in parallel computing toward clusters running thousands of cooperating processes per application has led to an I/O bottleneck that has only gotten more severe as the CPU density of clusters has increased. Current parallel file systems provide large amounts of aggregate I/O bandwidth; however, they do not achieve the high degrees of metadata scalability required to manage files distributed across hundreds or thousands of storage nodes. In this paper we examine the use of collective communication between the storage servers to improve the scalability of file metadata operations. In particular, we apply server-to-server communication to simplify consistency checking and improve the performance of file creation, file removal, and file stat. Our results indicate that collective communication is an effective scheme for simplifying consistency checks and significantly improving the performance for several real metadata intensive workloads.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128179391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The emerging class of adaptive, real-time, data-driven applications poses a significant problem for today's HPC systems. In general, it is extremely difficult for queuing-system-controlled HPC resources to make and guarantee a tightly bounded prediction regarding the time at which a newly submitted application will execute. While a reservation-based approach partially addresses the problem, it can create severe resource under-utilization (unused reservations, necessary scheduled idle slots, underutilized reservations, etc.) that resource providers are eager to avoid. In contrast, this paper presents a fundamentally different approach to guaranteeing predictable execution. By creating a virtualized application layer called the performance container, and opportunistically multiplexing concurrent performance containers through the application of formal feedback control theory, we regulate the job's progress such that the job meets its deadline without requiring exclusive access to resources, even in the presence of a wide class of unexpected disturbances. Our evaluation using two widely used applications, WRF and BLAST, on an 8-core server shows that our approach is predictable and meets deadlines with an average error of 3.4% while achieving high overall utilization.
{"title":"Feedback-controlled resource sharing for predictable eScience","authors":"Sang-Min Park, M. Humphrey","doi":"10.1145/1413370.1413384","DOIUrl":"https://doi.org/10.1145/1413370.1413384","url":null,"abstract":"The emerging class of adaptive, real-time, data-driven applications is a significant problem for today's HPC systems. In general, it is extremely difficult for queuing-system-controlled HPC resources to make and guarantee a tightly-bounded prediction regarding the time at which a newly-submitted application will execute. While a reservation-based approach partially addresses the problem, it can create severe resource under-utilization (unused reservations, necessary scheduled idle slots, underutilized reservations, etc.) that resource providers are eager to avoid. In contrast, this paper presents a fundamentally different approach to guarantee predictable execution. By creating a virtualized application layer called the performance container, and opportunistically multiplexing concurrent performance containers through the application of formal feedback control theory, we regulate the job's progress such that the job meets its deadline without requiring exclusive access to resources even in the presence of a wide class of unexpected disturbances. Our evaluation using two widely-used applications, WRF and BLAST, on an 8-core server show our approach is predictable and meets deadlines with 3.4 % of errors on average while achieving high overall utilization.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130516676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}