O. Rübel, Prabhat, Kesheng Wu, H. Childs, J. Meredith, C. Geddes, E. Cormier-Michel, Sean Ahern, G. Weber, P. Messmer, H. Hagen, B. Hamann, E. W. Bethel
One of the central challenges in modern science is the need to quickly derive knowledge and understanding from large, complex collections of data. We present a new approach that deals with this challenge by combining and extending techniques from high performance visual data analysis and scientific data management. This approach is demonstrated within the context of gaining insight from complex, time-varying datasets produced by a laser wakefield accelerator simulation. Our approach leverages histogram-based parallel coordinates for both visual information display as well as a vehicle for guiding a data mining operation. Data extraction and subsetting are implemented with state-of-the-art index/query technology. This approach, while applied here to accelerator science, is generally applicable to a broad set of science applications, and is implemented in a production-quality visual data analysis infrastructure. We conduct a detailed performance analysis and demonstrate good scalability on a distributed memory Cray XT4 system.
{"title":"High performance multivariate visual data exploration for extremely large data","authors":"O. Rübel, Prabhat, Kesheng Wu, H. Childs, J. Meredith, C. Geddes, E. Cormier-Michel, Sean Ahern, G. Weber, P. Messmer, H. Hagen, B. Hamann, E. W. Bethel","doi":"10.1145/1413370.1413422","DOIUrl":"https://doi.org/10.1145/1413370.1413422","url":null,"abstract":"One of the central challenges in modern science is the need to quickly derive knowledge and understanding from large, complex collections of data. We present a new approach that deals with this challenge by combining and extending techniques from high performance visual data analysis and scientific data management. This approach is demonstrated within the context of gaining insight from complex, time-varying datasets produced by a laser wakefield accelerator simulation. Our approach leverages histogram-based parallel coordinates for both visual information display as well as a vehicle for guiding a data mining operation. Data extraction and subsetting are implemented with state-of-the-art index/query technology. This approach, while applied here to accelerator science, is generally applicable to a broad set of science applications, and is implemented in a production-quality visual data analysis infrastructure. We conduct a detailed performance analysis and demonstrate good scalability on a distributed memory Cray XT4 system.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121795289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
H. Eberle, P. García, J. Flich, J. Duato, R. Drost, N. Gura, D. Hopkins, W. Olesinski
We describe a novel way to implement high-radix crossbar switches. Our work is enabled by a new chip interconnect technology called proximity communication (PxC) that offers unparalleled chip IO density. First, we show how a crossbar architecture is topologically mapped onto a PxC-enabled multi-chip module (MCM). Then, we describe a first prototype implementation of a small-scale switch based on a PxC MCM. Finally, we present a performance analysis of two large-scale switch configurations with 288 ports and 1,728 ports, respectively, contrasting a 1-stage PxC-enabled switch and a multi-stage switch using conventional technology. Our simulation results show that (a) arbitration delays in a large 1-stage switch can be considerable, (b) multi-stage switches are extremely susceptible to saturation under non-uniform traffic, a problem that becomes worse for higher radices (1-stage switches, in contrast, are not affected by this problem).
{"title":"High-radix crossbar switches enabled by Proximity Communication","authors":"H. Eberle, P. García, J. Flich, J. Duato, R. Drost, N. Gura, D. Hopkins, W. Olesinski","doi":"10.1109/SC.2008.5219754","DOIUrl":"https://doi.org/10.1109/SC.2008.5219754","url":null,"abstract":"We describe a novel way to implement high-radix crossbar switches. Our work is enabled by a new chip interconnect technology called proximity communication (PxC) that offers unparalleled chip IO density. First, we show how a crossbar architecture is topologically mapped onto a PxC-enabled multi-chip module (MCM). Then, we describe a first prototype implementation of a small-scale switch based on a PxC MCM. Finally, we present a performance analysis of two large-scale switch configurations with 288 ports and 1,728 ports, respectively, contrasting a 1-stage PxC-enabled switch and a multi-stage switch using conventional technology. Our simulation results show that (a) arbitration delays in a large 1-stage switch can be considerable, (b) multi-stage switches are extremely susceptible to saturation under non-uniform traffic, a problem that becomes worse for higher radices (1-stage switches, in contrast, are not affected by this problem).","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121162384","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In task parallel languages, an important factor for achieving a good performance is the use of a cut-off technique to reduce the number of tasks created. Using a cut-off to avoid an excessive number of tasks helps the runtime system to reduce the total overhead associated with task creation, particularlt if the tasks are fine grain. Unfortunately, the best cut-off technique its usually dependent on the application structure or even the input data of the application. We propose a new cut-off technique that, using information from the application collected at runtime, decides which tasks should be pruned to improve the performance of the application. This technique does not rely on the programmer to determine the cut-off technique that is best suited for the application. We have implemented this cut-off in the context of the new OpenMP tasking model. Our evaluation, with a variety of applications, shows that our adaptive cut-off is able to make good decisions and most of the time matches the optimal cut-off that could be set by hand by a programmer.
{"title":"An adaptive cut-off for task parallelism","authors":"A. Duran, J. Corbalán, E. Ayguadé","doi":"10.1145/1413370.1413407","DOIUrl":"https://doi.org/10.1145/1413370.1413407","url":null,"abstract":"In task parallel languages, an important factor for achieving a good performance is the use of a cut-off technique to reduce the number of tasks created. Using a cut-off to avoid an excessive number of tasks helps the runtime system to reduce the total overhead associated with task creation, particularlt if the tasks are fine grain. Unfortunately, the best cut-off technique its usually dependent on the application structure or even the input data of the application. We propose a new cut-off technique that, using information from the application collected at runtime, decides which tasks should be pruned to improve the performance of the application. This technique does not rely on the programmer to determine the cut-off technique that is best suited for the application. We have implemented this cut-off in the context of the new OpenMP tasking model. Our evaluation, with a variety of applications, shows that our adaptive cut-off is able to make good decisions and most of the time matches the optimal cut-off that could be set by hand by a programmer.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116807790","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Existing knowledge based grid workflow languages and composition tools require sophisticated expertise of domain scientists in order to automate the process of managing workflows and its components (activities). So far semantic workflow specification and management has not been addressed from a general and integrated perspective. This paper presents a novel domain oriented approach which features separations of concerns between data meaning and data representation and between activity function (semantic description of workflow activities) and activity type (syntactic description of workflow activities). These separations are implemented as part of abstract grid workflow language (AGWL) which supports the development of grid workflows at a high level (semantic) of abstraction. The corresponding workflow composition tool simplifies grid workflow composition by (i) enabling users to compose grid workflows at the level of data meaning and activity function that shields the complexity of the grid, any specific implementation technology (e.g. Web or Grid service) and any specific data representation, (ii) semi-automatic data flow composition, and (iii) automatic data conversions. We have implemented our approach as part of the ASKALON grid application development and computing environment. We demonstrate the effectiveness of our approach by applying it to a real world meteorology workflow application and report some preliminary results. Our approach can also be adapted to other scientific domains by developing the corresponding ontologies for those domains.
{"title":"A novel domain oriented approach for scientific Grid workflow composition","authors":"Jun Qin, T. Fahringer","doi":"10.1109/SC.2008.5214432","DOIUrl":"https://doi.org/10.1109/SC.2008.5214432","url":null,"abstract":"Existing knowledge based grid workflow languages and composition tools require sophisticated expertise of domain scientists in order to automate the process of managing workflows and its components (activities). So far semantic workflow specification and management has not been addressed from a general and integrated perspective. This paper presents a novel domain oriented approach which features separations of concerns between data meaning and data representation and between activity function (semantic description of workflow activities) and activity type (syntactic description of workflow activities). These separations are implemented as part of abstract grid workflow language (AGWL) which supports the development of grid workflows at a high level (semantic) of abstraction. The corresponding workflow composition tool simplifies grid workflow composition by (i) enabling users to compose grid workflows at the level of data meaning and activity function that shields the complexity of the grid, any specific implementation technology (e.g. Web or Grid service) and any specific data representation, (ii) semi-automatic data flow composition, and (iii) automatic data conversions. We have implemented our approach as part of the ASKALON grid application development and computing environment. We demonstrate the effectiveness of our approach by applying it to a real world meteorology workflow application and report some preliminary results. Our approach can also be adapted to other scientific domains by developing the corresponding ontologies for those domains.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117192729","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
T. Gamblin, B. Supinski, M. Schulz, R. Fowler, D. Reed
Good load balance is crucial on very large parallel systems, but the most sophisticated algorithms introduce dynamic imbalances through adaptation in domain decomposition or use of adaptive solvers. To observe and diagnose imbalance, developers need system-wide, temporally-ordered measurements from full-scale runs. This potentially requires data collection from multiple code regions on all processors over the entire execution. Doing this instrumentation naively can, in combination with the application itself, exceed available I/O bandwidth and storage capacity, and can induce severe behavioral perturbations. We present and evaluate a novel technique for scalable, low-error load balance measurement. This uses a parallel wavelet transform and other parallel encoding methods. We show that our technique collects and reconstructs system-wide measurements with low error. Compression time scales sublinearly with system size and data volume is several orders of magnitude smaller than the raw data. The overhead is low enough for online use in a production environment.
{"title":"Scalable load-balance measurement for SPMD codes","authors":"T. Gamblin, B. Supinski, M. Schulz, R. Fowler, D. Reed","doi":"10.1145/1413370.1413417","DOIUrl":"https://doi.org/10.1145/1413370.1413417","url":null,"abstract":"Good load balance is crucial on very large parallel systems, but the most sophisticated algorithms introduce dynamic imbalances through adaptation in domain decomposition or use of adaptive solvers. To observe and diagnose imbalance, developers need system-wide, temporally-ordered measurements from full-scale runs. This potentially requires data collection from multiple code regions on all processors over the entire execution. Doing this instrumentation naively can, in combination with the application itself, exceed available I/O bandwidth and storage capacity, and can induce severe behavioral perturbations. We present and evaluate a novel technique for scalable, low-error load balance measurement. This uses a parallel wavelet transform and other parallel encoding methods. We show that our technique collects and reconstructs system-wide measurements with low error. Compression time scales sublinearly with system size and data volume is several orders of magnitude smaller than the raw data. The overhead is low enough for online use in a production environment.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127239826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As the number of nodes in high-performance computing environments keeps increasing, faults are becoming common place. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when one's health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of processes migration. This scheme is integrated into an MPI execution environment to transparently sustain health-inflicted node failures, which eradicates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 seconds of prior warning are required to successfully trigger live process migration while similar operating system virtualization mechanisms require 13-24 seconds. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively.
{"title":"Proactive process-level live migration in HPC environments","authors":"Chao Wang, F. Mueller, C. Engelmann, S. Scott","doi":"10.1109/SC.2008.5222634","DOIUrl":"https://doi.org/10.1109/SC.2008.5222634","url":null,"abstract":"As the number of nodes in high-performance computing environments keeps increasing, faults are becoming common place. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when one's health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of processes migration. This scheme is integrated into an MPI execution environment to transparently sustain health-inflicted node failures, which eradicates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 seconds of prior warning are required to successfully trigger live process migration while similar operating system virtualization mechanisms require 13-24 seconds. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127584752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Load imbalance cause significant performance degradation in High Performance Computing applications. In our previous work we showed that load imbalance can be alleviated by modern MT processors that provide mechanisms for controlling the allocation of processors internal resources. In that work, we applied static, hand-tuned resource allocations to balance HPC applications, providing improvements for benchmarks and real applications. In this paper we propose a dynamic process scheduler for the Linux kernel that automatically and transparently balances HPC applications according to their behavior. We tested our new scheduler on an IBM POWER5 machine, which provides a software-controlled prioritization mechanism that allows us to bias the processor resource allocation. Our experiments show that the scheduler reduces the imbalance of HPC applications, achieving results similar to the ones obtained by hand-tuning the applications (up to 16%). Moreover, our solution reduces the application's execution time combining effect of load balance and high responsive scheduling.
{"title":"A dynamic scheduler for balancing HPC applications","authors":"C. Boneti, R. Gioiosa, F. Cazorla, M. Valero","doi":"10.1145/1413370.1413412","DOIUrl":"https://doi.org/10.1145/1413370.1413412","url":null,"abstract":"Load imbalance cause significant performance degradation in High Performance Computing applications. In our previous work we showed that load imbalance can be alleviated by modern MT processors that provide mechanisms for controlling the allocation of processors internal resources. In that work, we applied static, hand-tuned resource allocations to balance HPC applications, providing improvements for benchmarks and real applications. In this paper we propose a dynamic process scheduler for the Linux kernel that automatically and transparently balances HPC applications according to their behavior. We tested our new scheduler on an IBM POWER5 machine, which provides a software-controlled prioritization mechanism that allows us to bias the processor resource allocation. Our experiments show that the scheduler reduces the imbalance of HPC applications, achieving results similar to the ones obtained by hand-tuning the applications (up to 16%). Moreover, our solution reduces the application's execution time combining effect of load balance and high responsive scheduling.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131459515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Auction mechanisms have been proposed as a means to efficiently and fairly schedule jobs in high-performance computing environments. The generalized vickrey auction has long been known to produce efficient allocations while exposing users to truth-revealing incentives, but the algorithms used to compute its payments can be computationally intractable. In this paper we present a novel implementation of the generalized vickrey auction that uses dynamic programming to schedule jobs and compute payments in pseudo-polynomial time. Additionally, we have built a version of the PBS scheduler that uses this algorithm to schedule jobs, and in this paper we present the results of our tests using this scheduler.
{"title":"Efficient auction-based grid reservations using dynamic programming","authors":"A. Mutz, R. Wolski","doi":"10.1145/1413370.1413387","DOIUrl":"https://doi.org/10.1145/1413370.1413387","url":null,"abstract":"Auction mechanisms have been proposed as a means to efficiently and fairly schedule jobs in high-performance computing environments. The generalized vickrey auction has long been known to produce efficient allocations while exposing users to truth-revealing incentives, but the algorithms used to compute its payments can be computationally intractable. In this paper we present a novel implementation of the generalized vickrey auction that uses dynamic programming to schedule jobs and compute payments in pseudo-polynomial time. Additionally, we have built a version of the PBS scheduler that uses this algorithm to schedule jobs, and in this paper we present the results of our tests using this scheduler.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113964371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R. Susukita, H. Ando, M. Aoyagi, H. Honda, Y. Inadomi, Koji Inoue, Shigeru Ishizuki, Yasunori Kimura, Hidemi Komatsu, M. Kurokawa, K. Murakami, H. Shibamura, Shuji Yamamura, Yunqing Yu
To predict application performance on an HPC system is an important technology for designing the computing system and developing applications. However, accurate prediction is a challenge, particularly, in the case of a future coming system with higher performance. In this paper, we present a new method for predicting application performance on HPC systems. This method combines modeling of sequential performance on a single processor and macro-level simulations of applications for parallel performance on the entire system. In the simulation, the execution flow is traced but kernel computations are omitted for reducing the execution time. Validation on a real terascale system showed that the predicted and measured performance agreed within 10% to 20 %. We employed the method in designing a hypothetical petascale system of 32768 SIMD-extended processor cores. For predicting application performance on the petascale system, the macro-level simulation required several hours.
{"title":"Performance prediction of large-scale parallell system and application using macro-level simulation","authors":"R. Susukita, H. Ando, M. Aoyagi, H. Honda, Y. Inadomi, Koji Inoue, Shigeru Ishizuki, Yasunori Kimura, Hidemi Komatsu, M. Kurokawa, K. Murakami, H. Shibamura, Shuji Yamamura, Yunqing Yu","doi":"10.1145/1413370.1413391","DOIUrl":"https://doi.org/10.1145/1413370.1413391","url":null,"abstract":"To predict application performance on an HPC system is an important technology for designing the computing system and developing applications. However, accurate prediction is a challenge, particularly, in the case of a future coming system with higher performance. In this paper, we present a new method for predicting application performance on HPC systems. This method combines modeling of sequential performance on a single processor and macro-level simulations of applications for parallel performance on the entire system. In the simulation, the execution flow is traced but kernel computations are omitted for reducing the execution time. Validation on a real terascale system showed that the predicted and measured performance agreed within 10% to 20 %. We employed the method in designing a hypothetical petascale system of 32768 SIMD-extended processor cores. For predicting application performance on the petascale system, the macro-level simulation required several hours.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128709576","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The unprecedented parallelism of new supercomputing platforms poses tremendous challenges to achieving scalable performance for I/O intensive applications. Performance assessments using traditional I/O system and component benchmarks are difficult to relate back to application I/O requirements. However, the complexity of full applications motivates development of simpler synthetic I/O benchmarks as proxies to the full application. In this paper we examine the I/O requirements of a range of HPC applications and describe how the LLNL IOR synthetic benchmark was chosen as suitable proxy for the diverse workload. We show a procedure for selecting IOR parameters to match the I/O patterns of the selected applications and show it can accurately predict the I/O performance of the full applications. We conclude that IOR is an effective replacement for full-application I/O benchmarks and can bridge the gap of understanding that typically exists between stand-alone benchmarks and the full applications they intend to model.
{"title":"Characterizing and predicting the I/O performance of HPC applications using a parameterized synthetic benchmark","authors":"H. Shan, K. Antypas, J. Shalf","doi":"10.1145/1413370.1413413","DOIUrl":"https://doi.org/10.1145/1413370.1413413","url":null,"abstract":"The unprecedented parallelism of new supercomputing platforms poses tremendous challenges to achieving scalable performance for I/O intensive applications. Performance assessments using traditional I/O system and component benchmarks are difficult to relate back to application I/O requirements. However, the complexity of full applications motivates development of simpler synthetic I/O benchmarks as proxies to the full application. In this paper we examine the I/O requirements of a range of HPC applications and describe how the LLNL IOR synthetic benchmark was chosen as suitable proxy for the diverse workload. We show a procedure for selecting IOR parameters to match the I/O patterns of the selected applications and show it can accurately predict the I/O performance of the full applications. We conclude that IOR is an effective replacement for full-application I/O benchmarks and can bridge the gap of understanding that typically exists between stand-alone benchmarks and the full applications they intend to model.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134354357","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}