Runtime-driven shared last-level cache management for task-parallel programs
Abhisek Pan, Vijay S. Pai
doi: 10.1145/2807591.2807625

Task-parallel programming models that extract concurrency at runtime from input annotations present a promising paradigm for programming multicore processors. By managing dependencies, task assignment, and orchestration, these models markedly reduce the programming effort of parallelization while exposing higher levels of concurrency. In this paper we show that, for multicores with a shared last-level cache (LLC), the concurrency-extraction framework can also be used to improve shared-LLC performance. Based on the input annotations of future tasks, the runtime instructs the hardware to prioritize data blocks with future reuse and to evict blocks without it. These instructions let the hardware preserve all the blocks needed by at least some future tasks while evicting dead blocks. This yields a considerable improvement in cache efficiency over hardware-only replacement policies, which may displace blocks needed by every future task and thus degrade hit rates across the board. The proposed hardware-software technique achieves a mean improvement of 18% in application performance and a mean reduction of 26% in misses over a shared LLC managed by the Least Recently Used (LRU) replacement policy, for a set of input-annotated task-parallel programs written in the OmpSs programming model on the NANOS++ runtime. In contrast, the state-of-the-art thread-based partitioning scheme suffers an average performance loss of 2% and an average increase of 15% in misses over the same baseline.
Randomized algorithms to update partial singular value decomposition on a hybrid CPU/GPU cluster
I. Yamazaki, J. Kurzak, P. Luszczek, J. Dongarra
doi: 10.1145/2807591.2807608

A partial singular value decomposition (SVD) of the sparse matrix representing the data is a powerful tool for data analysis. However, computing the SVD of a large matrix can take significant time even on a current high-performance supercomputer, so there is growing interest in algorithms that compute the SVD quickly enough to process the massive amounts of data generated by modern applications. To respond to this demand, we study randomized algorithms that update the SVD as changes are made to the data, which is often more efficient than recomputing the SVD from scratch. Moreover, in some applications recomputing the SVD may not even be possible because the original data, for which the SVD was already computed, is no longer available. Our experiments with data sets for Latent Semantic Indexing and population clustering demonstrate that these randomized algorithms obtain the desired accuracy of the SVD with a small number of data accesses and, compared to the state-of-the-art updating algorithm, often require much lower computational and communication costs. Our performance results on a hybrid CPU/GPU cluster show that the randomized algorithms achieve significant speedups over the state-of-the-art updating algorithm.
Relative debugging for a highly parallel hybrid computer system
L. D. Rose, Andrew Gontarek, A. Vose, Bob Moench, D. Abramson, M. N. Dinh, Chao Jin
doi: 10.1145/2807591.2807605

Relative debugging traces software errors by comparing two executions of a program concurrently, one a reference version and the other the faulty one. It is particularly effective when code is migrated from one platform to another, which is of significant interest for hybrid computer architectures containing CPUs, accelerators, or coprocessors. In this paper we extend relative debugging to support porting stencil computations to a hybrid computer. We describe a generic data model that allows programmers to examine global state across different types of applications, including MPI/OpenMP, MPI/OpenACC, and UPC programs. We present case studies using a hybrid version of the `stellarator' particle simulation DELTA5D on Titan at ORNL, and the UPC version of the Shallow Water Equations on Crystal, an internal Cray supercomputer. These case studies, which used up to 5,120 GPUs and 32,768 CPU cores, illustrate that the debugger is both effective and practical.
Pushing back the limit of ab-initio quantum transport simulations on hybrid supercomputers
M. Calderara, S. Brück, A. Pedersen, M. H. Bani-Hashemian, J. VandeVondele, M. Luisier
doi: 10.1145/2807591.2807673

The capabilities of CP2K, a density-functional theory package, and OMEN, a nano-device simulator, are combined to study transport phenomena from first principles in unprecedentedly large nanostructures. Based on the Hamiltonian and overlap matrices generated by CP2K for a given system, OMEN solves the Schrödinger equation with open boundary conditions (OBCs) for all possible electron momenta and energies. To accelerate this core operation, a robust algorithm called SplitSolve has been developed. It treats the OBCs on CPUs and the Schrödinger equation on GPUs simultaneously, taking advantage of hybrid nodes. Our key achievements on the Cray XK7 Titan are (i) a reduction in time-to-solution of more than one order of magnitude compared to standard methods, enabling the simulation of structures with more than 50,000 atoms, (ii) a parallel efficiency of 97% when scaling from 756 up to 18,564 nodes, and (iii) a sustained performance of 15 DP-PFlop/s.
Improving concurrency and asynchrony in multithreaded MPI applications using software offloading
K. Vaidyanathan, Dhiraj D. Kalamkar, K. Pamnany, J. Hammond, P. Balaji, Dipankar Das, Jongsoo Park, B. Joó
doi: 10.1145/2807591.2807602

We present a new approach to multithreaded communication and asynchronous progress in MPI applications that offloads communication processing to a dedicated thread. The central premise is that, given the rapidly increasing core counts on modern systems, the MPI performance gained by dedicating a thread to drive communication outweighs the small loss of computational resources, particularly when communication and computation can be overlapped. Our approach lets application threads make MPI calls concurrently, enqueuing them as communication tasks to be processed by the dedicated communication thread. This not only guarantees progress for these operations but also reduces load imbalance. Our implementation additionally reduces, by a significant margin, the mutual-exclusion overhead seen in existing implementations for applications using MPI_THREAD_MULTIPLE. The technique requires no modification to the application, and we demonstrate significant performance improvements (up to 2X) for QCD, 1-D FFT, and deep-learning CNN applications.
Analyzing and mitigating the impact of manufacturing variability in power-constrained supercomputing
Y. Inadomi, Tapasya Patki, Koji Inoue, M. Aoyagi, B. Rountree, M. Schulz, D. Lowenthal, Y. Wada, K. Fukazawa, M. Ueda, Masaaki Kondo, Ikuo Miyoshi
doi: 10.1145/2807591.2807638

A key challenge in next-generation supercomputing is to schedule limited power resources effectively. Modern processors suffer from increasingly large power variations owing to the chip manufacturing process. These variations lead to power inhomogeneity in current systems and manifest as performance inhomogeneity in power-constrained environments, drastically limiting supercomputing performance. We present a first-of-its-kind study of manufacturing variability on four production HPC systems spanning four microarchitectures, analyze its impact on HPC applications, and propose a novel variation-aware power-budgeting scheme to maximize effective application performance. Our low-cost, scalable budgeting algorithm strives for performance homogeneity under a power constraint by deriving application-specific, module-level power allocations. Experimental results on a 1,920-socket system show up to 5.4X speedup, with an average speedup of 1.8X across all benchmarks, compared to a variation-unaware power-allocation scheme.
AnalyzeThis: an analysis workflow-aware storage system
Hyogi Sim, Youngjae Kim, Sudharshan S. Vazhkudai, Devesh Tiwari, Ali Anwar, A. Butt, L. Ramakrishnan
doi: 10.1145/2807591.2807622

The need for novel data analysis is urgent in the face of the data deluge from modern applications. Traditional approaches to data analysis incur significant data-movement costs, shuttling data back and forth between the storage system and the processor. Emerging Active Flash devices enable processing on the flash, where the data already resides, and an array of such devices allows us to revisit how analysis workflows interact with storage systems. By seamlessly blending flash storage and data analysis, we create an analysis workflow-aware storage system, AnalyzeThis. Our guiding principle is that analysis-awareness be deeply ingrained in every layer of the storage, elevating data analyses to first-class citizens and transforming AnalyzeThis into a potent analytics-aware appliance. We implement AnalyzeThis atop an emulation platform of the Active Flash array. Our results indicate that AnalyzeThis is viable, expediting workflow execution and minimizing data movement.
Adaptive data placement for staging-based coupled scientific workflows
Qian Sun, Tong Jin, Melissa Romanus, H. Bui, Fan Zhang, Hongfeng Yu, H. Kolla, S. Klasky, Jacqueline H. Chen, M. Parashar
doi: 10.1145/2807591.2807669

Data staging and in-situ/in-transit data processing are emerging as attractive approaches for supporting extreme-scale scientific workflows; they improve end-to-end performance by enabling runtime data sharing between the coupled simulation and data-analytics components of a workflow. However, the complex and dynamic data-exchange patterns exhibited by these workflows, coupled with varied data-access behaviors, make efficient data placement within the staging area challenging. In this paper we present an adaptive data-placement approach that addresses these challenges. Our approach adapts placement to application-specific, dynamic data-access patterns, and applies access-pattern-driven, location-aware mechanisms to reduce data-access costs and support efficient data sharing among the workflow components. We experimentally demonstrate the effectiveness of our approach on the Cray XK7 Titan using a real combustion-analysis workflow. The evaluation shows that our approach can effectively improve data-access performance and the overall efficiency of coupled scientific workflows.
Improving the scalability of the ocean barotropic solver in the community earth system model
Yong Hu, Xiaomeng Huang, A. Baker, Y. Tseng, F. Bryan, J. Dennis, Guangwen Yang
doi: 10.1145/2807591.2807596

High-resolution climate simulations are increasingly in demand and require tremendous computing resources. In the Community Earth System Model (CESM), the Parallel Ocean Program (POP) is computationally expensive for high-resolution grids (e.g., 0.1°) and is frequently the least scalable component of CESM for certain production simulations. In particular, the modified Preconditioned Conjugate Gradient (PCG) method used to solve the elliptic system of equations in the barotropic mode scales poorly at high core counts, which is problematic for high-resolution simulations. In this work we show that the communication costs of the barotropic solver occupy an increasing portion of total POP execution time as core counts grow. To mitigate this problem, we implement a preconditioned Chebyshev-type iterative method in POP (called P-CSI), which requires far fewer global reductions than PCG. We also develop an effective block preconditioner based on the Error Vector Propagation method so that P-CSI attains a competitive convergence rate. The improved scalability of P-CSI yields a 5.2x speedup of the barotropic mode in high-resolution POP on 16,875 cores, which translates into a 1.7x speedup of the overall POP simulation. Further, we verify via an ensemble-based statistical method that the new solver produces an ocean climate consistent with the original one.
IOrchestra: supporting high-performance data-intensive applications in the cloud via collaborative virtualization
R. C. Chiang, H. H. Huang, Timothy Wood, Changbin Liu, O. Spatscheck
doi: 10.1145/2807591.2807633

Multi-tier data-intensive applications are widely deployed in virtualized data centers for high scalability and reliability. Because response time is vital to user satisfaction, good performance must be achieved at each tier of an application to minimize overall latency. In such virtualized environments, however, each tier (e.g., application, database, web) is likely hosted by different virtual machines (VMs) on multiple physical servers, where a guest VM is unaware of changes outside its domain and the hypervisor does not know the configuration and runtime status of a guest VM. As a result, isolated virtualization domains lead to performance unpredictability and variance. In this paper we propose IOrchestra, a holistic collaborative-virtualization framework that bridges the semantic gaps in I/O stacks and system information across multiple VMs, improves virtual I/O performance through collaboration from guest domains, and increases resource utilization in data centers. We present several case studies demonstrating that IOrchestra addresses numerous drawbacks of current practice and reduces the I/O latency of various distributed cloud applications by up to 31%.