Runtime-driven shared last-level cache management for task-parallel programs
Abhisek Pan, Vijay S. Pai
doi: 10.1145/2807591.2807625

Task-parallel programming models that extract concurrency at runtime from input annotations present a promising paradigm for programming multicore processors. By managing dependencies, task assignment, and orchestration, these models markedly reduce the programming effort of parallelization while exposing higher levels of concurrency. In this paper we show that, for multicores with a shared last-level cache (LLC), the concurrency-extraction framework can also be used to improve shared-LLC performance. Based on the input annotations of future tasks, the runtime instructs the hardware to prioritize data blocks with future reuse and to evict blocks without it. These instructions let the hardware preserve all the blocks needed by at least some future tasks while evicting dead blocks. This yields a considerable improvement in cache efficiency over hardware-only replacement policies, which may displace blocks needed by every future task and thus degrade hit rates across the board. The proposed hardware-software technique achieves a mean improvement of 18% in application performance and a mean reduction of 26% in misses over a shared LLC managed by the Least Recently Used (LRU) replacement policy, for a set of input-annotated task-parallel programs written in the OmpSs programming model on the NANOS++ runtime. In contrast, the state-of-the-art thread-based partitioning scheme suffers an average performance loss of 2% and an average increase of 15% in misses over the same baseline.
Randomized algorithms to update partial singular value decomposition on a hybrid CPU/GPU cluster
I. Yamazaki, J. Kurzak, P. Luszczek, J. Dongarra
doi: 10.1145/2807591.2807608

A partial singular value decomposition (SVD) of the sparse matrix representing the data is a powerful tool for data analysis. However, computing the SVD of a large matrix can take significant time even on a current high-performance supercomputer, so there is growing interest in algorithms that compute the SVD quickly enough to process the massive amounts of data generated by modern applications. To respond to this demand, we study randomized algorithms that update the SVD as changes are made to the data, which is often more efficient than recomputing the SVD from scratch. Moreover, in some applications recomputing the SVD may not even be possible because the original data, for which the SVD was already computed, is no longer available. Our experiments with data sets for Latent Semantic Indexing and population clustering demonstrate that these randomized algorithms obtain the desired accuracy of the SVD with a small number of data accesses and, compared to the state-of-the-art updating algorithm, often require much lower computational and communication costs. Our performance results on a hybrid CPU/GPU cluster show that the randomized algorithms achieve significant speedups over the state-of-the-art updating algorithm.
Relative debugging for a highly parallel hybrid computer system
L. D. Rose, Andrew Gontarek, A. Vose, Bob Moench, D. Abramson, M. N. Dinh, Chao Jin
doi: 10.1145/2807591.2807605

Relative debugging traces software errors by comparing two executions of a program concurrently, one a reference version and the other the faulty one. It is particularly effective when code is migrated from one platform to another, which is of significant interest for hybrid computer architectures containing CPUs, accelerators, or coprocessors. In this paper we extend relative debugging to support porting stencil computations to a hybrid computer. We describe a generic data model that allows programmers to examine global state across different types of applications, including MPI/OpenMP, MPI/OpenACC, and UPC programs. We present case studies using a hybrid version of the `stellarator' particle simulation DELTA5D on Titan at ORNL, and the UPC version of the Shallow Water Equations on Crystal, an internal Cray supercomputer. These case studies, which used up to 5,120 GPUs and 32,768 CPU cores, illustrate that the debugger is both effective and practical.
Pushing back the limit of ab-initio quantum transport simulations on hybrid supercomputers
M. Calderara, S. Brück, A. Pedersen, M. H. Bani-Hashemian, J. VandeVondele, M. Luisier
doi: 10.1145/2807591.2807673

The capabilities of CP2K, a density-functional theory package, and OMEN, a nano-device simulator, are combined to study transport phenomena from first principles in unprecedentedly large nanostructures. Based on the Hamiltonian and overlap matrices generated by CP2K for a given system, OMEN solves the Schrödinger equation with open boundary conditions (OBCs) for all possible electron momenta and energies. To accelerate this core operation, a robust algorithm called SplitSolve has been developed. It treats the OBCs on CPUs and the Schrödinger equation on GPUs simultaneously, taking advantage of hybrid nodes. Our key achievements on the Cray XK7 Titan are (i) a reduction in time-to-solution of more than one order of magnitude compared to standard methods, enabling the simulation of structures with more than 50,000 atoms, (ii) a parallel efficiency of 97% when scaling from 756 up to 18,564 nodes, and (iii) a sustained performance of 15 DP-PFlop/s.
Improving concurrency and asynchrony in multithreaded MPI applications using software offloading
K. Vaidyanathan, Dhiraj D. Kalamkar, K. Pamnany, J. Hammond, P. Balaji, Dipankar Das, Jongsoo Park, B. Joó
doi: 10.1145/2807591.2807602

We present a new approach to multithreaded communication and asynchronous progress in MPI applications that offloads communication processing to a dedicated thread. The central premise is that, given the rapidly increasing core counts on modern systems, the MPI performance gained by dedicating a thread to drive communication outweighs the small loss of computational resources, particularly when communication and computation can be overlapped. Our approach lets application threads make MPI calls concurrently, enqueuing them as communication tasks to be processed by the dedicated communication thread. This not only guarantees progress for these operations but also reduces load imbalance. Our implementation additionally reduces, by a significant margin, the mutual-exclusion overhead seen in existing implementations for applications using MPI_THREAD_MULTIPLE. The technique requires no modification to the application, and we demonstrate significant performance improvements (up to 2X) for QCD, 1-D FFT, and deep-learning CNN applications.
Analyzing and mitigating the impact of manufacturing variability in power-constrained supercomputing
Y. Inadomi, Tapasya Patki, Koji Inoue, M. Aoyagi, B. Rountree, M. Schulz, D. Lowenthal, Y. Wada, K. Fukazawa, M. Ueda, Masaaki Kondo, Ikuo Miyoshi
doi: 10.1145/2807591.2807638

A key challenge in next-generation supercomputing is to schedule limited power resources effectively. Modern processors suffer from increasingly large power variations owing to the chip manufacturing process. These variations lead to power inhomogeneity in current systems and manifest as performance inhomogeneity in power-constrained environments, drastically limiting supercomputing performance. We present a first-of-its-kind study of manufacturing variability on four production HPC systems spanning four microarchitectures, analyze its impact on HPC applications, and propose a novel variation-aware power-budgeting scheme to maximize effective application performance. Our low-cost, scalable budgeting algorithm strives for performance homogeneity under a power constraint by deriving application-specific, module-level power allocations. Experimental results on a 1,920-socket system show up to 5.4X speedup, with an average speedup of 1.8X across all benchmarks, compared to a variation-unaware power-allocation scheme.
AnalyzeThis: an analysis workflow-aware storage system
Hyogi Sim, Youngjae Kim, Sudharshan S. Vazhkudai, Devesh Tiwari, Ali Anwar, A. Butt, L. Ramakrishnan
doi: 10.1145/2807591.2807622

The need for novel data analysis is urgent in the face of the data deluge from modern applications. Traditional approaches to data analysis incur significant data-movement costs, shuttling data back and forth between the storage system and the processor. Emerging Active Flash devices enable processing on the flash, where the data already resides, and an array of such devices allows us to revisit how analysis workflows interact with storage systems. By seamlessly blending flash storage and data analysis, we create an analysis workflow-aware storage system, AnalyzeThis. Our guiding principle is that analysis-awareness be deeply ingrained in every layer of the storage, elevating data analyses to first-class citizens and transforming AnalyzeThis into a potent analytics-aware appliance. We implement AnalyzeThis atop an emulation platform of the Active Flash array. Our results indicate that AnalyzeThis is viable, expediting workflow execution and minimizing data movement.
Adaptive data placement for staging-based coupled scientific workflows
Qian Sun, Tong Jin, Melissa Romanus, H. Bui, Fan Zhang, Hongfeng Yu, H. Kolla, S. Klasky, Jacqueline H. Chen, M. Parashar
doi: 10.1145/2807591.2807669

Data staging and in-situ/in-transit data processing are emerging as attractive approaches for supporting extreme-scale scientific workflows; they improve end-to-end performance by enabling runtime data sharing between the coupled simulation and data-analytics components of a workflow. However, the complex and dynamic data-exchange patterns exhibited by these workflows, coupled with varied data-access behaviors, make efficient data placement within the staging area challenging. In this paper we present an adaptive data-placement approach that addresses these challenges. Our approach adapts placement to application-specific, dynamic data-access patterns, and applies access-pattern-driven, location-aware mechanisms to reduce data-access costs and support efficient data sharing among the workflow components. We experimentally demonstrate the effectiveness of our approach on the Cray XK7 Titan using a real combustion-analysis workflow. The evaluation shows that our approach can effectively improve data-access performance and the overall efficiency of coupled scientific workflows.
Improving the scalability of the ocean barotropic solver in the community earth system model
Yong Hu, Xiaomeng Huang, A. Baker, Y. Tseng, F. Bryan, J. Dennis, Guangwen Yang
doi: 10.1145/2807591.2807596

High-resolution climate simulations are increasingly in demand and require tremendous computing resources. In the Community Earth System Model (CESM), the Parallel Ocean Program (POP) is computationally expensive for high-resolution grids (e.g., 0.1°) and is frequently the least scalable component of CESM for certain production simulations. In particular, the modified Preconditioned Conjugate Gradient (PCG) method used to solve the elliptic system of equations in the barotropic mode scales poorly at high core counts, which is problematic for high-resolution simulations. In this work we show that the communication costs of the barotropic solver occupy an increasing portion of total POP execution time as core counts grow. To mitigate this problem, we implement a preconditioned Chebyshev-type iterative method in POP (called P-CSI), which requires far fewer global reductions than PCG. We also develop an effective block preconditioner based on the Error Vector Propagation method so that P-CSI attains a competitive convergence rate. The improved scalability of P-CSI yields a 5.2x speedup of the barotropic mode in high-resolution POP on 16,875 cores, which translates into a 1.7x speedup of the overall POP simulation. Further, we verify via an ensemble-based statistical method that the new solver produces an ocean climate consistent with the original one.
IOrchestra: supporting high-performance data-intensive applications in the cloud via collaborative virtualization
R. C. Chiang, H. H. Huang, Timothy Wood, Changbin Liu, O. Spatscheck
doi: 10.1145/2807591.2807633

Multi-tier data-intensive applications are widely deployed in virtualized data centers for high scalability and reliability. Because response time is vital to user satisfaction, good performance must be achieved at each tier of an application to minimize overall latency. In such virtualized environments, however, each tier (e.g., application, database, web) is likely hosted by different virtual machines (VMs) on multiple physical servers, where a guest VM is unaware of changes outside its domain and the hypervisor does not know the configuration and runtime status of a guest VM. As a result, isolated virtualization domains lead to performance unpredictability and variance. In this paper we propose IOrchestra, a holistic collaborative-virtualization framework that bridges the semantic gaps in I/O stacks and system information across multiple VMs, improves virtual I/O performance through collaboration from guest domains, and increases resource utilization in data centers. We present several case studies demonstrating that IOrchestra addresses numerous drawbacks of current practice and reduces the I/O latency of various distributed cloud applications by up to 31%.