An elegant sufficiency: load-aware differentiated scheduling of data transfers
R. Kettimuthu, Gayane Vardoyan, G. Agrawal, P. Sadayappan, Ian T. Foster
We investigate the file transfer scheduling problem, where transfers among different endpoints must be scheduled to optimize performance metrics such as slowdown and turnaround time. We propose two new algorithms that exploit the fact that the aggregate bandwidth obtained over a network or at a storage system tends to increase with the number of concurrent transfers---but only up to a certain limit. The first algorithm, SEAL, uses runtime information and data-driven models to approximate system load and adapt transfer schedules and concurrency so as to maximize performance while avoiding saturation. We implement this algorithm using GridFTP as the transfer protocol and evaluate it using real transfer logs in a production WAN environment. Results show that SEAL can improve average slowdown and turnaround time by up to 25% and worst-case slowdown and turnaround time by up to 50%, compared with the best-performing baseline scheme. Our second algorithm, STEAL, further leverages user-supplied categorization of transfers as either "interactive" (requiring immediate processing) or "batch" (less time-critical). Results show that STEAL reduces the average slowdown of interactive transfers by 63% compared to the best-performing baseline and by 21% compared to SEAL. For batch transfers, STEAL improves the utilization of the bandwidth left unused by interactive transfers by 18% relative to the best-performing baseline. By elegantly ensuring a sufficient, but not excessive, allocation of concurrency to the right transfers, we significantly improve overall performance despite constraints.
{"title":"An elegant sufficiency: load-aware differentiated scheduling of data transfers","authors":"R. Kettimuthu, Gayane Vardoyan, G. Agrawal, P. Sadayappan, Ian T Foster","doi":"10.1145/2807591.2807660","DOIUrl":"https://doi.org/10.1145/2807591.2807660","url":null,"abstract":"We investigate the file transfer scheduling problem, where transfers among different endpoints must be scheduled to maximize pertinent metrics. We propose two new algorithms that exploit the fact that the aggregate bandwidth obtained over a network or at a storage system tends to increase with the number of concurrent transfers---but only up to a certain limit. The first algorithm, SEAL, uses runtime information and data-driven models to approximate system load and adapt transfer schedules and concurrency so as to maximize performance while avoiding saturation. We implement this algorithm using GridFTP as the transfer protocol and evaluate it using real transfer logs in a production WAN environment. Results show that SEAL can improve average slowdowns and turnaround times by up to 25% and worst-case slowdown and turnaround times by up to 50%, compared with the best-performing baseline scheme. Our second algorithm, STEAL, further leverages user-supplied categorization of transfers as either \"interactive\" (requiring immediate processing) or \"batch\" (less time-critical). Results show that STEAL reduces the average slowdown of interactive transfers by 63% compared to the best-performing baseline and by 21% compared to SEAL. For batch transfers, compared to the best-performing baseline, STEAL improves by 18% the utilization of the bandwidth unused by interactive transfers. By elegantly ensuring a sufficient, but not excessive, allocation of concurrency to the right transfers, we significantly improve overall performance despite constraints.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127817039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Regent: a high-productivity programming language for HPC with logical regions
Elliott Slaughter, Wonchan Lee, Sean Treichler, Michael A. Bauer, A. Aiken
We present Regent, a high-productivity programming language for high performance computing with logical regions. Regent users compose programs with tasks (functions eligible for parallel execution) and logical regions (hierarchical collections of structured objects). Regent programs appear to execute sequentially, require no explicit synchronization, and are trivially deadlock-free. Regent's type system catches many common classes of mistakes and guarantees that a program with correct serial execution produces identical results on parallel and distributed machines. We present an optimizing compiler for Regent that translates Regent programs into efficient implementations for Legion, an asynchronous task-based model. Regent employs several novel compiler optimizations to minimize the dynamic overhead of the runtime system and enable efficient operation. We evaluate Regent on three benchmark applications and demonstrate that Regent achieves performance comparable to hand-tuned Legion.
{"title":"Regent: a high-productivity programming language for HPC with logical regions","authors":"Elliott Slaughter, Wonchan Lee, Sean Treichler, Michael A. Bauer, A. Aiken","doi":"10.1145/2807591.2807629","DOIUrl":"https://doi.org/10.1145/2807591.2807629","url":null,"abstract":"We present Regent, a high-productivity programming language for high performance computing with logical regions. Regent users compose programs with tasks (functions eligible for parallel execution) and logical regions (hierarchical collections of structured objects). Regent programs appear to execute sequentially, require no explicit synchronization, and are trivially deadlock-free. Regent's type system catches many common classes of mistakes and guarantees that a program with correct serial execution produces identical results on parallel and distributed machines. We present an optimizing compiler for Regent that translates Regent programs into efficient implementations for Legion, an asynchronous task-based model. Regent employs several novel compiler optimizations to minimize the dynamic overhead of the runtime system and enable efficient operation. We evaluate Regent on three benchmark applications and demonstrate that Regent achieves performance comparable to hand-tuned Legion.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134069777","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge leadership computing facility
Devesh Tiwari, Saurabh Gupta, George Gallarno, James H. Rogers, Don E. Maxwell
The high computational capability of graphics processing units (GPUs) is enabling and driving the scientific discovery process at large scale. The world's second-fastest supercomputer for open science, Titan, has more than 18,000 GPUs that computational scientists use to perform scientific simulations and data analysis. Understanding of GPU reliability characteristics, however, is still in its nascent stage, since GPUs have only recently been deployed at large scale. This paper presents a detailed study of GPU errors and their impact on system operations and applications, describing experiences with the 18,688 GPUs on the Titan supercomputer as well as lessons learned in the process of efficient operation of GPUs at scale. These experiences are helpful to HPC sites that already have large-scale GPU clusters or plan to deploy GPUs in the future.
{"title":"Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge leadership computing facility","authors":"Devesh Tiwari, Saurabh Gupta, George Gallarno, James H. Rogers, Don E. Maxwell","doi":"10.1145/2807591.2807666","DOIUrl":"https://doi.org/10.1145/2807591.2807666","url":null,"abstract":"The high computational capability of graphics processing units (GPUs) is enabling and driving the scientific discovery process at large-scale. The world's second fastest supercomputer for open science, Titan, has more than 18,000 GPUs that computational scientists use to perform scientific simulations and data analysis. Understanding of GPU reliability characteristics, however, is still in its nascent stage since GPUs have only recently been deployed at large-scale. This paper presents a detailed study of GPU errors and their impact on system operations and applications, describing experiences with the 18,688 GPUs on the Titan supercomputer as well as lessons learned in the process of efficient operation of GPUs at scale. These experiences are helpful to HPC sites which already have large-scale GPU clusters or plan to deploy GPUs in the future.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125631660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GossipMap: a distributed community detection algorithm for billion-edge directed graphs
S. Bae, Bill Howe
In this paper, we describe a new distributed community detection algorithm for billion-edge directed graphs that, unlike modularity-based methods, achieves cluster quality on par with the best-known algorithms in the literature. We show that a simple approximation to the best-known serial algorithm dramatically reduces computation and enables distributed evaluation yet incurs only a very small impact on cluster quality. We present three main results: First, we show that the clustering produced by our scalable approximate algorithm compares favorably with prior results on small synthetic benchmarks and small real-world datasets (70 million edges). Second, we evaluate our algorithm on billion-edge directed graphs (a 1.5B edge social network graph, and a 3.7B edge web crawl), and show that the results exhibit the structural properties predicted by analysis of much smaller graphs from similar sources. Third, we show that our algorithm exhibits over 90% parallel efficiency on massive graphs in weak scaling experiments.
{"title":"GossipMap: a distributed community detection algorithm for billion-edge directed graphs","authors":"S. Bae, Bill Howe","doi":"10.1145/2807591.2807668","DOIUrl":"https://doi.org/10.1145/2807591.2807668","url":null,"abstract":"In this paper, we describe a new distributed community detection algorithm for billion-edge directed graphs that, unlike modularity-based methods, achieves cluster quality on par with the best-known algorithms in the literature. We show that a simple approximation to the best-known serial algorithm dramatically reduces computation and enables distributed evaluation yet incurs only a very small impact on cluster quality. We present three main results: First, we show that the clustering produced by our scalable approximate algorithm compares favorably with prior results on small synthetic benchmarks and small real-world datasets (70 million edges). Second, we evaluate our algorithm on billion-edge directed graphs (a 1.5B edge social network graph, and a 3.7B edge web crawl), and show that the results exhibit the structural properties predicted by analysis of much smaller graphs from similar sources. Third, we show that our algorithm exhibits over 90% parallel efficiency on massive graphs in weak scaling experiments.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122713783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallel implementation and performance optimization of the configuration-interaction method
H. Shan, Samuel Williams, C. Johnson, K. McElvain, W. Ormand
The configuration-interaction (CI) method, long a popular approach to describe quantum many-body systems, is cast as a very large sparse matrix eigenpair problem with matrices whose dimension can exceed one billion. Such formulations place high demands on memory capacity and memory bandwidth --- two quantities at a premium today. In this paper, we describe an efficient, scalable implementation, BIGSTICK, which, by factorizing both the basis and the interaction into two levels, can reconstruct the nonzero matrix elements on the fly, reduce the memory requirements by one or two orders of magnitude, and enable researchers to trade reduced resources for increased computational time. We optimize BIGSTICK on two leading HPC platforms --- the Cray XC30 and the IBM Blue Gene/Q. Specifically, we not only develop an empirically driven load-balancing strategy that can evenly distribute the matrix-vector multiplication across 256K threads, but also develop techniques that improve the performance of the Lanczos reorthogonalization. Combined, these optimizations improve performance by 1.3-8×, depending on platform and configuration.
{"title":"Parallel implementation and performance optimization of the configuration-interaction method","authors":"H. Shan, Samuel Williams, C. Johnson, K. McElvain, W. Ormand","doi":"10.1145/2807591.2807618","DOIUrl":"https://doi.org/10.1145/2807591.2807618","url":null,"abstract":"The configuration-interaction (CI) method, long a popular approach to describe quantum many-body systems, is cast as a very large sparse matrix eigenpair problem with matrices whose dimension can exceed one billion. Such formulations place high demands on memory capacity and memory bandwidth --- two quantities at a premium today. In this paper, we describe an efficient, scalable implementation, BIGSTICK, which, by factorizing both the basis and the interaction into two levels, can reconstruct the nonzero matrix elements on the fly, reduce the memory requirements by one or two orders of magnitude, and enable researchers to trade reduced resources for increased computational time. We optimize BIGSTICK on two leading HPC platforms --- the Cray XC30 and the IBM Blue Gene/Q. Specifically, we not only develop an empirically-driven load balancing strategy that can evenly distribute the matrix-vector multiplication across 256K threads, we also developed techniques that improve the performance of the Lanczos reorthogonalization. Combined, these optimizations improved performance by 1.3-8× depending on platform and configuration.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124008653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A practical approach to reconciling availability, performance, and capacity in provisioning extreme-scale storage systems
Lipeng Wan, Feiyi Wang, S. Oral, Devesh Tiwari, Sudharshan S. Vazhkudai, Qing Cao
The increasing data demands of high-performance computing applications rapidly drive up the capacity, capability, and reliability requirements of storage systems. As systems scale, component failures and repair times increase, significantly impacting data availability. A wide array of decision points must be balanced in designing such systems. We propose a systematic approach that balances and optimizes both initial and continuous spare provisioning, based on a detailed investigation of the anatomy and field failure data of extreme-scale storage systems. We consider component failure characteristics and their cost and impact at the system level simultaneously. We build a tool to evaluate different provisioning schemes, and the results demonstrate that our optimized provisioning can reduce the duration of data unavailability by as much as 52% under a fixed budget. We also observe that non-disk components have much higher failure rates than disks and warrant careful consideration in the overall provisioning process.
{"title":"A practical approach to reconciling availability, performance, and capacity in provisioning extreme-scale storage systems","authors":"Lipeng Wan, Feiyi Wang, S. Oral, Devesh Tiwari, Sudharshan S. Vazhkudai, Qing Cao","doi":"10.1145/2807591.2807615","DOIUrl":"https://doi.org/10.1145/2807591.2807615","url":null,"abstract":"The increasing data demands from high-performance computing applications significantly accelerate the capacity, capability and reliability requirements of storage systems. As systems scale, component failures and repair times increase, significantly impacting data availability. A wide array of decision points must be balanced in designing such systems. We propose a systematic approach that balances and optimizes both initial and continuous spare provisioning based on a detailed investigation of the anatomy and field failure data analysis of extreme-scale storage systems. We consider the component failure characteristics and its cost and impact at the system level simultaneously. We build a tool to evaluate different provisioning schemes, and the results demonstrate that our optimized provisioning can reduce the duration of data unavailability by as much as 52% under a fixed budget. We also observe that non-disk components have much higher failure rates than disks, and warrant careful considerations in the overall provisioning process.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124108612","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Smart: a MapReduce-like framework for in-situ scientific analytics
Yi Wang, G. Agrawal, Tekin Bicer, Wei Jiang
In-situ analytics has lately been shown to be an effective approach to reduce both I/O and storage costs for scientific analytics. Developing an efficient in-situ implementation, however, involves many challenges, including parallelization, data movement or sharing, and resource allocation. Based on the premise that MapReduce can be an appropriate API for specifying scientific analytics applications, we present a novel MapReduce-like framework that supports efficient in-situ scientific analytics, and address several challenges that arise in applying the MapReduce idea for in-situ processing. Specifically, our implementation can load simulated data directly from distributed memory, and it uses a modified API that helps meet the strict memory constraints of in-situ analytics. The framework is designed so that analytics can be launched from the parallel code region of a simulation program. We have developed both time-sharing and space-sharing modes for maximizing the performance in different scenarios, with the former even avoiding any copying of data from the simulation to the analytics program. We demonstrate the functionality, efficiency, and scalability of our system using different simulation and analytics programs, executed on clusters with multi-core and many-core nodes.
{"title":"Smart: a MapReduce-like framework for in-situ scientific analytics","authors":"Yi Wang, G. Agrawal, Tekin Bicer, Wei Jiang","doi":"10.1145/2807591.2807650","DOIUrl":"https://doi.org/10.1145/2807591.2807650","url":null,"abstract":"In-situ analytics has lately been shown to be an effective approach to reduce both I/O and storage costs for scientific analytics. Developing an efficient in-situ implementation, however, involves many challenges, including parallelization, data movement or sharing, and resource allocation. Based on the premise that MapReduce can be an appropriate API for specifying scientific analytics applications, we present a novel MapReduce-like framework that supports efficient in-situ scientific analytics, and address several challenges that arise in applying the MapReduce idea for in-situ processing. Specifically, our implementation can load simulated data directly from distributed memory, and it uses a modified API that helps meet the strict memory constraints of in-situ analytics. The framework is designed so that analytics can be launched from the parallel code region of a simulation program. We have developed both time sharing and space sharing modes for maximizing the performance in different scenarios, with the former even avoiding any copying of data from simulation to the analytics program. We demonstrate the functionality, efficiency, and scalability of our system, by using different simulation and analytics programs, executed on clusters with multi-core and many-core nodes.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"330 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115968792","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient implementation of quantum materials simulations on distributed CPU-GPU systems
R. Solcà, A. Kozhevnikov, A. Haidar, S. Tomov, J. Dongarra, T. Schulthess
We present a scalable implementation of the Linearized Augmented Plane Wave method for distributed memory systems. It relies on an efficient distributed, block-cyclic setup of the Hamiltonian and overlap matrices, and allows us to turn around highly accurate 1000+ atom all-electron quantum materials simulations on clusters with a few hundred nodes. The implementation runs efficiently on standard multi-core CPU nodes, as well as hybrid CPU-GPU nodes. The key for the latter is a novel algorithm to solve the generalized eigenvalue problem for dense, complex Hermitian matrices on distributed hybrid CPU-GPU systems. Performance tests for Li-intercalated CoO2 supercells containing 1501 atoms demonstrate that high-accuracy, transferable quantum simulations can now be used in throughput materials search problems. While our application can achieve scalable performance with CPU-only libraries such as ScaLAPACK or ELPA2, our new hybrid solver enables the efficient use of GPUs, scales to the desired performance with substantially fewer cluster nodes, and, notably, is considerably more energy efficient than traditional multi-core CPU-only systems for such complex applications.
{"title":"Efficient implementation of quantum materials simulations on distributed CPU-GPU systems","authors":"R. Solcà, A. Kozhevnikov, A. Haidar, S. Tomov, J. Dongarra, T. Schulthess","doi":"10.1145/2807591.2807654","DOIUrl":"https://doi.org/10.1145/2807591.2807654","url":null,"abstract":"We present a scalable implementation of the Linearized Augmented Plane Wave method for distributed memory systems, which relies on an efficient distributed, block-cyclic setup of the Hamiltonian and overlap matrices and allows us to turn around highly accurate 1000+ atom all-electron quantum materials simulations on clusters with a few hundred nodes. The implementation runs efficiently on standard multi-core CPU nodes, as well as hybrid CPU-GPU nodes. The key for the latter is a novel algorithm to solve the generalized eigenvalue problem for dense, complex Hermitian matrices on distributed hybrid CPU-GPU systems. Performance tests for Li-intercalated CoO2 supercells containing 1501 atoms demonstrate that high-accuracy, transferable quantum simulations can now be used in throughput materials search problems. While our application can benefit and get scalable performance through CPU-only libraries like ScaLAPACK or ELPA2, our new hybrid solver enables the efficient use of GPUs and shows that a hybrid CPU-GPU architecture scales to a desired performance using substantially fewer cluster nodes, and notably, is considerably more energy efficient than the traditional multi-core CPU only systems for such complex applications.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117239984","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scientific benchmarking of parallel computing systems: twelve ways to tell the masses when reporting performance results
T. Hoefler, Roberto Belli
Measuring and reporting performance of parallel computers constitutes the basis for scientific advancement of high-performance computing (HPC). Most scientific reports show performance improvements of new techniques and are thus obliged to ensure reproducibility or at least interpretability. Our investigation of a stratified sample of 120 papers across three top conferences in the field shows that the state of the practice is lacking. For example, it is often unclear if reported improvements are deterministic or observed by chance. In addition to distilling best practices from existing work, we propose statistically sound analysis and reporting techniques and simple guidelines for experimental design in parallel computing and codify them in a portable benchmarking library. We aim to improve the standards of reporting research results and initiate a discussion in the HPC field. A wide adoption of our minimal set of rules will lead to better interpretability of performance results and improve the scientific culture in HPC.
{"title":"Scientific benchmarking of parallel computing systems: twelve ways to tell the masses when reporting performance results","authors":"T. Hoefler, Roberto Belli","doi":"10.1145/2807591.2807644","DOIUrl":"https://doi.org/10.1145/2807591.2807644","url":null,"abstract":"Measuring and reporting performance of parallel computers constitutes the basis for scientific advancement of high-performance computing (HPC). Most scientific reports show performance improvements of new techniques and are thus obliged to ensure reproducibility or at least interpretability. Our investigation of a stratified sample of 120 papers across three top conferences in the field shows that the state of the practice is lacking. For example, it is often unclear if reported improvements are deterministic or observed by chance. In addition to distilling best practices from existing work, we propose statistically sound analysis and reporting techniques and simple guidelines for experimental design in parallel computing and codify them in a portable benchmarking library. We aim to improve the standards of reporting research results and initiate a discussion in the HPC field. A wide adoption of our minimal set of rules will lead to better interpretability of performance results and improve the scientific culture in HPC.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121926678","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Profile-based power shifting in interconnection networks with on/off links
Shinobu Miwa, Hiroshi Nakamura
Overprovisioning hardware devices and coordinating their power budgets have been proposed as a way to improve the application performance of future power-constrained HPC systems. This coordination process is called power shifting. Meanwhile, recent studies have revealed that on/off links can save network power in HPC systems. Future HPC systems will thus adopt on/off links in addition to power shifting. This paper explores power shifting in interconnection networks with on/off links. Given that on/off links keep network power low at application runtime, we can transfer an appreciable portion of the network power budget to other devices before an application runs. We thus propose a profile-based power shifting technique that allows HPC users to transfer the power budget remaining on networks to other devices at the time of job dispatch. Experimental results show that the proposed technique appreciably improves application performance under various power constraints.
{"title":"Profile-based power shifting in interconnection networks with on/off links","authors":"Shinobu Miwa, Hiroshi Nakamura","doi":"10.1145/2807591.2807639","DOIUrl":"https://doi.org/10.1145/2807591.2807639","url":null,"abstract":"Overprovisioning hardware devices and coordinating their power budgets are proposed to improve the application performance of future power-constrained HPC systems. This coordination process is called power shifting. Meanwhile, recent studies have revealed that on/off links can save network power in HPC systems. Future HPC systems will thus adopt on/off links in addition to power shifting. This paper explores power shifting in interconnection networks with on/off links. Given that on/off links keep network power low at application runtime, we can transfer appreciable quantities of power budgets on networks to other devices before an application runs. We thus propose a profile-based power shifting technique that allows HPC users to transfer the power budget remaining on networks to other devices at the time of job dispatch. Experimental results show that the proposed technique appreciably improves application performance under various power constraints.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130463071","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}