
Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing: Latest Publications

GPU-Aware Non-contiguous Data Movement In Open MPI
Wei Wu, G. Bosilca, Rolf Vandevaart, Sylvain Jeaugey, J. Dongarra
Due to better parallel density and power efficiency, GPUs have become more popular for use in scientific applications. Many of these applications are based on the ubiquitous Message Passing Interface (MPI) programming paradigm, and take advantage of non-contiguous memory layouts to exchange data between processes. However, support for efficient non-contiguous data movements for GPU-resident data is still in its infancy, imposing a negative impact on the overall application performance. To address this shortcoming, we present a solution where we take advantage of the inherent parallelism in the datatype packing and unpacking operations. We developed a close integration between Open MPI's stack-based datatype engine, NVIDIA's Unified Memory Architecture and GPUDirect capabilities. In this design the datatype packing and unpacking operations are offloaded onto the GPU and handled by specialized GPU kernels, while the CPU remains the driver for data movements between nodes. By incorporating our design into the Open MPI library we have shown significantly better performance for non-contiguous GPU-resident data transfers on both shared and distributed memory machines.
{"title":"GPU-Aware Non-contiguous Data Movement In Open MPI","authors":"Wei Wu, G. Bosilca, Rolf Vandevaart, Sylvain Jeaugey, J. Dongarra","doi":"10.1145/2907294.2907317","DOIUrl":"https://doi.org/10.1145/2907294.2907317","url":null,"abstract":"Due to better parallel density and power efficiency, GPUs have become more popular for use in scientific applica- tions. Many of these applications are based on the ubiquitous Message Passing Interface (MPI) programming paradigm, and take advantage of non-contiguous memory layouts to exchange data between processes. However, support for efficient non- contiguous data movements for GPU-resident data is still in its infancy, imposing a negative impact on the overall application performance. To address this shortcoming, we present a solution where we take advantage of the inherent parallelism in the datatype pack- ing and unpacking operations. We developed a close integration between Open MPI's stack-based datatype engine, NVIDIA's Unified Memory Architecture and GPUDirect capabilities. In this design the datatype packing and unpacking operations are offloaded onto the GPU and handled by specialized GPU kernels, while the CPU remains the driver for data movements between nodes. By incorporating our design into the Open MPI library we have shown significantly better performance for non-contiguous GPU-resident data transfers on both shared and distributed memory machines.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"140 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91112992","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 21
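The non-contiguous transfers this paper targets are expressed through MPI derived datatypes. Below is a minimal sketch in standard MPI C, sending one column of a GPU-resident matrix; it assumes a CUDA-aware Open MPI build (the GPU-side pack/unpack kernels the paper contributes are internal to the library, not a user-visible API):

```c
/* Sketch: send one column of a row-major N x N matrix that lives in GPU
 * memory, described by an MPI vector datatype. Assumes a CUDA-aware
 * Open MPI build so a device pointer can be passed to MPI directly. */
#include <mpi.h>
#include <cuda_runtime.h>

#define N 1024

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *d_matrix;
    cudaMalloc((void **)&d_matrix, (size_t)N * N * sizeof(double));

    /* One column of a row-major matrix: N blocks of 1 element, stride N. */
    MPI_Datatype column;
    MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0)
        MPI_Send(d_matrix, 1, column, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_matrix, 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
    cudaFree(d_matrix);
    MPI_Finalize();
    return 0;
}
```

In the design described above, the strided gather implied by `column` is offloaded to GPU kernels rather than being staged element by element through host memory.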
Automatic Hybridization of Runtime Systems
Kyle C. Hale, Conor Hetland, P. Dinda
The hybrid runtime (HRT) model offers a plausible path towards high performance and efficiency. By integrating the OS kernel, parallel runtime, and application, an HRT allows the runtime developer to leverage the full privileged feature set of the hardware and specialize OS services to the runtime's needs. However, conforming to the HRT model currently requires a complete port of the runtime and application to the kernel level, for example to our Nautilus kernel framework, and this requires knowledge of kernel internals. In response, we developed Multiverse, a system that bridges the gap between a built-from-scratch HRT and a legacy runtime system. Multiverse allows existing, unmodified applications and runtimes to be brought into the HRT model without any porting effort whatsoever. Developers simply recompile their package with our compiler toolchain, and Multiverse automatically splits the execution of the application between the domains of a legacy OS and an HRT environment. To the user, the package appears to run as usual on Linux, but the bulk of it now runs as a kernel. The developer can then incrementally extend the runtime and application to take advantage of the HRT model. We describe the design and implementation of Multiverse, and illustrate its capabilities using the Racket runtime system.
{"title":"Automatic Hybridization of Runtime Systems","authors":"Kyle C. Hale, Conor Hetland, P. Dinda","doi":"10.1145/2907294.2907309","DOIUrl":"https://doi.org/10.1145/2907294.2907309","url":null,"abstract":"The hybrid runtime (HRT) model offers a plausible path towards high performance and efficiency. By integrating the OS kernel, parallel runtime, and application, an HRT allows the runtime developer to leverage the full privileged feature set of the hardware and specialize OS services to the runtime's needs. However, conforming to the HRT model currently requires a complete port of the runtime and application to the kernel level, for example to our Nautilus kernel framework, and this requires knowledge of kernel internals. In response, we developed Multiverse, a system that bridges the gap between a built-from-scratch HRT and a legacy runtime system. Multiverse allows existing, unmodified applications and runtimes to be brought into the HRT model without any porting effort whatsoever. Developers simply recompile their package with our compiler toolchain, and Multiverse automatically splits the execution of the application between the domains of a legacy OS and an HRT environment. To the user, the package appears to run as usual on Linux, but the bulk of it now runs as a kernel. The developer can then incrementally extend the runtime and application to take advantage of the HRT model. We describe the design and implementation of Multiverse, and illustrate its capabilities using the Racket runtime system.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86761142","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
NVL-C: Static Analysis Techniques for Efficient, Correct Programming of Non-Volatile Main Memory Systems
J. Denny, Seyong Lee, J. Vetter
Computer architecture experts expect that non-volatile memory (NVM) hierarchies will play a more significant role in future systems including mobile, enterprise, and HPC architectures. With this expectation in mind, we present NVL-C: a novel programming system that facilitates the efficient and correct programming of NVM main memory systems. The NVL-C programming abstraction extends C with a small set of intuitive language features that target NVM main memory, and can be combined directly with traditional C memory model features for DRAM. We have designed these new features to enable compiler analyses and run-time checks that can improve performance and guard against a number of subtle programming errors, which, when left uncorrected, can corrupt NVM-stored data. Moreover, to enable recovery of data across application or system failures, these NVL-C features include a flexible directive for specifying NVM transactions. So that our implementation might be extended to other compiler front ends and languages, the majority of our compiler analyses are implemented in an extended version of LLVM's intermediate representation (LLVM IR). We evaluate NVL-C on a number of applications to show its flexibility, performance, and correctness.
{"title":"NVL-C: Static Analysis Techniques for Efficient, Correct Programming of Non-Volatile Main Memory Systems","authors":"J. Denny, Seyong Lee, J. Vetter","doi":"10.1145/2907294.2907303","DOIUrl":"https://doi.org/10.1145/2907294.2907303","url":null,"abstract":"Computer architecture experts expect that non-volatile memory (NVM) hierarchies will play a more significant role in future systems including mobile, enterprise, and HPC architectures. With this expectation in mind, we present NVL-C: a novel programming system that facilitates the efficient and correct programming of NVM main memory systems. The NVL-C programming abstraction extends C with a small set of intuitive language features that target NVM main memory, and can be combined directly with traditional C memory model features for DRAM. We have designed these new features to enable compiler analyses and run-time checks that can improve performance and guard against a number of subtle programming errors, which, when left uncorrected, can corrupt NVM-stored data. Moreover, to enable recovery of data across application or system failures, these NVL-C features include a flexible directive for specifying NVM transactions. So that our implementation might be extended to other compiler front ends and languages, the majority of our compiler analyses are implemented in an extended version of LLVM's intermediate representation (LLVM IR). We evaluate NVL-C on a number of applications to show its flexibility, performance, and correctness.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79103973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 29
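The abstract describes NVM-aware pointers plus a transaction directive. The sketch below is a schematic reconstruction of that model; the specific identifiers (`nvl.h`, the `nvl` qualifier, `nvl_create`, `nvl_alloc_nv`, `nvl_set_root`, `#pragma nvl atomic`) are assumptions drawn from the abstract's description, not a verified NVL-C API, and it would compile only under the NVL-C toolchain:

```c
/* Schematic sketch of the programming model the abstract describes:
 * NVM pointers are distinguished from DRAM pointers at the type level,
 * and crash-consistent updates are grouped by a transaction directive.
 * All identifiers here are assumed, not verified against NVL-C. */
#include <nvl.h>

struct counter { int value; };

void bump(const char *heap_file)
{
    /* Open or create an NVM heap backed by a file (assumed signature). */
    nvl_heap_t *heap = nvl_create(heap_file, 0, 0600);
    nvl struct counter *c = nvl_alloc_nv(heap, 1, struct counter);
    nvl_set_root(heap, c);

    /* Transaction: after a crash, either all stores inside are visible
     * in NVM or none are. */
    #pragma nvl atomic heap(heap)
    {
        c->value += 1;
    }

    nvl_close(heap);
}
```

The type-level distinction is what enables the compile-time checks the paper emphasizes, such as rejecting a DRAM pointer stored into an NVM-resident structure.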
Session details: Massively Multicore Systems
M. Wahib
{"title":"Session details: Massively Multicore Systems","authors":"M. Wahib","doi":"10.1145/3257975","DOIUrl":"https://doi.org/10.1145/3257975","url":null,"abstract":"","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"70 5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79302867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Session details: Potpourri
N. Maruyama
{"title":"Session details: Potpourri","authors":"N. Maruyama","doi":"10.1145/3257978","DOIUrl":"https://doi.org/10.1145/3257978","url":null,"abstract":"","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"238 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73781405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
IBIS: Interposed Big-data I/O Scheduler
Yiqi Xu, Ming Zhao
Big-data systems are increasingly shared by diverse, data-intensive applications from different domains. However, existing systems lack the support for I/O management, and the performance of big-data applications degrades in unpredictable ways when they contend for I/Os. To address this challenge, this paper proposes IBIS, an Interposed Big-data I/O Scheduler, to provide I/O performance differentiation for competing applications in a shared big-data system. IBIS transparently intercepts, isolates, and schedules an application's different phases of I/Os via an I/O interposition layer on every datanode of the big-data system. It provides a new proportional-share I/O scheduler, SFQ(D2), to allow applications to share the I/O service of each datanode with good fairness and resource utilization. It enables the distributed I/O schedulers to coordinate with one another and to achieve proportional sharing of the big-data system's total I/O service in a scalable manner. Finally, it supports the shared use of big-data resources by diverse frameworks and manages the I/Os from different types of big-data workloads (e.g., batch jobs vs. queries) across these frameworks. The prototype of IBIS is implemented in Hadoop/YARN, a widely used big-data system. Experiments based on a variety of representative applications (WordCount, TeraSort, Facebook, TPC-H) show that IBIS achieves good total-service proportional sharing with low overhead in both application performance and resource usages. IBIS is also shown to support various performance policies: it can deliver stronger performance isolation than native Hadoop/YARN (99% better for WordCount and 15% better for TPC-H queries) with good resource utilization; and it can also achieve perfect proportional slowdown with better application performance (30% better than native Hadoop).
{"title":"IBIS: Interposed Big-data I/O Scheduler","authors":"Yiqi Xu, Ming Zhao","doi":"10.1145/2907294.2907319","DOIUrl":"https://doi.org/10.1145/2907294.2907319","url":null,"abstract":"Big-data systems are increasingly shared by diverse, data-intensive applications from different domains. However, existing systems lack the support for I/O management, and the performance of big-data applications degrades in unpredictable ways when they contend for I/Os. To address this challenge, this paper proposes IBIS, an Interposed Big-data I/O Scheduler, to provide I/O performance differentiation for competing applications in a shared big-data system. IBIS transparently intercepts, isolates, and schedules an application's different phases of I/Os via an I/O interposition layer on every datanode of the big-data system. It provides a new proportional-share I/O scheduler, SFQ(D2), to allow applications to share the I/O service of each datanode with good fairness and resource utilization. It enables the distributed I/O schedulers to coordinate with one another and to achieve proportional sharing of the big-data system's total I/O service in a scalable manner. Finally, it supports the shared use of big-data resources by diverse frameworks and manages the I/Os from different types of big-data workloads (e.g., batch jobs vs. queries) across these frameworks. The prototype of IBIS is implemented in Hadoop/YARN, a widely used big-data system. Experiments based on a variety of representative applications (WordCount, TeraSort, Facebook, TPC-H) show that IBIS achieves good total-service proportional sharing with low overhead in both application performance and resource usages. IBIS is also shown to support various performance policies: it can deliver stronger performance isolation than native Hadoop/YARN (99% better for WordCount and 15% better for TPC-H queries) with good resource utilization; and it can also achieve perfect proportional slowdown with better application performance (30% better than native Hadoop).","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"32 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81251032","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
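SFQ(D2) belongs to the start-time fair queuing family. As a reference point, here is a minimal single-node sketch of the classic depth-controlled variant SFQ(D) that it builds on; the weights, queue sizes, and linear-scan dispatch are simplifications, and the D2-specific refinements are described in the paper itself:

```c
/* Minimal sketch of depth-controlled start-time fair queuing, SFQ(D).
 * Each application i has weight w[i]; a request of size `cost` gets
 * start tag S = max(vtime, last_finish[i]) and finish tag F = S + cost/w[i].
 * While fewer than D requests are outstanding, the pending request with
 * the smallest start tag is dispatched. */
#include <stdio.h>

#define APPS 2
#define DEPTH 4          /* the "D" in SFQ(D) */
#define QLEN 16

typedef struct { int app; double cost, start_tag, finish_tag; } Req;

static double vtime = 0.0;                 /* global virtual time        */
static double last_finish[APPS] = {0.0};   /* per-app finish-tag horizon */
static double weight[APPS] = {3.0, 1.0};   /* 3:1 proportional shares    */

static Req queue[QLEN];
static int qlen = 0, outstanding = 0;

void submit(int app, double cost)
{
    Req r = { app, cost, 0.0, 0.0 };
    r.start_tag  = (vtime > last_finish[app]) ? vtime : last_finish[app];
    r.finish_tag = r.start_tag + cost / weight[app];
    last_finish[app] = r.finish_tag;
    queue[qlen++] = r;
}

/* Dispatch while the device can accept more requests. */
void dispatch(void)
{
    while (outstanding < DEPTH && qlen > 0) {
        int best = 0;
        for (int i = 1; i < qlen; i++)
            if (queue[i].start_tag < queue[best].start_tag)
                best = i;
        Req r = queue[best];
        queue[best] = queue[--qlen];
        vtime = r.start_tag;           /* virtual time tracks dispatches */
        outstanding++;
        printf("dispatch app %d (S=%.2f)\n", r.app, r.start_tag);
    }
}

void complete(void) { if (outstanding > 0) outstanding--; }

int main(void)
{
    for (int i = 0; i < 6; i++) { submit(0, 1.0); submit(1, 1.0); }
    dispatch();    /* the first DEPTH dispatches split ~3:1 by weight */
    return 0;
}
```

Because the tags encode each application's weighted progress, a backlogged application receives I/O service in proportion to its weight without starving the others.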
Session details: Graph Algorithms
S. Song
{"title":"Session details: Graph Algorithms","authors":"S. Song","doi":"10.1145/3257977","DOIUrl":"https://doi.org/10.1145/3257977","url":null,"abstract":"","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"22 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83350549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Consecutive Job Submission Behavior at Mira Supercomputer
Stephan Schlagkamp, Rafael Ferreira da Silva, W. Allcock, E. Deelman, U. Schwiegelshohn
Understanding user behavior is crucial for the evaluation of scheduling and allocation performance in HPC environments. This paper aims to further understand how users react dynamically to different levels of system performance by comprehensively analyzing one form of recorded behavior: the delays before subsequent job submissions. To that end, we characterize a workload trace covering one year of job submissions from the Mira supercomputer at the ALCF (Argonne Leadership Computing Facility). We perform an in-depth analysis of correlations between job characteristics, system performance metrics, and the subsequent user behavior. Analysis results show that user behavior is significantly influenced by long waiting times, and that complex jobs (in number of nodes and CPU hours) lead to longer delays in subsequent job submissions.
{"title":"Consecutive Job Submission Behavior at Mira Supercomputer","authors":"Stephan Schlagkamp, Rafael Ferreira da Silva, W. Allcock, E. Deelman, U. Schwiegelshohn","doi":"10.1145/2907294.2907314","DOIUrl":"https://doi.org/10.1145/2907294.2907314","url":null,"abstract":"Understanding user behavior is crucial for the evaluation of scheduling and allocation performances in HPC environments. This paper aims to further understand the dynamic user reaction to different levels of system performance by performing a comprehensive analysis of user behavior in recorded data in the form of delays in the subsequent job submission behavior. Therefore, we characterize a workload trace covering one year of job submissions from the Mira supercomputer at ALCF (Argonne Leadership Computing Facility). We perform an in-depth analysis of correlations between job characteristics, system performance metrics, and the subsequent user behavior. Analysis results show that the user behavior is significantly influenced by long waiting times, and that complex jobs (number of nodes and CPU hours) lead to longer delays in subsequent job submissions.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81825369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 23
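The delays in subsequent job submissions can be made concrete with a small trace computation. A sketch with a hypothetical record layout and toy data: for each consecutive job pair of the same user, it measures the think time (previous completion to next submission) and its Pearson correlation with the previous job's queue wait:

```c
/* Sketch of the kind of trace analysis the paper performs. The Job
 * record layout and the toy trace are hypothetical. */
#include <stdio.h>
#include <math.h>

typedef struct { int user; double submit, start, end; } Job;

/* Pearson correlation between x[] and y[]. */
double pearson(const double *x, const double *y, int n)
{
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        sx += x[i]; sy += y[i];
        sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i];
    }
    double cov = sxy - sx * sy / n;
    double vx  = sxx - sx * sx / n, vy = syy - sy * sy / n;
    return cov / sqrt(vx * vy);
}

int main(void)
{
    /* Toy trace, sorted by submit time, single user. */
    Job t[] = { {1, 0, 10, 20}, {1, 25, 90, 100},
                {1, 180, 200, 210}, {1, 400, 410, 420} };
    int n = sizeof t / sizeof t[0];

    double wait[8], think[8];
    int m = 0;
    for (int i = 0; i + 1 < n; i++)
        if (t[i].user == t[i + 1].user && t[i + 1].submit > t[i].end) {
            wait[m]  = t[i].start - t[i].submit;    /* queue wait       */
            think[m] = t[i + 1].submit - t[i].end;  /* submission delay */
            m++;
        }
    printf("corr(wait, next-submission delay) = %.3f\n",
           pearson(wait, think, m));
    return 0;
}
```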
Scalable I/O-Aware Job Scheduling for Burst Buffer Enabled HPC Clusters
Stephen Herbein, D. Ahn, D. Lipari, T. Scogland, M. Stearman, Mark Grondona, J. Garlick, B. Springmeyer, M. Taufer
The economics of flash vs. disk storage is driving HPC centers to incorporate faster solid-state burst buffers into the storage hierarchy in exchange for smaller parallel file system (PFS) bandwidth. In systems with an underprovisioned PFS, avoiding I/O contention at the PFS level will become crucial to achieving high computational efficiency. In this paper, we propose novel batch job scheduling techniques that reduce such contention by integrating I/O awareness into scheduling policies such as EASY backfilling. We model the available bandwidth of links between each level of the storage hierarchy (i.e., burst buffers, I/O network, and PFS), and our I/O-aware schedulers use this model to avoid contention at any level in the hierarchy. We integrate our approach into Flux, a next-generation resource and job management framework, and evaluate the effectiveness and computational costs of our I/O-aware scheduling. Our results show that by reducing I/O contention for underprovisioned PFSes, our solution reduces job performance variability by up to 33% and decreases I/O-related utilization losses by up to 21%, which ultimately increases the amount of science performed by scientific workloads.
{"title":"Scalable I/O-Aware Job Scheduling for Burst Buffer Enabled HPC Clusters","authors":"Stephen Herbein, D. Ahn, D. Lipari, T. Scogland, M. Stearman, Mark Grondona, J. Garlick, B. Springmeyer, M. Taufer","doi":"10.1145/2907294.2907316","DOIUrl":"https://doi.org/10.1145/2907294.2907316","url":null,"abstract":"The economics of flash vs. disk storage is driving HPC centers to incorporate faster solid-state burst buffers into the storage hierarchy in exchange for smaller parallel file system (PFS) bandwidth. In systems with an underprovisioned PFS, avoiding I/O contention at the PFS level will become crucial to achieving high computational efficiency. In this paper, we propose novel batch job scheduling techniques that reduce such contention by integrating I/O awareness into scheduling policies such as EASY backfilling. We model the available bandwidth of links between each level of the storage hierarchy (i.e., burst buffers, I/O network, and PFS), and our I/O-aware schedulers use this model to avoid contention at any level in the hierarchy. We integrate our approach into Flux, a next-generation resource and job management framework, and evaluate the effectiveness and computational costs of our I/O-aware scheduling. Our results show that by reducing I/O contention for underprovisioned PFSes, our solution reduces job performance variability by up to 33% and decreases I/O-related utilization losses by up to 21%, which ultimately increases the amount of science performed by scientific workloads.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"87 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72892671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 67
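The hierarchy bandwidth model the abstract describes can be sketched as an admission check run before a job is backfilled: the job is admitted only if every storage level still has headroom for its declared I/O rate. The level names, capacities, and job fields below are illustrative assumptions, not Flux's actual interfaces:

```c
/* Sketch of I/O-aware backfilling: model remaining bandwidth at each
 * level of the storage hierarchy and admit a candidate job only if it
 * fits at every level; reserve on admit, release on completion. */
#include <stdio.h>
#include <stdbool.h>

enum { BURST_BUFFER, IO_NETWORK, PFS, LEVELS };

typedef struct { double need[LEVELS]; int nodes; } Job;

/* GB/s per level; the PFS is deliberately underprovisioned. */
static double capacity[LEVELS] = { 1600.0, 800.0, 100.0 };
static double in_use[LEVELS]   = { 0.0, 0.0, 0.0 };

bool try_backfill(const Job *j)
{
    for (int l = 0; l < LEVELS; l++)
        if (in_use[l] + j->need[l] > capacity[l])
            return false;               /* would contend at level l */
    for (int l = 0; l < LEVELS; l++)
        in_use[l] += j->need[l];        /* reserve bandwidth */
    return true;
}

void job_done(const Job *j)
{
    for (int l = 0; l < LEVELS; l++)
        in_use[l] -= j->need[l];
}

int main(void)
{
    Job a = { { 200.0, 100.0, 60.0 }, 32 };
    Job b = { { 200.0, 100.0, 60.0 }, 32 };
    printf("job a backfilled: %d\n", try_backfill(&a));  /* 1 */
    printf("job b backfilled: %d\n", try_backfill(&b));  /* 0: 60+60 > 100 at PFS */
    return 0;
}
```

The design point this illustrates is that the scarce resource is not the burst buffer itself but the links below it, so the check must cover every level, not just the first.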
SDS-Sort: Scalable Dynamic Skew-aware Parallel Sorting
Bin Dong, S. Byna, Kesheng Wu
Parallel sorting is an essential algorithm in large-scale data analytics using distributed memory systems. As the number of processes increases, existing parallel sorting algorithms could become inefficient because of the unbalanced workload. A common cause of load imbalance is the skewness of data, which is common in application data sets from physics, biology, earth and planetary sciences. In this work, we introduce a new scalable dynamic skew-aware parallel sorting algorithm, named SDS-Sort. It uses a skew-aware partition method to guarantee a tighter upper bound on the workload of each process. To improve load balance among parallel processes, existing algorithms usually add extra variables to the sorting key, which increase the time needed to complete the sorting operation. SDS-Sort allows a user to select any sorting key without sacrificing performance. SDS-Sort also provides optimizations, including adaptive local merging, overlapping of data exchange and data processing, and dynamic selection of data processing algorithms for different hardware configurations and for partially ordered data. SDS-Sort uses local-sampling based partitioning to further reduce its overhead. We tested SDS-Sort extensively on Edison, a Cray XC30 supercomputer. Timing measurements show that SDS-Sort can scale to 130K CPU cores and deliver a sorting throughput of 117TB/min. In tests with real application data from large science projects, SDS-Sort outperforms HykSort, a state-of-art parallel sorting algorithm, by 3.4X.
{"title":"SDS-Sort: Scalable Dynamic Skew-aware Parallel Sorting","authors":"Bin Dong, S. Byna, Kesheng Wu","doi":"10.1145/2907294.2907300","DOIUrl":"https://doi.org/10.1145/2907294.2907300","url":null,"abstract":"Parallel sorting is an essential algorithm in large-scale data analytics using distributed memory systems. As the number of processes increases, existing parallel sorting algorithms could become inefficient because of the unbalanced workload. A common cause of load imbalance is the skewness of data, which is common in application data sets from physics, biology, earth and planetary sciences. In this work, we introduce a new scalable dynamic skew-aware parallel sorting algorithm, named SDS-Sort. It uses a skew-aware partition method to guarantee a tighter upper bound on the workload of each process. To improve load balance among parallel processes, existing algorithms usually add extra variables to the sorting key, which increase the time needed to complete the sorting operation. SDS-Sort allows a user to select any sorting key without sacrificing performance. SDS-Sort also provides optimizations, including adaptive local merging, overlapping of data exchange and data processing, and dynamic selection of data processing algorithms for different hardware configurations and for partially ordered data. SDS-Sort uses local-sampling based partitioning to further reduce its overhead. We tested SDS-Sort extensively on Edison, a Cray XC30 supercomputer. Timing measurements show that SDS-Sort can scale to 130K CPU cores and deliver a sorting throughput of 117TB/min. In tests with real application data from large science projects, SDS-Sort outperforms HykSort, a state-of-art parallel sorting algorithm, by 3.4X.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"48 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88920770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
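Sample-based splitter selection is the usual starting point for partition-based parallel sorts of this kind; SDS-Sort's skew-aware partitioning tightens the per-process workload bound beyond the generic step shown here. A single-process sketch (in an MPI setting the sample gather would be an `MPI_Allgather` and the data exchange an `MPI_Alltoallv`):

```c
/* Sketch of the local-sampling step behind partition-based parallel
 * sorts: sample the locally sorted keys, then pick nproc-1 splitters
 * so each process receives a bounded share of the key space. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_long(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

/* Choose nproc-1 splitters from a regular sample of the local keys. */
void choose_splitters(const long *keys, int n, int sample_size,
                      int nproc, long *splitters)
{
    long *sample = malloc(sample_size * sizeof(long));
    for (int i = 0; i < sample_size; i++)
        sample[i] = keys[(long)i * n / sample_size];
    qsort(sample, sample_size, sizeof(long), cmp_long);
    for (int p = 1; p < nproc; p++)
        splitters[p - 1] = sample[(long)p * sample_size / nproc];
    free(sample);
}

int main(void)
{
    long keys[1000];
    for (int i = 0; i < 1000; i++) keys[i] = rand() % 10000;
    /* Regular sampling assumes the local keys are already sorted. */
    qsort(keys, 1000, sizeof(long), cmp_long);

    long splitters[3];
    choose_splitters(keys, 1000, 64, 4, splitters);
    printf("splitters: %ld %ld %ld\n",
           splitters[0], splitters[1], splitters[2]);
    return 0;
}
```

With skewed inputs, evenly spaced splitters like these can still leave one process overloaded, which is the failure mode the paper's skew-aware partition method is designed to bound.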