
Proceedings of the 48th International Conference on Parallel Processing: Latest Publications

Cynthia
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337873
Haoyue Zheng, Fei Xu, Li Chen, Zhi Zhou, Fangming Liu
Training deep neural networks on large-scale datasets in a distributed manner in the cloud has become an increasingly popular trend. However, distributed deep neural network (DDNN) training, widely known as resource-intensive and time-consuming, suffers from unpredictable performance in the cloud due to intricate factors including resource bottlenecks, heterogeneity, and the imbalance between computation and communication, which eventually cause severe resource under-utilization. In this paper, we propose Cynthia, a cost-efficient cloud resource provisioning framework to provide predictable DDNN training performance and reduce the training budget. To explicitly capture resource bottlenecks and heterogeneity, Cynthia predicts the DDNN training time with a lightweight analytical performance model based on the resource consumption of workers and parameter servers. With an accurate performance prediction, Cynthia is able to provision cost-efficient cloud instances that jointly guarantee the training performance and minimize the training budget. We implement Cynthia on top of Kubernetes by launching a 56-container Docker cluster to train four representative DNN models. Extensive prototype experiments on Amazon EC2 demonstrate that Cynthia can provide predictable training performance while reducing the monetary cost of DDNN workloads by up to 50.6% compared to state-of-the-art resource provisioning strategies, with acceptable runtime overhead.
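As a rough illustration of the provisioning idea behind such a framework, the sketch below couples a toy analytical step-time model (compute versus parameter-server communication) with a brute-force search for the cheapest worker/parameter-server mix that meets a target step time. The model, function names, and numbers are illustrative assumptions, not Cynthia's actual performance model.

```python
# Illustrative sketch (not Cynthia's actual model): pick the cheapest number of
# workers/parameter servers whose predicted step time meets a target.
def predicted_step_time(workers, ps, flops_per_step, node_flops,
                        grad_bytes, ps_bandwidth):
    compute = flops_per_step / (workers * node_flops)           # data-parallel compute
    communicate = (grad_bytes * workers) / (ps * ps_bandwidth)  # PS aggregation bottleneck
    return max(compute, communicate)  # the saturated resource dominates the step

def provision(target_step_time, price_per_node_s, max_nodes=64, **model):
    best = None
    for workers in range(1, max_nodes):
        for ps in range(1, max_nodes - workers + 1):
            t = predicted_step_time(workers, ps, **model)
            cost = (workers + ps) * price_per_node_s * t        # cost of one step
            if t <= target_step_time and (best is None or cost < best[0]):
                best = (cost, workers, ps)
    return best  # (cost per step, #workers, #parameter servers) or None if infeasible

# Example with made-up numbers:
print(provision(0.5, price_per_node_s=1e-4, flops_per_step=2e12, node_flops=1e12,
                grad_bytes=1e8, ps_bandwidth=1e9))
```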
{"title":"Cynthia","authors":"Haoyue Zheng, Fei Xu, Li Chen, Zhi Zhou, Fangming Liu","doi":"10.1145/3337821.3337873","DOIUrl":"https://doi.org/10.1145/3337821.3337873","url":null,"abstract":"It becomes an increasingly popular trend for deep neural networks with large-scale datasets to be trained in a distributed manner in the cloud. However, widely known as resource-intensive and time-consuming, distributed deep neural network (DDNN) training suffers from unpredictable performance in the cloud, due to the intricate factors of resource bottleneck, heterogeneity and the imbalance of computation and communication which eventually cause severe resource under-utilization. In this paper, we propose Cynthia, a cost-efficient cloud resource provisioning framework to provide predictable DDNN training performance and reduce the training budget. To explicitly explore the resource bottleneck and heterogeneity, Cynthia predicts the DDNN training time by leveraging a lightweight analytical performance model based on the resource consumption of workers and parameter servers. With an accurate performance prediction, Cynthia is able to optimally provision the cost-efficient cloud instances to jointly guarantee the training performance and minimize the training budget. We implement Cynthia on top of Kubernetes by launching a 56-docker cluster to train four representative DNN models. Extensive prototype experiments on Amazon EC2 demonstrate that Cynthia can provide predictable training performance while reducing the monetary cost for DDNN workloads by up to 50.6%, in comparison to state-of-the-art resource provisioning strategies, yet with acceptable runtime overhead.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130797899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 16
Cooperative Job Scheduling and Data Allocation for Busy Data-Intensive Parallel Computing Clusters
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337864
Guoxin Liu, Haiying Shen, Haoyu Wang
In data-intensive parallel computing clusters, it is important to provide deadline-guaranteed service to jobs while minimizing resource usage (e.g., network bandwidth and energy). Under the current computing framework (which first allocates data and then schedules jobs), it is difficult to achieve these objectives simultaneously in a busy cluster with many jobs. We model the problem of simultaneously achieving these objectives using integer programming, and propose a heuristic Cooperative job Scheduling and data Allocation method (CSA). CSA reverses the order of data allocation and job scheduling in the current computing framework, i.e., changing data-first-job-second to job-first-data-second. This enables CSA to proactively consolidate tasks that request common data onto the same server when conducting deadline-aware scheduling, and to consolidate tasks onto as few servers as possible to maximize energy savings. It also facilitates the subsequent data allocation step, which allocates each data block to the server hosting most of that block's requester tasks, thus maximally enhancing data locality and reducing bandwidth consumption. CSA also has a recursive schedule refinement process that adjusts the job and data allocation schedules to improve system performance regarding the three objectives and to achieve the tradeoff between data locality and energy savings with specified weights. We implemented CSA and a number of previous job schedulers on Apache Hadoop on a real supercomputing cluster. Trace-driven experiments in simulation and on the real cluster show that CSA outperforms other schedulers in supplying deadline-guaranteed and resource-efficient services.
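The job-first-data-second flow can be pictured with the following simplified sketch: tasks that request overlapping data are greedily packed onto few servers, and each data block is then placed on the server hosting most of its requesters. This is an assumption-laden simplification for illustration, not the CSA algorithm itself (it ignores deadlines, energy weights, and the refinement step), and it assumes total slot capacity is sufficient for all tasks.

```python
from collections import defaultdict

def schedule_then_allocate(tasks, servers, slots_per_server):
    """tasks: {task_id: set of block_ids}; servers: list of server ids."""
    placement = {}                       # task_id -> server
    load = {s: 0 for s in servers}
    server_blocks = defaultdict(set)     # blocks already requested by tasks on a server

    # Step 1: job scheduling first, consolidating tasks that share data.
    for task, blocks in tasks.items():
        candidates = [s for s in servers if load[s] < slots_per_server]
        # Prefer the non-full server whose tasks already request the most shared blocks.
        best = max(candidates, key=lambda s: (len(server_blocks[s] & blocks), -load[s]))
        placement[task] = best
        load[best] += 1
        server_blocks[best] |= blocks

    # Step 2: data allocation second, each block goes where most of its requesters run.
    requesters = defaultdict(lambda: defaultdict(int))
    for task, blocks in tasks.items():
        for b in blocks:
            requesters[b][placement[task]] += 1
    allocation = {b: max(counts, key=counts.get) for b, counts in requesters.items()}
    return placement, allocation

jobs = {"t1": {"b1", "b2"}, "t2": {"b2", "b3"}, "t3": {"b4"}}
print(schedule_then_allocate(jobs, servers=["s1", "s2"], slots_per_server=2))
```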
Citations: 1
diBELLA: Distributed Long Read to Long Read Alignment
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337919
Marquita Ellis, Giulia Guidi, A. Buluç, L. Oliker, K. Yelick
We present a parallel algorithm and scalable implementation for genome analysis, specifically the problem of finding overlaps and alignments for data from "third generation" long read sequencers [29]. While long sequences of DNA offer enormous advantages for biological analysis and insight, current long read sequencing instruments have high error rates and therefore require different analysis approaches than their short read counterparts. Our work focuses on an efficient distributed-memory parallelization of an accurate single-node algorithm for overlapping and aligning long reads. We achieve scalability of this irregular algorithm by addressing the competing issues of increasing parallelism, minimizing communication, constraining the memory footprint, and ensuring good load balance. The resulting application, diBELLA, is the first distributed-memory overlapper and aligner specifically designed for long reads and parallel scalability. We describe and analyze high-level design trade-offs and conduct an extensive empirical analysis that compares performance characteristics across state-of-the-art HPC systems as well as a commercial cloud architecture, highlighting the advantages of state-of-the-art network technologies.
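To make the overlap-finding step concrete, the sketch below shows the common k-mer-index approach in a serial, error-free simplification: reads sharing a k-mer become candidate overlap pairs. diBELLA distributes this index across nodes and follows up with alignment; both aspects are omitted here, and the read data and parameter k are made up for illustration.

```python
from collections import defaultdict

def candidate_overlaps(reads, k=15):
    """reads: {read_id: sequence string}. Returns candidate overlapping pairs."""
    index = defaultdict(set)                      # k-mer -> set of read ids containing it
    for rid, seq in reads.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(rid)

    pairs = set()
    for rids in index.values():                   # reads sharing any k-mer are candidates
        rids = sorted(rids)
        for i in range(len(rids)):
            for j in range(i + 1, len(rids)):
                pairs.add((rids[i], rids[j]))
    return pairs

print(candidate_overlaps({"r1": "ACGTACGTACGTACGTAC", "r2": "GTACGTACGTACGTACGA"}, k=8))
```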
Citations: 17
A Plugin Architecture for the TAU Performance System
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337916
A. Malony, Srinivasan Ramesh, K. Huck, Nicholas Chaimov, S. Shende
Several robust performance systems have been created for parallel machines with the ability to observe diverse aspects of application execution on different hardware platforms. All of these are designed with the objective of supporting measurement methods that are efficient, portable, and scalable. For these reasons, the performance measurement infrastructure is tightly embedded with the application code and runtime execution environment. As parallel software and systems evolve, especially towards more heterogeneous, asynchronous, and dynamic operation, the requirements for performance observation and awareness are expected to change. For instance, heterogeneous machines introduce new types of performance data to capture and new performance behaviors to characterize. Furthermore, there is a growing interest in interacting with the performance infrastructure for in situ analytics and policy-based control. The problem is that an existing performance system architecture can be constrained in its ability to evolve to meet these new requirements. This paper reports our research efforts to address this concern in the context of the TAU Performance System. In particular, we consider the use of a powerful plugin model both to capture existing capabilities in TAU and to extend its functionality in ways that were not necessarily conceived originally. The TAU plugin architecture supports three types of plugin paradigms: EVENT, TRIGGER, and AGENT. We demonstrate how each operates under several different scenarios. Results from larger-scale experiments highlight the fact that efficiency and robustness can be maintained, while new flexibility and programmability can be offered that leverages the power of the core TAU system while allowing significant and compelling extensions to be realized.
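The three plugin paradigms can be illustrated with a small dispatcher sketch. TAU's real plugin interface is a C API; the Python class, names, and callbacks below are purely conceptual stand-ins for how EVENT callbacks fire at instrumentation points, TRIGGER callbacks fire on demand, and an AGENT runs periodically on its own thread.

```python
import threading, time

class PluginManager:
    """Conceptual sketch only; not TAU's actual plugin API."""
    def __init__(self):
        self.handlers = {"EVENT": {}, "TRIGGER": {}}

    def register(self, kind, name, callback):
        self.handlers[kind].setdefault(name, []).append(callback)

    def notify(self, name, **data):             # EVENT: fired at instrumentation points
        for cb in self.handlers["EVENT"].get(name, []):
            cb(**data)

    def trigger(self, name, **data):            # TRIGGER: fired on demand by tool or app
        for cb in self.handlers["TRIGGER"].get(name, []):
            cb(**data)

    def start_agent(self, callback, period_s):  # AGENT: periodic work on a separate thread
        def loop():
            while True:
                callback()
                time.sleep(period_s)
        threading.Thread(target=loop, daemon=True).start()

mgr = PluginManager()
mgr.register("EVENT", "function_entry", lambda **d: print("entered", d["fn"]))
mgr.notify("function_entry", fn="main")         # plugins observe events without changing the core
```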
Citations: 3
Parallel Algorithms for Evaluating Matrix Polynomials
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337871
Sivan Toledo, Amit Waisel
We develop and evaluate parallel algorithms for a fundamental problem in numerical computing, namely the evaluation of a polynomial of a matrix. The algorithm consists of many building blocks that can be assembled in several ways. We investigate parallelism in individual building blocks, develop parallel implementations, and assemble them into an overall parallel algorithm. We analyze the effects of both the dimension of the matrix and the degree of the polynomial on arithmetic complexity and on parallelism, and we consequently propose which variants to use in different cases. Our theoretical results indicate that one variant of the algorithm, based on applying the Paterson-Stockmeyer method to the entire matrix, parallelizes very effectively on virtually any matrix dimension and polynomial degree. However, it is not the most efficient from the arithmetic complexity viewpoint. Another algorithm, based on the Davies-Higham block recurrence, is much more efficient from the arithmetic complexity viewpoint, but one of its building blocks is serial. Experimental results on a dual-socket 28-core server show that the first algorithm can effectively use all the cores, but that on high-degree polynomials the second algorithm is often faster, in spite of its sequential phase. This indicates that our parallel algorithms for the other phases are indeed effective.
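For reference, the first variant builds on the Paterson-Stockmeyer scheme, which a minimal serial NumPy sketch can make concrete: precompute A^0 through A^s for s near the square root of the degree, then apply Horner's rule in A^s over blocks of coefficients. This sketch is not the paper's parallel implementation.

```python
import numpy as np

def paterson_stockmeyer(coeffs, A):
    """Evaluate p(A) = sum_i coeffs[i] * A^i; coeffs[i] multiplies A^i."""
    n = A.shape[0]
    d = len(coeffs) - 1                        # polynomial degree
    s = max(1, int(np.ceil(np.sqrt(d + 1))))   # block size ~ sqrt(degree)

    powers = [np.eye(n)]                       # A^0 .. A^s, computed once
    for _ in range(s):
        powers.append(powers[-1] @ A)
    As = powers[s]                             # the "big" power used in Horner steps

    result = np.zeros((n, n))
    num_blocks = int(np.ceil((d + 1) / s))
    for j in reversed(range(num_blocks)):      # Horner in A^s over coefficient blocks
        block = np.zeros((n, n))
        for i in range(s):
            k = j * s + i
            if k <= d:
                block += coeffs[k] * powers[i]
        result = result @ As + block
    return result

# Quick check against naive evaluation.
A = np.array([[0.1, 0.2], [0.0, 0.3]])
coeffs = [1.0, 0.5, 0.25, 0.125, 0.0625]
naive = sum(c * np.linalg.matrix_power(A, i) for i, c in enumerate(coeffs))
assert np.allclose(paterson_stockmeyer(coeffs, A), naive)
```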
Citations: 0
Breaking Band: A Breakdown of High-performance Communication
Rohit Zambre, M. Grodowitz, Aparna Chandramowlishwaran, Pavel Shamis
The critical path of internode communication on large-scale systems is composed of multiple components. When a supercomputing application initiates the transfer of a message using a high-level communication routine such as an MPI_Send, the payload of the message traverses multiple software stacks, the I/O subsystem on both the host and target nodes, and network components such as the switch. In this paper, we analyze where, why, and how much time is spent on the critical path of communication by modeling the overall injection overhead and end-to-end latency of a system. We focus our analysis on the performance of small messages, since fine-grained communication is becoming increasingly important as the number of cores per node continues to grow. The analytical models present an accurate and detailed breakdown of the time spent in internode communication. We validate the models on Arm ThunderX2-based servers connected with Mellanox InfiniBand. This is the first work of this kind on Arm. Alongside our breakdown, we describe the methodology for measuring the time spent in each component, so that readers with access to precise CPU timers and a PCIe analyzer can measure breakdowns on systems of their interest. Such a breakdown is crucial for software developers, system architects, and researchers to guide their optimization efforts. As researchers ourselves, we use the breakdown to simulate the impacts and discuss the likelihoods of a set of optimizations that target the bottlenecks in today's high-performance communication.
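A toy version of such an additive latency model is sketched below; the component names and microsecond values are invented for illustration and are not measurements or terms from the paper.

```python
# Toy additive model of where small-message latency goes; values are made up.
def end_to_end_latency_us(components):
    return sum(components.values())

breakdown = {
    "mpi_software_stack": 0.35,   # MPI and transport-library processing on the host
    "host_pcie_injection": 0.45,  # CPU -> NIC doorbell and DMA of the payload
    "wire_and_switch": 0.55,      # link serialization plus one switch hop
    "target_nic_and_pcie": 0.45,  # NIC -> memory delivery on the target node
    "completion_handling": 0.20,  # target CPU noticing and handling the completion
}
total = end_to_end_latency_us(breakdown)
for name, t in breakdown.items():
    print(f"{name:22s} {t:4.2f} us ({100 * t / total:4.1f}%)")
print(f"{'total':22s} {total:4.2f} us")
```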
Citations: 7
Exploiting Vector Processing in Dynamic Binary Translation
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337844
Chih-Min Lin, Sheng-Yu Fu, Ding-Yong Hong, Yu-Ping Liu, Jan-Jan Wu, W. Hsu
Auto-vectorization techniques have been adopted by compilers for decades to exploit data-level parallelism. However, since processor architectures keep adding new features to improve vector/SIMD performance, legacy application binaries fail to fully exploit the new vector/SIMD capabilities of modern architectures. For example, legacy ARMv7 binaries cannot benefit from the ARMv8 SIMD double-precision capability, and legacy x86 binaries cannot enjoy the power of the AVX-512 extensions. In this paper, we study the fundamental issues involved in cross-ISA Dynamic Binary Translation (DBT) to convert non-vectorized loops to vector/SIMD forms and thereby achieve the greater computation throughput available in newer processor architectures. The key idea is to recover critical loop information from the application binaries in order to carry out vectorization at runtime. Experimental results show that our approach achieves an average speedup of 1.42x compared to native ARMv7 execution across various benchmarks in an ARMv7-to-ARMv8 dynamic binary translation system.
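The transformation a vectorizing translator aims for can be shown at source level: strip-mine a scalar loop into fixed-width chunks that map to SIMD registers. The NumPy sketch below only illustrates that idea; the actual DBT operates on guest machine code rather than Python, and the vector width here is arbitrary.

```python
import numpy as np

def saxpy_scalar(a, x, y):
    out = np.empty_like(y)
    for i in range(len(x)):                    # one element per iteration
        out[i] = a * x[i] + y[i]
    return out

def saxpy_vectorized(a, x, y, width=4):
    out = np.empty_like(y)
    for i in range(0, len(x), width):          # one "vector" of `width` lanes per iteration
        out[i:i + width] = a * x[i:i + width] + y[i:i + width]
    return out

x, y = np.arange(8.0), np.ones(8)
assert np.allclose(saxpy_scalar(2.0, x, y), saxpy_vectorized(2.0, x, y))
```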
Citations: 0
I/O Characterization and Performance Evaluation of BeeGFS for Deep Learning
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337902
Fahim Chowdhury, Yue Zhu, T. Heer, Saul Paredes, A. Moody, R. Goldstone, K. Mohror, Weikuan Yu
Parallel File Systems (PFSs) are frequently deployed on leadership High Performance Computing (HPC) systems to ensure efficient I/O, persistent storage, and scalable performance. Emerging Deep Learning (DL) applications impose new I/O and storage requirements on HPC systems with batched input of many small, randomly accessed files. This requires PFSs to have commensurate features that can meet the needs of DL applications. BeeGFS is a recently emerging PFS that has attracted attention from the research and industry communities because of its performance, scalability, and ease of use. While emphasizing a systematic performance analysis of BeeGFS, in this paper we present the architectural and system features of BeeGFS and perform an experimental evaluation using cutting-edge I/O, metadata, and DL application benchmarks. In particular, we have utilized AlexNet and ResNet-50 models for the classification of the ImageNet dataset using the Livermore Big Artificial Neural Network toolkit (LBANN), and an ImageNet data reader pipeline atop TensorFlow and Horovod. Through extensive performance characterization of BeeGFS, our study provides useful documentation on how to leverage BeeGFS for emerging DL applications.
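The I/O pattern that stresses a PFS under DL training can be sketched generically: shuffle a large set of small sample files every epoch and read them in batches. The reader below is a hypothetical stand-in, not the LBANN or TensorFlow/Horovod pipelines evaluated in the paper, and the directory path is a placeholder.

```python
import os, random

def batched_reader(data_dir, batch_size, epochs=1, seed=0):
    """Yield batches of raw file contents; each sample is a separate small file."""
    files = sorted(os.path.join(data_dir, f) for f in os.listdir(data_dir))
    rng = random.Random(seed)
    for _ in range(epochs):
        rng.shuffle(files)                       # random order -> many small random reads on the PFS
        for i in range(0, len(files), batch_size):
            batch = []
            for path in files[i:i + batch_size]:
                with open(path, "rb") as fh:
                    batch.append(fh.read())
            yield batch

# for batch in batched_reader("/path/to/imagenet_samples", batch_size=256): ...
```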
Citations: 33
Design Exploration of Multi-tier Interconnection Networks for Exascale Systems
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337903
J. Navaridas, Joshua Lant, J. A. Pascual, M. Luján, J. Goodacre
Interconnection networks are one of the main limiting factors when it comes to scaling out computing systems. In this paper, we explore the role that hybridization of topologies plays in the design of a state-of-the-art exascale-capable computing system. More precisely, we compare several hybrid topologies against common single-topology networks when dealing with large-scale, application-like traffic. In addition, we explore how different aspects of the hybrid topology affect the overall performance of the system. In particular, we found that hybrid topologies can outperform state-of-the-art torus and fat-tree networks as long as the density of connections is high enough (one connection every two or four nodes seems to be the sweet spot) and the size of the subtori is limited to a few nodes per dimension. Moreover, we explored two different alternatives for the upper tiers of the interconnect, a fat-tree and a generalised hypercube, and found little difference between the topologies, mostly depending on the workload to be executed.
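As a rough illustration of what a two-tier hybrid topology looks like, the sketch below wires nodes into small rings (1-D subtori), gives every density-th node an uplink to its group router, and connects the routers all-to-all. The construction and parameters are illustrative assumptions; the paper's actual designs use fat-tree or generalised-hypercube upper tiers and far larger scales.

```python
from collections import defaultdict

def hybrid_topology(groups, nodes_per_group, density=2):
    """Return an adjacency map for a toy two-tier hybrid network."""
    adj = defaultdict(set)
    routers = [f"R{g}" for g in range(groups)]
    for g in range(groups):
        for n in range(nodes_per_group):
            a = f"G{g}N{n}"
            b = f"G{g}N{(n + 1) % nodes_per_group}"
            adj[a].add(b); adj[b].add(a)                        # ring links within the subtorus
            if n % density == 0:                                # one uplink every `density` nodes
                adj[a].add(routers[g]); adj[routers[g]].add(a)
    for i, r1 in enumerate(routers):                            # all-to-all upper tier
        for r2 in routers[i + 1:]:
            adj[r1].add(r2); adj[r2].add(r1)
    return adj

print(len(hybrid_topology(groups=4, nodes_per_group=8, density=2)))  # 32 nodes + 4 routers = 36
```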
Citations: 1
HPAS
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337907
E. Ates, Yijia Zhang, Burak Aksar, Jim Brandt, V. Leung, Manuel Egele, A. Coskun
Modern high performance computing (HPC) systems, including supercomputers, routinely suffer from substantial performance variations. The same application with the same input can have more than 100% performance variation, and such variations cause reduced efficiency and wasted resources. There have been recent studies on performance variability and on designing automated methods for diagnosing "anomalies" that cause performance variability. These studies either observe data collected from HPC systems, or they rely on synthetic reproduction of performance variability scenarios. However, there is no standardized way of creating performance variability inducing synthetic anomalies; so, researchers rely on designing ad-hoc methods for reproducing performance variability. This paper addresses this lack of a common method for creating relevant performance anomalies by introducing HPAS, an HPC Performance Anomaly Suite, consisting of anomaly generators for the major subsystems in HPC systems. These easy-to-use synthetic anomaly generators facilitate low-effort evaluation and comparison of various analytics methods as well as performance or resilience of applications, middleware, or systems under realistic performance variability scenarios. The paper also provides an analysis of the behavior of the anomaly generators and demonstrates several use cases: (1) performance anomaly diagnosis using HPAS, (2) evaluation of resource management policies under performance variations, and (3) design of applications that are resilient to performance variability.
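The general idea of a synthetic anomaly generator can be shown with a minimal CPU-contention example: spin for a chosen duty cycle so that co-located work experiences interference. This is only a conceptual sketch; HPAS provides a suite of carefully designed generators covering the major HPC subsystems, and the function below is not part of it.

```python
import time

def cpu_anomaly(duration_s, duty_cycle=0.8, period_s=0.1):
    """Burn CPU for `duty_cycle` of each period, for `duration_s` seconds total."""
    end = time.time() + duration_s
    while time.time() < end:
        busy_until = time.time() + duty_cycle * period_s
        while time.time() < busy_until:
            pass                                   # spin: steal cycles from co-located work
        time.sleep((1.0 - duty_cycle) * period_s)  # idle for the rest of the period

# cpu_anomaly(duration_s=5, duty_cycle=0.8)  # roughly 80% of one core for 5 seconds
```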
{"title":"HPAS","authors":"E. Ates, Yijia Zhang, Burak Aksar, Jim Brandt, V. Leung, Manuel Egele, A. Coskun","doi":"10.1145/3337821.3337907","DOIUrl":"https://doi.org/10.1145/3337821.3337907","url":null,"abstract":"Modern high performance computing (HPC) systems, including supercomputers, routinely suffer from substantial performance variations. The same application with the same input can have more than 100% performance variation, and such variations cause reduced efficiency and wasted resources. There have been recent studies on performance variability and on designing automated methods for diagnosing \"anomalies\" that cause performance variability. These studies either observe data collected from HPC systems, or they rely on synthetic reproduction of performance variability scenarios. However, there is no standardized way of creating performance variability inducing synthetic anomalies; so, researchers rely on designing ad-hoc methods for reproducing performance variability. This paper addresses this lack of a common method for creating relevant performance anomalies by introducing HPAS, an HPC Performance Anomaly Suite, consisting of anomaly generators for the major subsystems in HPC systems. These easy-to-use synthetic anomaly generators facilitate low-effort evaluation and comparison of various analytics methods as well as performance or resilience of applications, middleware, or systems under realistic performance variability scenarios. The paper also provides an analysis of the behavior of the anomaly generators and demonstrates several use cases: (1) performance anomaly diagnosis using HPAS, (2) evaluation of resource management policies under performance variations, and (3) design of applications that are resilient to performance variability.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114455563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12