Processor Pipelining Method for Efficient Deep Neural Network Inference on Embedded Devices
Pub Date: 2020-12-01 | DOI: 10.1109/HiPC50609.2020.00022
Akshay Parashar, Arun Abraham, Deepak Chaudhary, V. N. Rajendiran
Myriad applications of Deep Neural Networks (DNNs) and the race for better accuracy have paved the way for the development of more computationally intensive network architectures. Executing these heavy networks on embedded devices requires highly efficient real-time DNN inference frameworks. However, the sequential architecture of popular DNNs makes it difficult to parallelize their operations across different processors. We propose a novel pipelining method, pluggable on top of conventional inference frameworks, that parallelizes DNN inference on heterogeneous processors without impacting accuracy. We partition the network into subnets by estimating the optimal split points and pipeline these subnets across multiple processors. The results show that the proposed method achieves up to 68% improvement in the frames-per-second (FPS) rate of popular network architectures such as VGG19, DenseNet-121, and ResNet-152. Moreover, we show that our method can extract even more performance out of high-performance chipsets by better utilizing their AI processor ecosystems. We also show that our method can be easily extended to low-performance chipsets, where this additional performance gain is crucial for deploying real-time AI applications. Our results show a performance improvement of up to 47% in the FPS rate on these chipsets without the need for specialized AI hardware.
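The core idea of splitting the network at an estimated split point and pipelining the resulting subnets can be illustrated with a small, self-contained sketch. Everything below (the two-subnet split, the simulated 10 ms stage latencies, the queue-based stage workers) is an illustrative assumption rather than the paper's implementation; in a real deployment each subnet would run on a different processor through the underlying inference framework.

```python
# Hedged sketch: pipelining a DNN split into two subnets across two
# (hypothetical) processors using worker threads and queues. Split point,
# per-subnet latencies, and processor mapping are illustrative assumptions.
import queue
import threading
import time

def subnet_a(frame):           # e.g. layers 1..k, mapped to processor A
    time.sleep(0.010)          # pretend compute time
    return f"features({frame})"

def subnet_b(features):        # e.g. layers k+1..N, mapped to processor B
    time.sleep(0.010)
    return f"prediction({features})"

def stage(fn, q_in, q_out):
    while True:
        item = q_in.get()
        if item is None:       # poison pill terminates the stage
            q_out.put(None)
            return
        q_out.put(fn(item))

q01, q12, q_out = queue.Queue(maxsize=2), queue.Queue(maxsize=2), queue.Queue()
threading.Thread(target=stage, args=(subnet_a, q01, q12), daemon=True).start()
threading.Thread(target=stage, args=(subnet_b, q12, q_out), daemon=True).start()

start = time.time()
for frame in range(100):
    q01.put(frame)             # frame t+1 enters stage A while frame t is in stage B
q01.put(None)
results = []
while (r := q_out.get()) is not None:
    results.append(r)
print(f"{len(results)} frames in {time.time() - start:.2f}s")
```

Because frame t+1 enters the first subnet while frame t is still in the second, steady-state throughput approaches that of the slower stage rather than the sum of both stage latencies, which is where the FPS gain comes from.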
{"title":"Processor Pipelining Method for Efficient Deep Neural Network Inference on Embedded Devices","authors":"Akshay Parashar, Arun Abraham, Deepak Chaudhary, V. N. Rajendiran","doi":"10.1109/HiPC50609.2020.00022","DOIUrl":"https://doi.org/10.1109/HiPC50609.2020.00022","url":null,"abstract":"Myriad applications of Deep Neural Networks (DNN) and the race for better accuracy have paved the way for the development of more computationally intensive network architectures. Execution of these heavy networks on embedded devices needs highly efficient real-time DNN inference frameworks. But the sequential architecture of popular DNNs makes it difficult to parallelize its operations among different processors. We propose a novel pipelining method pluggable on top of conventional inference frameworks and capable of parallelizing DNN inference on heterogeneous processors without impacting the accuracy. We partition the network into subnets, by estimating the optimal split points, and pipeline these subnets across multiple processors. The results shows that the proposed method achieves up to 68% improvement in the frames per second (FPS) rate of popular network architectures like VGG19, DenseNet-121 and ResNet-152. Moreover, we show that our method can be used to extract even more performance out of high performance chipsets, by better utilizing the capabilities of its AI processor ecosystem. We also showcase that our method can be easily extended to other low performance chipsets, where this additional performance gain is crucial to deploy real-time AI applications. Our results show performance improvement of up to 47% in the FPS rate on these chipsets without the need of specialized AI hardware.","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115450802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
AMCilk: A Framework for Multiprogrammed Parallel Workloads
Pub Date: 2020-12-01 | DOI: 10.1109/HiPC50609.2020.00035
Zhe Wang, Chen Xu, Kunal Agrawal, Jing Li
Modern parallel platforms, such as clouds or servers, are often shared among many different jobs. However, existing parallel programming runtime systems are designed and optimized for running a single parallel job, so it is generally hard to use them directly to schedule multiple parallel jobs without incurring high overhead and inefficiency. In this work, we develop AMCilk (Adaptive Multiprogrammed Cilk), a novel runtime system framework designed to support multiprogrammed parallel workloads. AMCilk has a client-server architecture in which users can dynamically submit parallel jobs to the system. A single runtime system runs these jobs while dynamically reallocating cores, last-level cache, and memory bandwidth among them according to the scheduling policy. AMCilk exposes an interface to the system designer that makes it easy to build different scheduling policies meeting the requirements of various application scenarios and performance metrics, while AMCilk transparently (to designers) enforces the scheduling policy. The primary feature of AMCilk is its low-overhead, responsive preemption mechanism that allows fast reallocation of cores between jobs. Our empirical evaluation indicates that AMCilk incurs small overheads and, thanks to this fast core reallocation mechanism, provides significant benefits on application-specific criteria for a set of four practical applications.
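A minimal sketch of what a pluggable scheduling-policy interface of this kind might look like is given below. The class and method names (Job, EqualShare, Runtime.submit) are hypothetical and chosen for illustration; they are not AMCilk's actual API, and real reallocation would also cover last-level cache and memory bandwidth, not just cores.

```python
# Hedged sketch of a pluggable scheduling-policy interface in the spirit of
# the abstract: the designer supplies a policy object, and the runtime calls
# it at job arrival to obtain a new core allocation. Names are illustrative.
from dataclasses import dataclass

@dataclass
class Job:
    job_id: int
    demand: int        # cores the job could productively use

class EqualShare:
    """Give every active job an equal share of the machine's cores."""
    def allocate(self, jobs, total_cores):
        if not jobs:
            return {}
        share = max(1, total_cores // len(jobs))
        return {j.job_id: min(share, j.demand) for j in jobs}

class Runtime:
    def __init__(self, total_cores, policy):
        self.total_cores, self.policy, self.jobs = total_cores, policy, []
    def submit(self, job):
        self.jobs.append(job)
        # reallocation point: the runtime enforces whatever the policy returns
        return self.policy.allocate(self.jobs, self.total_cores)

rt = Runtime(total_cores=16, policy=EqualShare())
print(rt.submit(Job(1, demand=16)))   # {1: 16}
print(rt.submit(Job(2, demand=4)))    # {1: 8, 2: 4}
```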
{"title":"AMCilk: A Framework for Multiprogrammed Parallel Workloads","authors":"Zhe Wang, Chen Xu, Kunal Agrawal, Jing Li","doi":"10.1109/HiPC50609.2020.00035","DOIUrl":"https://doi.org/10.1109/HiPC50609.2020.00035","url":null,"abstract":"Modern parallel platforms, such as clouds or servers, are often shared among many different jobs. However, existing parallel programming runtime systems are designed and optimized for running a single parallel job, so it is generally hard to directly use them to schedule multiple parallel jobs without incurring high overhead and inefficiency. In this work, we develop AMCilk (Adaptive Multiprogrammed Cilk), a novel runtime system framework, designed to support multiprogrammed parallel workloads. AMCilk has client-server architecture where users can dynamically submit parallel jobs to the system. AMCilk has a single runtime system that runs these jobs while dynamically reallocating cores, last-level cache, and memory bandwidth among these jobs according to the scheduling policy. AMCilk exposes the interface to the system designer, which allows the designer to easily build different scheduling policies meeting the requirements of various application scenarios and performance metrics, while AMCilk transparently (to designers) enforces the scheduling policy. The primary feature of AMCilk is the low-overhead and responsive preemption mechanism that allows fast reallocation of cores between jobs. Our empirical evaluation indicates that AMCilk incurs small overheads and provides significant benefits on application-specific criteria for a set of 4 practical applications due to its fast and low-overhead core reallocation mechanism.","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"311 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116488638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Design and Study of Elastic Recovery in HPC Applications
Pub Date: 2020-12-01 | DOI: 10.1109/HiPC50609.2020.00040
Kai Keller, K. Parasyris, L. Bautista-Gomez
The efficient utilization of current supercomputing systems with deep storage hierarchies demands scientific applications that are capable of leveraging such heterogeneous hardware. Fault tolerance, and checkpointing in particular, is one of the most time-consuming aspects if not handled correctly. High checkpoint performance can be achieved using optimized multilevel checkpoint and restart libraries. Unfortunately, those libraries do not allow restarts with a modified number of processes or scientific post-processing of the checkpointed data, because they typically use an N-N checkpointing scheme and opaque file formats. In this article, we present a novel mechanism to asynchronously store checkpoints into a self-descriptive file format and to load the data upon recovery with a different number of processes. We provide an API that defines the process-local data as part of a globally shared dataset. Our measurements demonstrate a low overhead of between 0.6% and 2.5% for a 2.25 TB checkpoint with 6K processes.
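The notion of declaring process-local data as a slice of a globally shared dataset, so that a restart may use a different process count, can be sketched with plain HDF5 via h5py as below. This is only a conceptual illustration under the assumption that h5py is available; the paper's own API, file layout, and asynchronous staging are different.

```python
# Hedged sketch of the N-to-M restart concept: process-local data is described
# as a slice of one global dataset, so the process count may change between
# checkpoint and restart. Plain h5py stands in for the paper's library.
import numpy as np
import h5py

GLOBAL_N = 12

def my_slice(rank, nprocs):
    """Contiguous 1D decomposition of a GLOBAL_N-element global dataset."""
    base, rem = divmod(GLOBAL_N, nprocs)
    lo = rank * base + min(rank, rem)
    return slice(lo, lo + base + (1 if rank < rem else 0))

# Checkpoint written by 4 "processes" (simulated sequentially here).
with h5py.File("ckpt.h5", "w") as f:
    dset = f.create_dataset("temperature", shape=(GLOBAL_N,), dtype="f8")
    for rank in range(4):
        s = my_slice(rank, 4)
        dset[s] = np.full(s.stop - s.start, float(rank))  # process-local data

# Restart with 3 processes: each one reads *its own* slice of the same
# self-descriptive global dataset.
with h5py.File("ckpt.h5", "r") as f:
    for rank in range(3):
        print(rank, f["temperature"][my_slice(rank, 3)])
```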
{"title":"Design and Study of Elastic Recovery in HPC Applications","authors":"Kai Keller, K. Parasyris, L. Bautista-Gomez","doi":"10.1109/HiPC50609.2020.00040","DOIUrl":"https://doi.org/10.1109/HiPC50609.2020.00040","url":null,"abstract":"The efficient utilization of current supercomputing systems with deep storage hierarchies demands scientific applications that are capable of leveraging such heterogeneous hardware. Fault tolerance, and checkpointing in particular, is one of the most time-consuming aspects if not handled correctly. High checkpoint performance can be achieved using optimized multilevel checkpoint and restart libraries. Unfortunately, those libraries do not allow for restarts with a modified number of processes or scientific post-processing of the checkpointed data. This is because they typically use an N-N checkpointing scheme and opaque file-formats. In this article, we present a novel mechanism to asynchronously store checkpoints into a self-descriptive file format and load the data upon recovery with a different number of processes. We provide an API that defines the process-local data as part of a globally shared dataset. Our measurements demonstrate a low overhead between 0.6% and 2.5% for a 2.25 TB checkpoint with 6K processes.","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127432249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Avoiding Communication in Logistic Regression
Pub Date: 2020-11-16 | DOI: 10.1109/HiPC50609.2020.00023
Aditya Devarakonda, J. Demmel
Stochastic gradient descent (SGD) is one of the most widely used optimization methods for solving various machine learning problems. SGD solves an optimization problem by iteratively sampling a few data points from the input data, computing gradients for the selected data points, and updating the solution. However, in a parallel setting, SGD requires interprocess communication at every iteration. We introduce a new communication-avoiding technique for solving the logistic regression problem using SGD. This technique re-organizes the SGD computations into a form that communicates every $s$ iterations instead of every iteration, where $s$ is a tuning parameter. We prove theoretical flops, bandwidth, and latency upper bounds for SGD and its new communication-avoiding variant. Furthermore, we show experimental results illustrating that the new Communication-Avoiding SGD (CA-SGD) method can achieve speedups of up to 4.97× on a high-performance InfiniBand cluster without altering the convergence behavior or accuracy.
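The communication pattern at stake can be illustrated with a small simulated sketch: standard parallel SGD performs one gradient allreduce per iteration, while an s-step variant communicates once every s iterations. Note that the naive deferral below changes the iterates (gradients are computed against a stale w), whereas the paper's reorganization communicates every s iterations without altering convergence behavior; the sketch, with simulated workers and a comm_rounds counter standing in for allreduce calls, only illustrates the reduction in communication rounds.

```python
# Hedged sketch (not the paper's derivation): contrast per-iteration
# communication with communication once every s iterations. Workers are
# simulated in-process with numpy; comm_rounds counts the points where a
# real implementation would perform an allreduce.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def parallel_sgd(X_parts, y_parts, lr=0.1, iters=60, s=1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X_parts[0].shape[1])
    comm_rounds, pending = 0, np.zeros_like(w)
    for t in range(iters):
        # each "worker" computes a gradient on one sampled point of its shard
        grads = []
        for X, y in zip(X_parts, y_parts):
            i = rng.integers(len(y))
            grads.append((sigmoid(X[i] @ w) - y[i]) * X[i])
        pending += np.mean(grads, axis=0)
        if (t + 1) % s == 0:              # communication point
            comm_rounds += 1
            w -= lr * pending
            pending[:] = 0.0
    return w, comm_rounds

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
y = (X @ rng.normal(size=5) > 0).astype(float)
X_parts, y_parts = np.array_split(X, 4), np.array_split(y, 4)
for s in (1, 8):
    _, rounds = parallel_sgd(X_parts, y_parts, s=s)
    print(f"s={s}: {rounds} communication rounds")   # 60 vs 7 rounds
```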
{"title":"Avoiding Communication in Logistic Regression","authors":"Aditya Devarakonda, J. Demmel","doi":"10.1109/HiPC50609.2020.00023","DOIUrl":"https://doi.org/10.1109/HiPC50609.2020.00023","url":null,"abstract":"Stochastic gradient descent (SGD) is one of the most widely used optimization methods for solving various machine learning problems. SGD solves an optimization problem by iteratively sampling a few data points from the input data, computing gradients for the selected data points, and updating the solution. However, in a parallel setting, SGD requires interprocess communication at every iteration. We introduce a new communication-avoiding technique for solving the logistic regression problem using SGD. This technique re-organizes the SGD computations into a form that communicates every $s$ iterations instead of every iteration, where $s$ is a tuning parameter. We prove theoretical flops, bandwidth, and latency upper bounds for SGD and its new communication-avoiding variant. Furthermore, we show experimental results that illustrate that the new Communication-Avoiding SGD (CA-SGD) method can achieve speedups of up to 4.97× on a high-performance Infiniband cluster without altering the convergence behavior or accuracy.","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132912788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Distributing Sparse Matrix/Graph Applications in Heterogeneous Clusters - an Experimental Study
Pub Date: 2020-11-03 | DOI: 10.1109/HiPC50609.2020.00021
Charilaos Tzovas, Maria Predari, Henning Meyerhenke
Many problems in scientific and engineering applications contain sparse matrices or graphs as their main input objects, e.g., numerical simulations on meshes. Large inputs are abundant these days and require parallel processing for reasons of memory size and speed. To optimize the execution of such simulations on cluster systems, the input problem needs to be distributed suitably onto the processing units (PUs). More and more frequently, such clusters contain different CPUs or a combination of CPUs and GPUs. This heterogeneity makes the load distribution problem quite challenging. Our study is motivated by the observation that established partitioning tools do not handle such heterogeneous distribution problems as well as homogeneous ones. In this paper, we first formulate the problem of balanced load distribution for heterogeneous architectures as a multiobjective, single-constraint optimization problem. We then split the problem into two phases and propose a greedy approach to determine optimal block sizes for each PU. These block sizes are then fed into several existing graph partitioners so that we can examine how well they handle the above problem. One of the tools we consider is an extension of our own previous work (von Looz et al., ICPP'18) called Geographer. Our experiments on well-known benchmark meshes indicate that only two of the tools under consideration are able to yield good quality. These two are ParMetis (both the geometric and the combinatorial variant) and Geographer. While ParMetis is faster, Geographer yields better quality on average.
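One plausible greedy rule for the first phase, determining per-PU block sizes from relative PU speeds, is sketched below: each unit of work goes to the PU that would finish its current load first. The rule, the speed values, and the vertex count are illustrative assumptions and not necessarily the authors' exact procedure; the resulting sizes would then be passed as target block weights to a partitioner such as ParMetis or Geographer.

```python
# Hedged sketch of a greedy block-size computation for heterogeneous PUs:
# repeatedly give the next unit of work to the PU with the smallest
# (load + 1) / speed ratio. Illustrative only.
def greedy_block_sizes(total_vertices, speeds):
    loads = [0] * len(speeds)
    for _ in range(total_vertices):
        # next vertex goes to the PU that would finish its new load first
        k = min(range(len(speeds)), key=lambda i: (loads[i] + 1) / speeds[i])
        loads[k] += 1
    return loads

# e.g. 3 CPUs and 1 GPU assumed to be 4x faster than one CPU
sizes = greedy_block_sizes(total_vertices=100_000, speeds=[1, 1, 1, 4])
print(sizes)   # roughly proportional to speed: ~14286 per CPU, ~57142 for the GPU
```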
{"title":"Distributing Sparse Matrix/Graph Applications in Heterogeneous Clusters - an Experimental Study","authors":"Charilaos Tzovas, Maria Predari, Henning Meyerhenke","doi":"10.1109/HiPC50609.2020.00021","DOIUrl":"https://doi.org/10.1109/HiPC50609.2020.00021","url":null,"abstract":"Many problems in scientific and engineering applications contain sparse matrices or graphs as main input objects, e.g., numerical simulations on meshes. Large inputs are abundant these days and require parallel processing for memory size and speed. To optimize the execution of such simulations on cluster systems, the input problem needs to be distributed suitably onto the processing units (PUs). More and more frequently, such clusters contain different CPUs or a combination of CPUs and GPUs. This heterogeneity makes the load distribution problem quite challenging. Our study is motivated by the observation that established partitioning tools do not handle such heterogeneous distribution problems as well as homogeneous ones. In this paper, we first formulate the problem of balanced load distribution for heterogeneous architectures as a multiobjective, single-constraint optimization problem. We then split the problem into two phases and propose a greedy approach to determine optimal block sizes for each PU. These block sizes are then fed into numerous existing graph partitioners, for us to examine how well they handle the above problem. One of the tools we consider is an extension of our own previous work (von Looz et al., ICPP'18) called Geographer. Our experiments on well-known benchmark meshes indicate that only two tools under consideration are able to yield good quality. These two are ParMetis (both the geometric and the combinatorial variant) and Geographer. While ParMetis is faster, Geographer yields better quality on average.","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"118 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133101546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
WarpCore: A Library for fast Hash Tables on GPUs
Pub Date: 2020-09-16 | DOI: 10.1109/HiPC50609.2020.00015
Daniel Jünger, Robin Kobus, André Müller, Christian Hundt, Kai Xu, Weiguo Liu, B. Schmidt
Hash tables are ubiquitous. Properties such as amortized constant time complexity for insertion and querying as well as a compact memory layout make them versatile associative data structures with manifold applications. The rapidly growing amount of data emerging in many fields has motivated the need for accelerated hash tables designed for modern parallel architectures. In this work, we exploit the fast memory interface of modern GPUs together with a parallel hashing scheme tailored to improve global memory access patterns to design WarpCore, a versatile library of hash table data structures. Unique device-sided operations allow for building high-performance data processing pipelines entirely on the GPU. Our implementation achieves up to 1.6 billion inserts and up to 4.3 billion retrievals per second on a single GV100 GPU, thereby outperforming the state-of-the-art solutions cuDPP, SlabHash, and NVIDIA RAPIDS cuDF. This performance advantage becomes even more pronounced at high load factors of over 90%. To overcome the memory limitation of a single GPU, we scale our approach over a dense NVLink topology, which gives us close-to-optimal weak scaling on DGX servers. We further show how WarpCore can be used to accelerate a real-world bioinformatics application (metagenomic classification) with speedups of over two orders of magnitude over state-of-the-art CPU-based solutions. WarpCore is open-source software written in C++/CUDA-C and can be downloaded at https://github.com/sleeepyjack/warpcore.
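The flavor of a bucketed probing scheme, in which a whole bucket of slots is inspected at once (loosely mirroring how a warp can cooperatively probe consecutive slots with coalesced accesses), can be sketched on the CPU with numpy as below. The bucket size, table size, and probing rule are illustrative assumptions; this is not WarpCore's actual data layout, hashing scheme, or API.

```python
# Hedged conceptual sketch: a bucketed hash table where each bucket of 32
# slots is probed as a unit (one slot per "lane"). Plain numpy illustration,
# not WarpCore's implementation.
import numpy as np

EMPTY = np.uint64(0)          # reserve key 0 as the empty marker
BUCKET, NBUCKETS = 32, 1024   # 32 slots per bucket, like one slot per warp lane

keys = np.zeros((NBUCKETS, BUCKET), dtype=np.uint64)
vals = np.zeros((NBUCKETS, BUCKET), dtype=np.uint64)

def insert(k, v):
    b = hash(int(k)) % NBUCKETS
    for probe in range(NBUCKETS):                      # probe bucket-by-bucket
        row = (b + probe) % NBUCKETS
        free = np.flatnonzero(keys[row] == EMPTY)      # "all lanes" checked at once
        if free.size:
            keys[row, free[0]], vals[row, free[0]] = k, v
            return True
    return False                                       # table full

def retrieve(k):
    b = hash(int(k)) % NBUCKETS
    for probe in range(NBUCKETS):
        row = (b + probe) % NBUCKETS
        hit = np.flatnonzero(keys[row] == k)
        if hit.size:
            return vals[row, hit[0]]
        if (keys[row] == EMPTY).any():                 # an empty slot ends the chain
            return None
    return None

insert(42, 7)
print(retrieve(42), retrieve(43))                      # 7 None
```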
{"title":"WarpCore: A Library for fast Hash Tables on GPUs","authors":"Daniel Jünger, Robin Kobus, André Müller, Christian Hundt, Kai Xu, Weiguo Liu, B. Schmidt","doi":"10.1109/HiPC50609.2020.00015","DOIUrl":"https://doi.org/10.1109/HiPC50609.2020.00015","url":null,"abstract":"Hash tables are ubiquitous. Properties such as an amortized constant time complexity for insertion and querying as well as a compact memory layout make them versatile associative data structures with manifold applications. The rapidly growing amount of data emerging in many fields motivated the need for accelerated hash tables designed for modern parallel architectures. In this work, we exploit the fast memory interface of modern GPUs together with a parallel hashing scheme tailored to improve global memory access patterns, to design WarpCore – a versatile library of hash table data structures. Unique device-sided operations allow for building high performance data processing pipelines entirely on the GPU. Our implementation achieves up to 1.6 billion inserts and up to 4.3 billion retrievals per second on a single GV100 GPU thereby outperforming the state-of-the-art solutions cuDPP, SlabHash, and NVIDIA RAPIDS cuDF. This performance advantage becomes even more pronounced for high load factors of over 90%. To overcome the memory limitation of a single GPU, we scale our approach over a dense NVLink topology which gives us close-to-optimal weak scaling on DGX servers. We further show how WarpCore can be used for accelerating a real world bioinformatics application (metagenomic classification) with speedups of over two orders-of-magnitude against state-of-the-art CPU-based solutions. WarpCore is open source software written in C++/CUDA-C and can be downloaded at https://github.com/sleeepyjack/warpcore.","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"642 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116473920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Extending SLURM for Dynamic Resource-Aware Adaptive Batch Scheduling
Pub Date: 2020-09-16 | DOI: 10.1109/HiPC50609.2020.00036
Mohak Chadha, Jophin John, M. Gerndt
With growing constraints on power budgets and increasing hardware failure rates, the operation of future exascale systems faces several challenges. To address this, resource awareness and adaptivity through malleable jobs have been actively researched in the HPC community. Malleable jobs can change their computing resources at runtime and can significantly improve HPC system performance. However, due to the rigid nature of popular parallel programming paradigms such as MPI and the lack of support for dynamic resource management in batch systems, malleable jobs have remained largely unrealized. In this paper, we extend the SLURM batch system to support the execution and batch scheduling of malleable jobs. The malleable applications are written using a new adaptive parallel paradigm called Invasive MPI, which extends the MPI standard to support resource adaptivity at runtime. We propose two malleable job scheduling strategies to support performance-aware and power-aware dynamic reconfiguration decisions at runtime. We implement the strategies in SLURM and evaluate them on a production HPC system. Results for our performance-aware scheduling strategy show improvements in makespan, average system utilization, and average response and waiting times compared to other scheduling strategies. Moreover, we demonstrate dynamic power corridor management using our power-aware strategy.
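A toy sketch of a power-corridor decision of the kind the abstract mentions is shown below: if the estimated system power leaves a [low, high] corridor, malleable jobs are shrunk or expanded one node at a time. The per-node power constant, job names, and shrink/expand heuristics are illustrative assumptions; this is neither the paper's algorithm nor SLURM's API.

```python
# Hedged sketch of power-corridor enforcement over malleable jobs
# (illustrative only): shrink the largest shrinkable job while above the
# upper bound, expand the smallest growable job while below the lower bound.
from dataclasses import dataclass

NODE_POWER_W = 300            # assumed average power draw per allocated node

@dataclass
class MalleableJob:
    name: str
    nodes: int
    min_nodes: int
    max_nodes: int

def enforce_corridor(jobs, low_w, high_w):
    def total():
        return sum(j.nodes for j in jobs) * NODE_POWER_W
    while total() > high_w:
        shrinkable = [j for j in jobs if j.nodes > j.min_nodes]
        if not shrinkable:
            break
        max(shrinkable, key=lambda j: j.nodes).nodes -= 1   # shrink largest job
    while total() < low_w:
        growable = [j for j in jobs if j.nodes < j.max_nodes]
        if not growable:
            break
        min(growable, key=lambda j: j.nodes).nodes += 1     # expand smallest job
    return {j.name: j.nodes for j in jobs}

jobs = [MalleableJob("lbm", 8, 2, 16), MalleableJob("amr", 6, 2, 8)]
print(enforce_corridor(jobs, low_w=3000, high_w=3600))
# -> {'lbm': 6, 'amr': 6}: 12 nodes at 3600 W, inside the corridor
```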
{"title":"Extending SLURM for Dynamic Resource-Aware Adaptive Batch Scheduling","authors":"Mohak Chadha, Jophin John, M. Gerndt","doi":"10.1109/HiPC50609.2020.00036","DOIUrl":"https://doi.org/10.1109/HiPC50609.2020.00036","url":null,"abstract":"With the growing constraints on power budget and increasing hardware failure rates, the operation of future exascale systems faces several challenges. Towards this, resource awareness and adaptivity by enabling malleable jobs has been actively researched in the HPC community. Malleable jobs can change their computing resources at runtime and can significantly improve HPC system performance. However, due to the rigid nature of popular parallel programming paradigms such as MPI and lack of support for dynamic resource management in batch systems, malleable jobs have been largely unrealized. In this paper, we extend the SLURM batch system to support the execution and batch scheduling of malleable jobs. The malleable applications are written using a new adaptive parallel paradigm called Invasive MPI which extends the MPI standard to support resource-adaptivity at runtime. We propose two malleable job scheduling strategies to support performance-aware and power-aware dynamic reconfiguration decisions at runtime. We implement the strategies in SLURM and evaluate them on a production HPC system. Results for our performance-aware scheduling strategy show improvements in makespan, average system utilization, average response, and waiting times as compared to other scheduling strategies. Moreover, we demonstrate dynamic power corridor management using our power-aware strategy.","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114883647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper addresses static resource allocation problems for irregular distributed parallel applications. More precisely, we focus on two classical tiled linear algebra kernels: the matrix multiplication and LU decomposition algorithms on large dense linear systems. In the context of parallel distributed platforms, data exchanges can dramatically degrade the performance of linear algebra kernels, and in this context, compression techniques such as Block Low Rank (BLR) are good candidates both for limiting data storage on each node and data exchanges between nodes. On the other hand, the use of the BLR representation makes the static allocation of tiles to nodes more complex. Indeed, the load associated with each tile depends on its compression factor, which induces a heterogeneous load balancing problem. In turn, solving this load balancing problem optimally might lead to complex allocation schemes, where the tiles allocated to a given node are scattered over the whole matrix. This causes communication problems, since matrix multiplication and LU decomposition rely heavily on broadcast operations along rows and columns of processors, so that the communication volume is minimized when the number of different nodes in each row and column is minimized. In the fully homogeneous case, a 2D block cyclic allocation solves both the load balancing and communication minimization issues simultaneously, but it might lead to bad load balancing in the heterogeneous case. Our goal in this paper is to propose data allocation schemes dedicated to the BLR format and to show that it is possible to obtain good overall performance when simultaneously balancing the load and minimizing the maximal number of different resources in any row or column.
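For reference, the standard 2D block-cyclic mapping that the abstract builds on can be written down in a few lines: tile (i, j) is owned by node (i mod P, j mod Q) of a P x Q grid, so each tile row touches only Q distinct nodes and each tile column only P, which is what keeps row/column broadcasts cheap. The grid and matrix sizes below are arbitrary; the point is the ownership pattern, not the authors' heterogeneous variant.

```python
# Sketch of the standard 2D block-cyclic mapping: tile (i, j) -> node
# (i mod P, j mod Q) of a P x Q node grid. With BLR-compressed tiles of
# unequal cost, this mapping keeps communication low but loses load balance.
P, Q, N = 2, 3, 6                      # 2x3 node grid, 6x6 tile matrix

def owner(i, j):
    return (i % P) * Q + (j % Q)       # node id in 0..P*Q-1

for i in range(N):
    print([owner(i, j) for j in range(N)])

# distinct nodes appearing in tile row 0 and tile column 0:
print(sorted({owner(0, j) for j in range(N)}))   # [0, 1, 2]  -> Q nodes per row
print(sorted({owner(i, 0) for i in range(N)}))   # [0, 3]     -> P nodes per column
```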
{"title":"2D Static Resource Allocation for Compressed Linear Algebra and Communication Constraints","authors":"Olivier Beaumont, Lionel Eyraud-Dubois, Mathieu Vérité","doi":"10.1109/HiPC50609.2020.00032","DOIUrl":"https://doi.org/10.1109/HiPC50609.2020.00032","url":null,"abstract":"This paper adresses static resource allocation problems for irregular distributed parallel applications. More precisely, we focus on two classical tiled linear algebra kernels: the Matrix Multiplication and the LU decomposition algorithms on large dense linear systems. In the context of parallel distributed platforms, data exchanges can dramatically degrade the performance of linear algebra kernels and in this context, compression techniques such as Block Low Rank (BLR) are good candidates both for limiting data storage on each node and data exchanges between nodes. On the other hand, the use of BLR representation makes the static allocation problem of tiles to nodes more complex. Indeed, the load associated to each tile depends on its compression factor, which induces an heterogeneous load balancing problem. In turn, solving this load balancing problem optimally might lead to complex allocation schemes, where the tiles allocated to a given node are scattered on all the matrix. This causes communication complexity problems, since matrix multiplication and LU decompositions rely heavily on broadcasting operations along rows and columns of processors, so that the communication volume is minimized when the number of different nodes on each row and column is minimized. In the fully homogeneous case, 2D block cyclic allocation solves both load balancing and communication minimization issues simultaneously, but it might lead to bad load balancing in the heterogeneous case. Our goal in this paper is to propose data allocation schemes dedicated to BLR format and to prove that it is possible to obtain good overall performance when simultaneously balancing the load and minimizing the maximal number of different resources in any row or column.","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128115724","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards High Performance, Portability, and Productivity: Lightweight Augmented Neural Networks for Performance Prediction
Pub Date: 2020-03-17 | DOI: 10.1109/HiPC50609.2020.00016
Ajitesh Srivastava, Naifeng Zhang, R. Kannan, V. Prasanna
Writing high-performance code requires significant expertise in the programming language, compiler optimizations, and hardware. This often leads to poor productivity and portability and is inconvenient for a non-programmer domain specialist such as a physicist. More desirable is a high-level language in which the domain specialist simply specifies the workload in terms of high-level operations (e.g., matrix-multiply(A, B)), and the compiler identifies the best implementation that fully utilizes the heterogeneous platform. To create a compiler that supports productivity, portability, and performance simultaneously, it is crucial to predict the performance of the various available implementations (variants) of the dominant operations (kernels) contained in the workload on various hardware, in order to decide (a) which variant should be chosen for each kernel in the workload, and (b) on which hardware resource the variant should run. To enable this performance prediction, we propose lightweight augmented neural networks for arbitrary combinations of kernel, variant, and hardware. A key innovation is utilizing the mathematical complexity of the kernels as a feature to achieve higher accuracy. These models are compact, which reduces training time and allows fast inference during compile time and run time. Using models with fewer than 75 parameters and only 250 training data instances, we obtain accurate performance predictions, significantly outperforming traditional feed-forward neural networks on 48 kernel-variant-hardware combinations. We further demonstrate that our variant-selection approach can be used in Halide implementations to obtain up to a 1.7x speedup over the Halide auto-scheduler.
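A hedged illustration of the augmentation idea, feeding an analytical operation count (here 2nmk for a matrix multiply) to a very small network alongside the raw problem sizes, is given below. The data are synthetic, the 4-4-1 architecture (25 trainable parameters) is only chosen to stay under the 75-parameter budget mentioned in the abstract, and none of this reproduces the paper's models or measurements.

```python
# Hedged sketch: augment the input features of a tiny MLP with the kernel's
# analytical complexity (2*n*m*k for GEMM). Data and runtimes are synthetic.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
dims = rng.integers(64, 1024, size=(250, 3)).astype(float)      # (n, m, k) per sample
flops = 2.0 * dims.prod(axis=1)                                  # complexity feature
X = np.column_stack([dims, flops])
y = 1e-7 * flops * rng.normal(1.0, 0.05, size=len(flops))        # synthetic "runtime"

# Standardize inputs and target, then fit a tiny 4-4-1 network:
# 4*4 + 4 + 4*1 + 1 = 25 trainable parameters, well under 75.
Xn = (X - X.mean(axis=0)) / X.std(axis=0)
yn = (y - y.mean()) / y.std()
model = MLPRegressor(hidden_layer_sizes=(4,), max_iter=5000, random_state=0)
model.fit(Xn, yn)
print("training R^2:", round(model.score(Xn, yn), 3))
```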
{"title":"Towards High Performance, Portability, and Productivity: Lightweight Augmented Neural Networks for Performance Prediction","authors":"Ajitesh Srivastava, Naifeng Zhang, R. Kannan, V. Prasanna","doi":"10.1109/HiPC50609.2020.00016","DOIUrl":"https://doi.org/10.1109/HiPC50609.2020.00016","url":null,"abstract":"Writing high-performance code requires significant expertise in the programming language, compiler optimizations, and hardware knowledge. This often leads to poor productivity and portability and is inconvenient for a non-programmer domain-specialist such as a Physicist. More desirable is a high-level language where the domain-specialist simply specifies the workload in terms of high-level operations (e.g., matrix-multiply(A, B)), and the compiler identifies the best implementation fully utilizing the heterogeneous platform. For creating a compiler that supports productivity, portability, and performance simultaneously, it is crucial to predict the performance of various available implementations (variants) of the dominant operations (kernels) contained in the workload on various hardware to decide (a) which variant should be chosen for each kernel in the workload, and (b) on which hardware resource the variant should run. To enable the performance prediction, we propose lightweight augmented neural networks for arbitrary combinations of kernel-variant-hardware. A key innovation is utilizing the mathematical complexity of the kernels as a feature to achieve higher accuracy. These models are compact to reduce training time and allow fast inference during compile-time and run-time. Using models with less than 75 parameters, and only 250 training data instances, we are able to obtain accurate performance predictions, significantly outperforming traditional feed-forward neural networks on 48 kernel-variant-hardware combinations. We further demonstrate that our variant-selection approach can be used in Halide implementations to obtain up to 1.7x speedup over Halide auto-scheduler.","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128938038","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nonblocking Persistent Software Transactional Memory
Pub Date: 2020-02-19 | DOI: 10.1109/HiPC50609.2020.00042
H. A. Beadle, Wentao Cai, Haosen Wen, M. Scott
Newly emerging nonvolatile alternatives to DRAM raise the possibility that applications might compute directly on long-lived data, rather than serializing them to and from a file system or database. To ensure crash consistency, such data must, like a file system or database, provide failure-atomic transactional semantics. Several persistent software transactional memory (STM) systems have been devised to provide these semantics, but only one—the OneFile system of Ramalhete et al.—is nonblocking. Nonblocking progress is desirable to avoid both performance anomalies due to process preemption or failures and deadlock due to priority inversion. Unfortunately, OneFile achieves nonblocking progress at the cost of 2× space overhead, sacrificing much of the cost and density benefit of nonvolatile memory relative to DRAM. OneFile also requires extensive and intrusive changes to data declarations, and works only on a machine with double-width compare-and-swap (CAS) or load-linked/store-conditional (LL/SC) instructions. To address these limitations, we introduce QSTM, a nonblocking persistent STM that requires neither the modification of target data structures nor the availability of a wide CAS instruction. We describe our system, give arguments for safety and liveness, and compare its performance to that of the Mnemosyne and OneFile persistent STM systems. We argue that modest performance costs (within a factor of 2 of OneFile in almost all cases) are easily justified by dramatically lower space overhead and higher programmer convenience.
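The role of CAS width mentioned above can be illustrated conceptually: nonblocking designs often pack a version counter next to a value and swap both atomically, and whether that fits in a single-word CAS or requires a double-width CAS depends on how many bits the value needs. The sketch below emulates a compare-and-swap in Python (which has no hardware CAS) purely to show the retry-loop pattern; it is not QSTM's or OneFile's algorithm.

```python
# Hedged conceptual sketch: pack a version counter and a value into one
# simulated machine word and update it with a (lock-emulated) CAS retry loop.
import threading

_lock = threading.Lock()

def cas(cell, expected, new):
    """Emulated compare-and-swap on a one-element list acting as a memory word."""
    with _lock:
        if cell[0] == expected:
            cell[0] = new
            return True
        return False

VALUE_BITS = 48                              # value in low bits, version in high bits

def pack(version, value):
    return (version << VALUE_BITS) | value

def unpack(word):
    return word >> VALUE_BITS, word & ((1 << VALUE_BITS) - 1)

word = [pack(0, 100)]                        # one simulated word: version 0, value 100

def bump(delta):
    while True:                              # classic nonblocking retry loop
        old = word[0]
        ver, val = unpack(old)
        if cas(word, old, pack(ver + 1, val + delta)):
            return

threads = [threading.Thread(target=bump, args=(1,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(unpack(word[0]))                       # (8, 108)
```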
{"title":"Nonblocking Persistent Software Transactional Memory","authors":"H. A. Beadle, Wentao Cai, Haosen Wen, M. Scott","doi":"10.1109/HiPC50609.2020.00042","DOIUrl":"https://doi.org/10.1109/HiPC50609.2020.00042","url":null,"abstract":"Newly emerging nonvolatile alternatives to DRAM raise the possibility that applications might compute directly on long-lived data, rather than serializing them to and from a file system or database. To ensure crash consistency, such data must, like a file system or database, provide failure-atomic transactional semantics. Several persistent software transactional memory (STM) systems have been devised to provide these semantics, but only one—the OneFile system of Ramalhete et al.—is nonblocking. Nonblocking progress is desirable to avoid both performance anomalies due to process preemption or failures and deadlock due to priority inversion. Unfortunately, OneFile achieves nonblocking progress at the cost of 2 × space overhead, sacrificing much of the cost and density benefit of nonvolatile memory relative to DRAM. OneFile also requires extensive and intrusive changes to data declarations, and works only on a machine with double-width compare-and-swap (CAS) or load-linked/store-conditional (LL/SC) instructions. To address these limitations, we introduce QSTM, a nonblocking persistent STM that requires neither the modification of target data structures nor the availability of a wide CAS instruction. We describe our system, give arguments for safety and liveness, and compare performance to that of the Mnemosyne and OneFile persistent STM systems. We argue that modest performance costs (within a factor of 2 of OneFile in almost all cases) are easily justified by dramatically lower space overhead and higher programmer convenience.","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126141958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}