Processor Pipelining Method for Efficient Deep Neural Network Inference on Embedded Devices
Pub Date: 2020-12-01 | DOI: 10.1109/HiPC50609.2020.00022
Akshay Parashar, Arun Abraham, Deepak Chaudhary, V. N. Rajendiran
Myriad applications of Deep Neural Networks (DNNs) and the race for better accuracy have paved the way for the development of more computationally intensive network architectures. Executing these heavy networks on embedded devices requires highly efficient real-time DNN inference frameworks. However, the sequential architecture of popular DNNs makes it difficult to parallelize their operations across different processors. We propose a novel pipelining method, pluggable on top of conventional inference frameworks, that parallelizes DNN inference on heterogeneous processors without impacting accuracy. We partition the network into subnets by estimating the optimal split points and pipeline these subnets across multiple processors. The results show that the proposed method achieves up to 68% improvement in the frames-per-second (FPS) rate of popular network architectures such as VGG19, DenseNet-121, and ResNet-152. Moreover, we show that our method can extract even more performance out of high-performance chipsets by better utilizing their AI processor ecosystems. We also show that our method can be easily extended to low-performance chipsets, where this additional performance gain is crucial for deploying real-time AI applications. Our results show a performance improvement of up to 47% in the FPS rate on these chipsets without the need for specialized AI hardware.
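The core idea of splitting the network at an estimated split point and pipelining the resulting subnets can be illustrated with a small, self-contained sketch. Everything below (the two-subnet split, the simulated 10 ms stage latencies, the queue-based stage workers) is an illustrative assumption rather than the paper's implementation; in a real deployment each subnet would run on a different processor through the underlying inference framework.

```python
# Hedged sketch: pipelining a DNN split into two subnets across two
# (hypothetical) processors using worker threads and queues. Split point,
# per-subnet latencies, and processor mapping are illustrative assumptions.
import queue
import threading
import time

def subnet_a(frame):           # e.g. layers 1..k, mapped to processor A
    time.sleep(0.010)          # pretend compute time
    return f"features({frame})"

def subnet_b(features):        # e.g. layers k+1..N, mapped to processor B
    time.sleep(0.010)
    return f"prediction({features})"

def stage(fn, q_in, q_out):
    while True:
        item = q_in.get()
        if item is None:       # poison pill terminates the stage
            q_out.put(None)
            return
        q_out.put(fn(item))

q01, q12, q_out = queue.Queue(maxsize=2), queue.Queue(maxsize=2), queue.Queue()
threading.Thread(target=stage, args=(subnet_a, q01, q12), daemon=True).start()
threading.Thread(target=stage, args=(subnet_b, q12, q_out), daemon=True).start()

start = time.time()
for frame in range(100):
    q01.put(frame)             # frame t+1 enters stage A while frame t is in stage B
q01.put(None)
results = []
while (r := q_out.get()) is not None:
    results.append(r)
print(f"{len(results)} frames in {time.time() - start:.2f}s")
```

Because frame t+1 enters the first subnet while frame t is still in the second, steady-state throughput approaches that of the slower stage rather than the sum of both stage latencies, which is where the FPS gain comes from.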
{"title":"Processor Pipelining Method for Efficient Deep Neural Network Inference on Embedded Devices","authors":"Akshay Parashar, Arun Abraham, Deepak Chaudhary, V. N. Rajendiran","doi":"10.1109/HiPC50609.2020.00022","DOIUrl":"https://doi.org/10.1109/HiPC50609.2020.00022","url":null,"abstract":"Myriad applications of Deep Neural Networks (DNN) and the race for better accuracy have paved the way for the development of more computationally intensive network architectures. Execution of these heavy networks on embedded devices needs highly efficient real-time DNN inference frameworks. But the sequential architecture of popular DNNs makes it difficult to parallelize its operations among different processors. We propose a novel pipelining method pluggable on top of conventional inference frameworks and capable of parallelizing DNN inference on heterogeneous processors without impacting the accuracy. We partition the network into subnets, by estimating the optimal split points, and pipeline these subnets across multiple processors. The results shows that the proposed method achieves up to 68% improvement in the frames per second (FPS) rate of popular network architectures like VGG19, DenseNet-121 and ResNet-152. Moreover, we show that our method can be used to extract even more performance out of high performance chipsets, by better utilizing the capabilities of its AI processor ecosystem. We also showcase that our method can be easily extended to other low performance chipsets, where this additional performance gain is crucial to deploy real-time AI applications. Our results show performance improvement of up to 47% in the FPS rate on these chipsets without the need of specialized AI hardware.","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115450802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
AMCilk: A Framework for Multiprogrammed Parallel Workloads
Pub Date: 2020-12-01 | DOI: 10.1109/HiPC50609.2020.00035
Zhe Wang, Chen Xu, Kunal Agrawal, Jing Li
Modern parallel platforms, such as clouds or servers, are often shared among many different jobs. However, existing parallel programming runtime systems are designed and optimized for running a single parallel job, so it is generally hard to use them directly to schedule multiple parallel jobs without incurring high overhead and inefficiency. In this work, we develop AMCilk (Adaptive Multiprogrammed Cilk), a novel runtime system framework designed to support multiprogrammed parallel workloads. AMCilk has a client-server architecture in which users can dynamically submit parallel jobs to the system. A single runtime system runs these jobs while dynamically reallocating cores, last-level cache, and memory bandwidth among them according to the scheduling policy. AMCilk exposes an interface to the system designer that makes it easy to build different scheduling policies meeting the requirements of various application scenarios and performance metrics, while AMCilk transparently (to designers) enforces the scheduling policy. The primary feature of AMCilk is its low-overhead, responsive preemption mechanism that allows fast reallocation of cores between jobs. Our empirical evaluation indicates that AMCilk incurs small overheads and, thanks to this fast core reallocation mechanism, provides significant benefits on application-specific criteria for a set of four practical applications.
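A minimal sketch of what a pluggable scheduling-policy interface of this kind might look like is given below. The class and method names (Job, EqualShare, Runtime.submit) are hypothetical and chosen for illustration; they are not AMCilk's actual API, and real reallocation would also cover last-level cache and memory bandwidth, not just cores.

```python
# Hedged sketch of a pluggable scheduling-policy interface in the spirit of
# the abstract: the designer supplies a policy object, and the runtime calls
# it at job arrival to obtain a new core allocation. Names are illustrative.
from dataclasses import dataclass

@dataclass
class Job:
    job_id: int
    demand: int        # cores the job could productively use

class EqualShare:
    """Give every active job an equal share of the machine's cores."""
    def allocate(self, jobs, total_cores):
        if not jobs:
            return {}
        share = max(1, total_cores // len(jobs))
        return {j.job_id: min(share, j.demand) for j in jobs}

class Runtime:
    def __init__(self, total_cores, policy):
        self.total_cores, self.policy, self.jobs = total_cores, policy, []
    def submit(self, job):
        self.jobs.append(job)
        # reallocation point: the runtime enforces whatever the policy returns
        return self.policy.allocate(self.jobs, self.total_cores)

rt = Runtime(total_cores=16, policy=EqualShare())
print(rt.submit(Job(1, demand=16)))   # {1: 16}
print(rt.submit(Job(2, demand=4)))    # {1: 8, 2: 4}
```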
{"title":"AMCilk: A Framework for Multiprogrammed Parallel Workloads","authors":"Zhe Wang, Chen Xu, Kunal Agrawal, Jing Li","doi":"10.1109/HiPC50609.2020.00035","DOIUrl":"https://doi.org/10.1109/HiPC50609.2020.00035","url":null,"abstract":"Modern parallel platforms, such as clouds or servers, are often shared among many different jobs. However, existing parallel programming runtime systems are designed and optimized for running a single parallel job, so it is generally hard to directly use them to schedule multiple parallel jobs without incurring high overhead and inefficiency. In this work, we develop AMCilk (Adaptive Multiprogrammed Cilk), a novel runtime system framework, designed to support multiprogrammed parallel workloads. AMCilk has client-server architecture where users can dynamically submit parallel jobs to the system. AMCilk has a single runtime system that runs these jobs while dynamically reallocating cores, last-level cache, and memory bandwidth among these jobs according to the scheduling policy. AMCilk exposes the interface to the system designer, which allows the designer to easily build different scheduling policies meeting the requirements of various application scenarios and performance metrics, while AMCilk transparently (to designers) enforces the scheduling policy. The primary feature of AMCilk is the low-overhead and responsive preemption mechanism that allows fast reallocation of cores between jobs. Our empirical evaluation indicates that AMCilk incurs small overheads and provides significant benefits on application-specific criteria for a set of 4 practical applications due to its fast and low-overhead core reallocation mechanism.","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"311 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116488638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Design and Study of Elastic Recovery in HPC Applications
Pub Date: 2020-12-01 | DOI: 10.1109/HiPC50609.2020.00040
Kai Keller, K. Parasyris, L. Bautista-Gomez
The efficient utilization of current supercomputing systems with deep storage hierarchies demands scientific applications that are capable of leveraging such heterogeneous hardware. Fault tolerance, and checkpointing in particular, is one of the most time-consuming aspects if not handled correctly. High checkpoint performance can be achieved using optimized multilevel checkpoint and restart libraries. Unfortunately, those libraries do not allow restarts with a modified number of processes or scientific post-processing of the checkpointed data, because they typically use an N-N checkpointing scheme and opaque file formats. In this article, we present a novel mechanism to asynchronously store checkpoints into a self-descriptive file format and to load the data upon recovery with a different number of processes. We provide an API that defines the process-local data as part of a globally shared dataset. Our measurements demonstrate a low overhead of between 0.6% and 2.5% for a 2.25 TB checkpoint with 6K processes.
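The notion of declaring process-local data as a slice of a globally shared dataset, so that a restart may use a different process count, can be sketched with plain HDF5 via h5py as below. This is only a conceptual illustration under the assumption that h5py is available; the paper's own API, file layout, and asynchronous staging are different.

```python
# Hedged sketch of the N-to-M restart concept: process-local data is described
# as a slice of one global dataset, so the process count may change between
# checkpoint and restart. Plain h5py stands in for the paper's library.
import numpy as np
import h5py

GLOBAL_N = 12

def my_slice(rank, nprocs):
    """Contiguous 1D decomposition of a GLOBAL_N-element global dataset."""
    base, rem = divmod(GLOBAL_N, nprocs)
    lo = rank * base + min(rank, rem)
    return slice(lo, lo + base + (1 if rank < rem else 0))

# Checkpoint written by 4 "processes" (simulated sequentially here).
with h5py.File("ckpt.h5", "w") as f:
    dset = f.create_dataset("temperature", shape=(GLOBAL_N,), dtype="f8")
    for rank in range(4):
        s = my_slice(rank, 4)
        dset[s] = np.full(s.stop - s.start, float(rank))  # process-local data

# Restart with 3 processes: each one reads *its own* slice of the same
# self-descriptive global dataset.
with h5py.File("ckpt.h5", "r") as f:
    for rank in range(3):
        print(rank, f["temperature"][my_slice(rank, 3)])
```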
{"title":"Design and Study of Elastic Recovery in HPC Applications","authors":"Kai Keller, K. Parasyris, L. Bautista-Gomez","doi":"10.1109/HiPC50609.2020.00040","DOIUrl":"https://doi.org/10.1109/HiPC50609.2020.00040","url":null,"abstract":"The efficient utilization of current supercomputing systems with deep storage hierarchies demands scientific applications that are capable of leveraging such heterogeneous hardware. Fault tolerance, and checkpointing in particular, is one of the most time-consuming aspects if not handled correctly. High checkpoint performance can be achieved using optimized multilevel checkpoint and restart libraries. Unfortunately, those libraries do not allow for restarts with a modified number of processes or scientific post-processing of the checkpointed data. This is because they typically use an N-N checkpointing scheme and opaque file-formats. In this article, we present a novel mechanism to asynchronously store checkpoints into a self-descriptive file format and load the data upon recovery with a different number of processes. We provide an API that defines the process-local data as part of a globally shared dataset. Our measurements demonstrate a low overhead between 0.6% and 2.5% for a 2.25 TB checkpoint with 6K processes.","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127432249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Avoiding Communication in Logistic Regression
Pub Date: 2020-11-16 | DOI: 10.1109/HiPC50609.2020.00023
Aditya Devarakonda, J. Demmel
Stochastic gradient descent (SGD) is one of the most widely used optimization methods for solving various machine learning problems. SGD solves an optimization problem by iteratively sampling a few data points from the input data, computing gradients for the selected data points, and updating the solution. However, in a parallel setting, SGD requires interprocess communication at every iteration. We introduce a new communication-avoiding technique for solving the logistic regression problem using SGD. This technique re-organizes the SGD computations into a form that communicates every $s$ iterations instead of every iteration, where $s$ is a tuning parameter. We prove theoretical flops, bandwidth, and latency upper bounds for SGD and its new communication-avoiding variant. Furthermore, we show experimental results illustrating that the new Communication-Avoiding SGD (CA-SGD) method can achieve speedups of up to 4.97× on a high-performance InfiniBand cluster without altering the convergence behavior or accuracy.
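The communication pattern at stake can be illustrated with a small simulated sketch: standard parallel SGD performs one gradient allreduce per iteration, while an s-step variant communicates once every s iterations. Note that the naive deferral below changes the iterates (gradients are computed against a stale w), whereas the paper's reorganization communicates every s iterations without altering convergence behavior; the sketch, with simulated workers and a comm_rounds counter standing in for allreduce calls, only illustrates the reduction in communication rounds.

```python
# Hedged sketch (not the paper's derivation): contrast per-iteration
# communication with communication once every s iterations. Workers are
# simulated in-process with numpy; comm_rounds counts the points where a
# real implementation would perform an allreduce.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def parallel_sgd(X_parts, y_parts, lr=0.1, iters=60, s=1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X_parts[0].shape[1])
    comm_rounds, pending = 0, np.zeros_like(w)
    for t in range(iters):
        # each "worker" computes a gradient on one sampled point of its shard
        grads = []
        for X, y in zip(X_parts, y_parts):
            i = rng.integers(len(y))
            grads.append((sigmoid(X[i] @ w) - y[i]) * X[i])
        pending += np.mean(grads, axis=0)
        if (t + 1) % s == 0:              # communication point
            comm_rounds += 1
            w -= lr * pending
            pending[:] = 0.0
    return w, comm_rounds

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
y = (X @ rng.normal(size=5) > 0).astype(float)
X_parts, y_parts = np.array_split(X, 4), np.array_split(y, 4)
for s in (1, 8):
    _, rounds = parallel_sgd(X_parts, y_parts, s=s)
    print(f"s={s}: {rounds} communication rounds")   # 60 vs 7 rounds
```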
{"title":"Avoiding Communication in Logistic Regression","authors":"Aditya Devarakonda, J. Demmel","doi":"10.1109/HiPC50609.2020.00023","DOIUrl":"https://doi.org/10.1109/HiPC50609.2020.00023","url":null,"abstract":"Stochastic gradient descent (SGD) is one of the most widely used optimization methods for solving various machine learning problems. SGD solves an optimization problem by iteratively sampling a few data points from the input data, computing gradients for the selected data points, and updating the solution. However, in a parallel setting, SGD requires interprocess communication at every iteration. We introduce a new communication-avoiding technique for solving the logistic regression problem using SGD. This technique re-organizes the SGD computations into a form that communicates every $s$ iterations instead of every iteration, where $s$ is a tuning parameter. We prove theoretical flops, bandwidth, and latency upper bounds for SGD and its new communication-avoiding variant. Furthermore, we show experimental results that illustrate that the new Communication-Avoiding SGD (CA-SGD) method can achieve speedups of up to 4.97× on a high-performance Infiniband cluster without altering the convergence behavior or accuracy.","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132912788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Distributing Sparse Matrix/Graph Applications in Heterogeneous Clusters - an Experimental Study
Pub Date: 2020-11-03 | DOI: 10.1109/HiPC50609.2020.00021
Charilaos Tzovas, Maria Predari, Henning Meyerhenke
Many problems in scientific and engineering applications contain sparse matrices or graphs as their main input objects, e.g., numerical simulations on meshes. Large inputs are abundant these days and require parallel processing for reasons of memory size and speed. To optimize the execution of such simulations on cluster systems, the input problem needs to be distributed suitably onto the processing units (PUs). More and more frequently, such clusters contain different CPUs or a combination of CPUs and GPUs. This heterogeneity makes the load distribution problem quite challenging. Our study is motivated by the observation that established partitioning tools do not handle such heterogeneous distribution problems as well as homogeneous ones. In this paper, we first formulate the problem of balanced load distribution for heterogeneous architectures as a multiobjective, single-constraint optimization problem. We then split the problem into two phases and propose a greedy approach to determine optimal block sizes for each PU. These block sizes are then fed into several existing graph partitioners so that we can examine how well they handle the above problem. One of the tools we consider is an extension of our own previous work (von Looz et al., ICPP'18) called Geographer. Our experiments on well-known benchmark meshes indicate that only two of the tools under consideration are able to yield good quality. These two are ParMetis (both the geometric and the combinatorial variant) and Geographer. While ParMetis is faster, Geographer yields better quality on average.
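One plausible greedy rule for the first phase, determining per-PU block sizes from relative PU speeds, is sketched below: each unit of work goes to the PU that would finish its current load first. The rule, the speed values, and the vertex count are illustrative assumptions and not necessarily the authors' exact procedure; the resulting sizes would then be passed as target block weights to a partitioner such as ParMetis or Geographer.

```python
# Hedged sketch of a greedy block-size computation for heterogeneous PUs:
# repeatedly give the next unit of work to the PU with the smallest
# (load + 1) / speed ratio. Illustrative only.
def greedy_block_sizes(total_vertices, speeds):
    loads = [0] * len(speeds)
    for _ in range(total_vertices):
        # next vertex goes to the PU that would finish its new load first
        k = min(range(len(speeds)), key=lambda i: (loads[i] + 1) / speeds[i])
        loads[k] += 1
    return loads

# e.g. 3 CPUs and 1 GPU assumed to be 4x faster than one CPU
sizes = greedy_block_sizes(total_vertices=100_000, speeds=[1, 1, 1, 4])
print(sizes)   # roughly proportional to speed: ~14286 per CPU, ~57142 for the GPU
```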
{"title":"Distributing Sparse Matrix/Graph Applications in Heterogeneous Clusters - an Experimental Study","authors":"Charilaos Tzovas, Maria Predari, Henning Meyerhenke","doi":"10.1109/HiPC50609.2020.00021","DOIUrl":"https://doi.org/10.1109/HiPC50609.2020.00021","url":null,"abstract":"Many problems in scientific and engineering applications contain sparse matrices or graphs as main input objects, e.g., numerical simulations on meshes. Large inputs are abundant these days and require parallel processing for memory size and speed. To optimize the execution of such simulations on cluster systems, the input problem needs to be distributed suitably onto the processing units (PUs). More and more frequently, such clusters contain different CPUs or a combination of CPUs and GPUs. This heterogeneity makes the load distribution problem quite challenging. Our study is motivated by the observation that established partitioning tools do not handle such heterogeneous distribution problems as well as homogeneous ones. In this paper, we first formulate the problem of balanced load distribution for heterogeneous architectures as a multiobjective, single-constraint optimization problem. We then split the problem into two phases and propose a greedy approach to determine optimal block sizes for each PU. These block sizes are then fed into numerous existing graph partitioners, for us to examine how well they handle the above problem. One of the tools we consider is an extension of our own previous work (von Looz et al., ICPP'18) called Geographer. Our experiments on well-known benchmark meshes indicate that only two tools under consideration are able to yield good quality. These two are ParMetis (both the geometric and the combinatorial variant) and Geographer. While ParMetis is faster, Geographer yields better quality on average.","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"118 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133101546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
WarpCore: A Library for fast Hash Tables on GPUs
Pub Date: 2020-09-16 | DOI: 10.1109/HiPC50609.2020.00015
Daniel Jünger, Robin Kobus, André Müller, Christian Hundt, Kai Xu, Weiguo Liu, B. Schmidt
Hash tables are ubiquitous. Properties such as amortized constant time complexity for insertion and querying as well as a compact memory layout make them versatile associative data structures with manifold applications. The rapidly growing amount of data emerging in many fields has motivated the need for accelerated hash tables designed for modern parallel architectures. In this work, we exploit the fast memory interface of modern GPUs together with a parallel hashing scheme tailored to improve global memory access patterns to design WarpCore, a versatile library of hash table data structures. Unique device-sided operations allow for building high-performance data processing pipelines entirely on the GPU. Our implementation achieves up to 1.6 billion inserts and up to 4.3 billion retrievals per second on a single GV100 GPU, thereby outperforming the state-of-the-art solutions cuDPP, SlabHash, and NVIDIA RAPIDS cuDF. This performance advantage becomes even more pronounced at high load factors of over 90%. To overcome the memory limitation of a single GPU, we scale our approach over a dense NVLink topology, which gives us close-to-optimal weak scaling on DGX servers. We further show how WarpCore can be used to accelerate a real-world bioinformatics application (metagenomic classification) with speedups of over two orders of magnitude over state-of-the-art CPU-based solutions. WarpCore is open-source software written in C++/CUDA-C and can be downloaded at https://github.com/sleeepyjack/warpcore.
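The flavor of a bucketed probing scheme, in which a whole bucket of slots is inspected at once (loosely mirroring how a warp can cooperatively probe consecutive slots with coalesced accesses), can be sketched on the CPU with numpy as below. The bucket size, table size, and probing rule are illustrative assumptions; this is not WarpCore's actual data layout, hashing scheme, or API.

```python
# Hedged conceptual sketch: a bucketed hash table where each bucket of 32
# slots is probed as a unit (one slot per "lane"). Plain numpy illustration,
# not WarpCore's implementation.
import numpy as np

EMPTY = np.uint64(0)          # reserve key 0 as the empty marker
BUCKET, NBUCKETS = 32, 1024   # 32 slots per bucket, like one slot per warp lane

keys = np.zeros((NBUCKETS, BUCKET), dtype=np.uint64)
vals = np.zeros((NBUCKETS, BUCKET), dtype=np.uint64)

def insert(k, v):
    b = hash(int(k)) % NBUCKETS
    for probe in range(NBUCKETS):                      # probe bucket-by-bucket
        row = (b + probe) % NBUCKETS
        free = np.flatnonzero(keys[row] == EMPTY)      # "all lanes" checked at once
        if free.size:
            keys[row, free[0]], vals[row, free[0]] = k, v
            return True
    return False                                       # table full

def retrieve(k):
    b = hash(int(k)) % NBUCKETS
    for probe in range(NBUCKETS):
        row = (b + probe) % NBUCKETS
        hit = np.flatnonzero(keys[row] == k)
        if hit.size:
            return vals[row, hit[0]]
        if (keys[row] == EMPTY).any():                 # an empty slot ends the chain
            return None
    return None

insert(42, 7)
print(retrieve(42), retrieve(43))                      # 7 None
```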
{"title":"WarpCore: A Library for fast Hash Tables on GPUs","authors":"Daniel Jünger, Robin Kobus, André Müller, Christian Hundt, Kai Xu, Weiguo Liu, B. Schmidt","doi":"10.1109/HiPC50609.2020.00015","DOIUrl":"https://doi.org/10.1109/HiPC50609.2020.00015","url":null,"abstract":"Hash tables are ubiquitous. Properties such as an amortized constant time complexity for insertion and querying as well as a compact memory layout make them versatile associative data structures with manifold applications. The rapidly growing amount of data emerging in many fields motivated the need for accelerated hash tables designed for modern parallel architectures. In this work, we exploit the fast memory interface of modern GPUs together with a parallel hashing scheme tailored to improve global memory access patterns, to design WarpCore – a versatile library of hash table data structures. Unique device-sided operations allow for building high performance data processing pipelines entirely on the GPU. Our implementation achieves up to 1.6 billion inserts and up to 4.3 billion retrievals per second on a single GV100 GPU thereby outperforming the state-of-the-art solutions cuDPP, SlabHash, and NVIDIA RAPIDS cuDF. This performance advantage becomes even more pronounced for high load factors of over 90%. To overcome the memory limitation of a single GPU, we scale our approach over a dense NVLink topology which gives us close-to-optimal weak scaling on DGX servers. We further show how WarpCore can be used for accelerating a real world bioinformatics application (metagenomic classification) with speedups of over two orders-of-magnitude against state-of-the-art CPU-based solutions. WarpCore is open source software written in C++/CUDA-C and can be downloaded at https://github.com/sleeepyjack/warpcore.","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"642 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116473920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Extending SLURM for Dynamic Resource-Aware Adaptive Batch Scheduling
Pub Date: 2020-09-16 | DOI: 10.1109/HiPC50609.2020.00036
Mohak Chadha, Jophin John, M. Gerndt
With growing constraints on power budgets and increasing hardware failure rates, the operation of future exascale systems faces several challenges. To address this, resource awareness and adaptivity through malleable jobs have been actively researched in the HPC community. Malleable jobs can change their computing resources at runtime and can significantly improve HPC system performance. However, due to the rigid nature of popular parallel programming paradigms such as MPI and the lack of support for dynamic resource management in batch systems, malleable jobs have remained largely unrealized. In this paper, we extend the SLURM batch system to support the execution and batch scheduling of malleable jobs. The malleable applications are written using a new adaptive parallel paradigm called Invasive MPI, which extends the MPI standard to support resource adaptivity at runtime. We propose two malleable job scheduling strategies to support performance-aware and power-aware dynamic reconfiguration decisions at runtime. We implement the strategies in SLURM and evaluate them on a production HPC system. Results for our performance-aware scheduling strategy show improvements in makespan, average system utilization, and average response and waiting times compared to other scheduling strategies. Moreover, we demonstrate dynamic power corridor management using our power-aware strategy.
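A toy sketch of a power-corridor decision of the kind the abstract mentions is shown below: if the estimated system power leaves a [low, high] corridor, malleable jobs are shrunk or expanded one node at a time. The per-node power constant, job names, and shrink/expand heuristics are illustrative assumptions; this is neither the paper's algorithm nor SLURM's API.

```python
# Hedged sketch of power-corridor enforcement over malleable jobs
# (illustrative only): shrink the largest shrinkable job while above the
# upper bound, expand the smallest growable job while below the lower bound.
from dataclasses import dataclass

NODE_POWER_W = 300            # assumed average power draw per allocated node

@dataclass
class MalleableJob:
    name: str
    nodes: int
    min_nodes: int
    max_nodes: int

def enforce_corridor(jobs, low_w, high_w):
    def total():
        return sum(j.nodes for j in jobs) * NODE_POWER_W
    while total() > high_w:
        shrinkable = [j for j in jobs if j.nodes > j.min_nodes]
        if not shrinkable:
            break
        max(shrinkable, key=lambda j: j.nodes).nodes -= 1   # shrink largest job
    while total() < low_w:
        growable = [j for j in jobs if j.nodes < j.max_nodes]
        if not growable:
            break
        min(growable, key=lambda j: j.nodes).nodes += 1     # expand smallest job
    return {j.name: j.nodes for j in jobs}

jobs = [MalleableJob("lbm", 8, 2, 16), MalleableJob("amr", 6, 2, 8)]
print(enforce_corridor(jobs, low_w=3000, high_w=3600))
# -> {'lbm': 6, 'amr': 6}: 12 nodes at 3600 W, inside the corridor
```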
{"title":"Extending SLURM for Dynamic Resource-Aware Adaptive Batch Scheduling","authors":"Mohak Chadha, Jophin John, M. Gerndt","doi":"10.1109/HiPC50609.2020.00036","DOIUrl":"https://doi.org/10.1109/HiPC50609.2020.00036","url":null,"abstract":"With the growing constraints on power budget and increasing hardware failure rates, the operation of future exascale systems faces several challenges. Towards this, resource awareness and adaptivity by enabling malleable jobs has been actively researched in the HPC community. Malleable jobs can change their computing resources at runtime and can significantly improve HPC system performance. However, due to the rigid nature of popular parallel programming paradigms such as MPI and lack of support for dynamic resource management in batch systems, malleable jobs have been largely unrealized. In this paper, we extend the SLURM batch system to support the execution and batch scheduling of malleable jobs. The malleable applications are written using a new adaptive parallel paradigm called Invasive MPI which extends the MPI standard to support resource-adaptivity at runtime. We propose two malleable job scheduling strategies to support performance-aware and power-aware dynamic reconfiguration decisions at runtime. We implement the strategies in SLURM and evaluate them on a production HPC system. Results for our performance-aware scheduling strategy show improvements in makespan, average system utilization, average response, and waiting times as compared to other scheduling strategies. Moreover, we demonstrate dynamic power corridor management using our power-aware strategy.","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114883647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper addresses static resource allocation problems for irregular distributed parallel applications. More precisely, we focus on two classical tiled linear algebra kernels: the matrix multiplication and LU decomposition algorithms on large dense linear systems. In the context of parallel distributed platforms, data exchanges can dramatically degrade the performance of linear algebra kernels, and in this context, compression techniques such as Block Low Rank (BLR) are good candidates both for limiting data storage on each node and data exchanges between nodes. On the other hand, the use of the BLR representation makes the static allocation of tiles to nodes more complex. Indeed, the load associated with each tile depends on its compression factor, which induces a heterogeneous load balancing problem. In turn, solving this load balancing problem optimally might lead to complex allocation schemes, where the tiles allocated to a given node are scattered over the whole matrix. This causes communication problems, since matrix multiplication and LU decomposition rely heavily on broadcast operations along rows and columns of processors, so that the communication volume is minimized when the number of different nodes in each row and column is minimized. In the fully homogeneous case, a 2D block cyclic allocation solves both the load balancing and communication minimization issues simultaneously, but it might lead to bad load balancing in the heterogeneous case. Our goal in this paper is to propose data allocation schemes dedicated to the BLR format and to show that it is possible to obtain good overall performance when simultaneously balancing the load and minimizing the maximal number of different resources in any row or column.
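For reference, the standard 2D block-cyclic mapping that the abstract builds on can be written down in a few lines: tile (i, j) is owned by node (i mod P, j mod Q) of a P x Q grid, so each tile row touches only Q distinct nodes and each tile column only P, which is what keeps row/column broadcasts cheap. The grid and matrix sizes below are arbitrary; the point is the ownership pattern, not the authors' heterogeneous variant.

```python
# Sketch of the standard 2D block-cyclic mapping: tile (i, j) -> node
# (i mod P, j mod Q) of a P x Q node grid. With BLR-compressed tiles of
# unequal cost, this mapping keeps communication low but loses load balance.
P, Q, N = 2, 3, 6                      # 2x3 node grid, 6x6 tile matrix

def owner(i, j):
    return (i % P) * Q + (j % Q)       # node id in 0..P*Q-1

for i in range(N):
    print([owner(i, j) for j in range(N)])

# distinct nodes appearing in tile row 0 and tile column 0:
print(sorted({owner(0, j) for j in range(N)}))   # [0, 1, 2]  -> Q nodes per row
print(sorted({owner(i, 0) for i in range(N)}))   # [0, 3]     -> P nodes per column
```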
{"title":"2D Static Resource Allocation for Compressed Linear Algebra and Communication Constraints","authors":"Olivier Beaumont, Lionel Eyraud-Dubois, Mathieu Vérité","doi":"10.1109/HiPC50609.2020.00032","DOIUrl":"https://doi.org/10.1109/HiPC50609.2020.00032","url":null,"abstract":"This paper adresses static resource allocation problems for irregular distributed parallel applications. More precisely, we focus on two classical tiled linear algebra kernels: the Matrix Multiplication and the LU decomposition algorithms on large dense linear systems. In the context of parallel distributed platforms, data exchanges can dramatically degrade the performance of linear algebra kernels and in this context, compression techniques such as Block Low Rank (BLR) are good candidates both for limiting data storage on each node and data exchanges between nodes. On the other hand, the use of BLR representation makes the static allocation problem of tiles to nodes more complex. Indeed, the load associated to each tile depends on its compression factor, which induces an heterogeneous load balancing problem. In turn, solving this load balancing problem optimally might lead to complex allocation schemes, where the tiles allocated to a given node are scattered on all the matrix. This causes communication complexity problems, since matrix multiplication and LU decompositions rely heavily on broadcasting operations along rows and columns of processors, so that the communication volume is minimized when the number of different nodes on each row and column is minimized. In the fully homogeneous case, 2D block cyclic allocation solves both load balancing and communication minimization issues simultaneously, but it might lead to bad load balancing in the heterogeneous case. Our goal in this paper is to propose data allocation schemes dedicated to BLR format and to prove that it is possible to obtain good overall performance when simultaneously balancing the load and minimizing the maximal number of different resources in any row or column.","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128115724","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards High Performance, Portability, and Productivity: Lightweight Augmented Neural Networks for Performance Prediction
Pub Date: 2020-03-17 | DOI: 10.1109/HiPC50609.2020.00016
Ajitesh Srivastava, Naifeng Zhang, R. Kannan, V. Prasanna
Writing high-performance code requires significant expertise in the programming language, compiler optimizations, and hardware. This often leads to poor productivity and portability and is inconvenient for a non-programmer domain specialist such as a physicist. More desirable is a high-level language in which the domain specialist simply specifies the workload in terms of high-level operations (e.g., matrix-multiply(A, B)), and the compiler identifies the best implementation that fully utilizes the heterogeneous platform. To create a compiler that supports productivity, portability, and performance simultaneously, it is crucial to predict the performance of the various available implementations (variants) of the dominant operations (kernels) contained in the workload on various hardware, in order to decide (a) which variant should be chosen for each kernel in the workload, and (b) on which hardware resource the variant should run. To enable this performance prediction, we propose lightweight augmented neural networks for arbitrary combinations of kernel, variant, and hardware. A key innovation is utilizing the mathematical complexity of the kernels as a feature to achieve higher accuracy. These models are compact, which reduces training time and allows fast inference during compile time and run time. Using models with fewer than 75 parameters and only 250 training data instances, we obtain accurate performance predictions, significantly outperforming traditional feed-forward neural networks on 48 kernel-variant-hardware combinations. We further demonstrate that our variant-selection approach can be used in Halide implementations to obtain up to a 1.7x speedup over the Halide auto-scheduler.
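A hedged illustration of the augmentation idea, feeding an analytical operation count (here 2nmk for a matrix multiply) to a very small network alongside the raw problem sizes, is given below. The data are synthetic, the 4-4-1 architecture (25 trainable parameters) is only chosen to stay under the 75-parameter budget mentioned in the abstract, and none of this reproduces the paper's models or measurements.

```python
# Hedged sketch: augment the input features of a tiny MLP with the kernel's
# analytical complexity (2*n*m*k for GEMM). Data and runtimes are synthetic.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
dims = rng.integers(64, 1024, size=(250, 3)).astype(float)      # (n, m, k) per sample
flops = 2.0 * dims.prod(axis=1)                                  # complexity feature
X = np.column_stack([dims, flops])
y = 1e-7 * flops * rng.normal(1.0, 0.05, size=len(flops))        # synthetic "runtime"

# Standardize inputs and target, then fit a tiny 4-4-1 network:
# 4*4 + 4 + 4*1 + 1 = 25 trainable parameters, well under 75.
Xn = (X - X.mean(axis=0)) / X.std(axis=0)
yn = (y - y.mean()) / y.std()
model = MLPRegressor(hidden_layer_sizes=(4,), max_iter=5000, random_state=0)
model.fit(Xn, yn)
print("training R^2:", round(model.score(Xn, yn), 3))
```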
{"title":"Towards High Performance, Portability, and Productivity: Lightweight Augmented Neural Networks for Performance Prediction","authors":"Ajitesh Srivastava, Naifeng Zhang, R. Kannan, V. Prasanna","doi":"10.1109/HiPC50609.2020.00016","DOIUrl":"https://doi.org/10.1109/HiPC50609.2020.00016","url":null,"abstract":"Writing high-performance code requires significant expertise in the programming language, compiler optimizations, and hardware knowledge. This often leads to poor productivity and portability and is inconvenient for a non-programmer domain-specialist such as a Physicist. More desirable is a high-level language where the domain-specialist simply specifies the workload in terms of high-level operations (e.g., matrix-multiply(A, B)), and the compiler identifies the best implementation fully utilizing the heterogeneous platform. For creating a compiler that supports productivity, portability, and performance simultaneously, it is crucial to predict the performance of various available implementations (variants) of the dominant operations (kernels) contained in the workload on various hardware to decide (a) which variant should be chosen for each kernel in the workload, and (b) on which hardware resource the variant should run. To enable the performance prediction, we propose lightweight augmented neural networks for arbitrary combinations of kernel-variant-hardware. A key innovation is utilizing the mathematical complexity of the kernels as a feature to achieve higher accuracy. These models are compact to reduce training time and allow fast inference during compile-time and run-time. Using models with less than 75 parameters, and only 250 training data instances, we are able to obtain accurate performance predictions, significantly outperforming traditional feed-forward neural networks on 48 kernel-variant-hardware combinations. We further demonstrate that our variant-selection approach can be used in Halide implementations to obtain up to 1.7x speedup over Halide auto-scheduler.","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128938038","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nonblocking Persistent Software Transactional Memory
Pub Date: 2020-02-19 | DOI: 10.1109/HiPC50609.2020.00042
H. A. Beadle, Wentao Cai, Haosen Wen, M. Scott
Newly emerging nonvolatile alternatives to DRAM raise the possibility that applications might compute directly on long-lived data, rather than serializing them to and from a file system or database. To ensure crash consistency, such data must, like a file system or database, provide failure-atomic transactional semantics. Several persistent software transactional memory (STM) systems have been devised to provide these semantics, but only one—the OneFile system of Ramalhete et al.—is nonblocking. Nonblocking progress is desirable to avoid both performance anomalies due to process preemption or failures and deadlock due to priority inversion. Unfortunately, OneFile achieves nonblocking progress at the cost of 2× space overhead, sacrificing much of the cost and density benefit of nonvolatile memory relative to DRAM. OneFile also requires extensive and intrusive changes to data declarations, and works only on a machine with double-width compare-and-swap (CAS) or load-linked/store-conditional (LL/SC) instructions. To address these limitations, we introduce QSTM, a nonblocking persistent STM that requires neither the modification of target data structures nor the availability of a wide CAS instruction. We describe our system, give arguments for safety and liveness, and compare its performance to that of the Mnemosyne and OneFile persistent STM systems. We argue that modest performance costs (within a factor of 2 of OneFile in almost all cases) are easily justified by dramatically lower space overhead and higher programmer convenience.
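The role of CAS width mentioned above can be illustrated conceptually: nonblocking designs often pack a version counter next to a value and swap both atomically, and whether that fits in a single-word CAS or requires a double-width CAS depends on how many bits the value needs. The sketch below emulates a compare-and-swap in Python (which has no hardware CAS) purely to show the retry-loop pattern; it is not QSTM's or OneFile's algorithm.

```python
# Hedged conceptual sketch: pack a version counter and a value into one
# simulated machine word and update it with a (lock-emulated) CAS retry loop.
import threading

_lock = threading.Lock()

def cas(cell, expected, new):
    """Emulated compare-and-swap on a one-element list acting as a memory word."""
    with _lock:
        if cell[0] == expected:
            cell[0] = new
            return True
        return False

VALUE_BITS = 48                              # value in low bits, version in high bits

def pack(version, value):
    return (version << VALUE_BITS) | value

def unpack(word):
    return word >> VALUE_BITS, word & ((1 << VALUE_BITS) - 1)

word = [pack(0, 100)]                        # one simulated word: version 0, value 100

def bump(delta):
    while True:                              # classic nonblocking retry loop
        old = word[0]
        ver, val = unpack(old)
        if cas(word, old, pack(ver + 1, val + delta)):
            return

threads = [threading.Thread(target=bump, args=(1,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(unpack(word[0]))                       # (8, 108)
```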
{"title":"Nonblocking Persistent Software Transactional Memory","authors":"H. A. Beadle, Wentao Cai, Haosen Wen, M. Scott","doi":"10.1109/HiPC50609.2020.00042","DOIUrl":"https://doi.org/10.1109/HiPC50609.2020.00042","url":null,"abstract":"Newly emerging nonvolatile alternatives to DRAM raise the possibility that applications might compute directly on long-lived data, rather than serializing them to and from a file system or database. To ensure crash consistency, such data must, like a file system or database, provide failure-atomic transactional semantics. Several persistent software transactional memory (STM) systems have been devised to provide these semantics, but only one—the OneFile system of Ramalhete et al.—is nonblocking. Nonblocking progress is desirable to avoid both performance anomalies due to process preemption or failures and deadlock due to priority inversion. Unfortunately, OneFile achieves nonblocking progress at the cost of 2 × space overhead, sacrificing much of the cost and density benefit of nonvolatile memory relative to DRAM. OneFile also requires extensive and intrusive changes to data declarations, and works only on a machine with double-width compare-and-swap (CAS) or load-linked/store-conditional (LL/SC) instructions. To address these limitations, we introduce QSTM, a nonblocking persistent STM that requires neither the modification of target data structures nor the availability of a wide CAS instruction. We describe our system, give arguments for safety and liveness, and compare performance to that of the Mnemosyne and OneFile persistent STM systems. We argue that modest performance costs (within a factor of 2 of OneFile in almost all cases) are easily justified by dramatically lower space overhead and higher programmer convenience.","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126141958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}