Portal: A High-Performance Language and Compiler for Parallel N-Body Problems
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00106
Laleh Aghababaie Beni, S. Ramanan, Aparna Chandramowlishwaran
There is a big gap between the algorithm one designs on paper and the code that runs efficiently on a parallel system. Our goal is to combine the body of work in compilers, performance optimization, and the domain of N-body problems to build a system where domain scientists can write programs at a high level while attaining the performance of code written by experts at a low level. This paper presents Portal, a domain-specific language and compiler designed to enable high-performance implementations of N-body problems on modern multicore systems. Our goals in developing Portal are three-fold: (a) to implement scalable, fast algorithms with O(n log n) and O(n) complexity, (b) to design an intuitive language that enables rapid implementations of a variety of problems, and (c) to enable large-scale parallel problems to run on multicore systems. We target N-body problems in various domains, from machine learning to scientific computing, that can be expressed in Portal to obtain an out-of-the-box optimized parallel implementation. Experimental results on 6 N-body problems show that Portal is within 5% on average of expert hand-optimized C++ code on a dual-socket AMD EPYC processor. To our knowledge, no existing library or framework implements parallel asymptotically optimal algorithms for the class of generalized N-body problems, and Portal aims to fill this gap. Moreover, the Portal language and intermediate algorithm representation are portable and easily extensible to different platforms.
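For context, the kind of computation Portal abstracts away is the naive O(n^2) direct evaluation of pairwise interactions; the sketch below is illustrative C++ (not Portal syntax, and the `Body` struct and function name are made up here), showing the baseline that tree-based O(n log n)/O(n) algorithms replace.

```cpp
#include <cmath>
#include <vector>

struct Body { double x, y, z, mass; };

// Direct O(n^2) pairwise evaluation of a 1/r interaction kernel.
// Asymptotically optimal N-body codes avoid this quadratic cost by
// approximating far-field interactions; this is only the reference baseline.
std::vector<double> direct_potential(const std::vector<Body>& bodies) {
    std::vector<double> phi(bodies.size(), 0.0);
    for (std::size_t i = 0; i < bodies.size(); ++i) {
        for (std::size_t j = 0; j < bodies.size(); ++j) {
            if (i == j) continue;
            double dx = bodies[i].x - bodies[j].x;
            double dy = bodies[i].y - bodies[j].y;
            double dz = bodies[i].z - bodies[j].z;
            double r = std::sqrt(dx * dx + dy * dy + dz * dz);
            phi[i] += bodies[j].mass / r;  // accumulate potential at body i
        }
    }
    return phi;
}
```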
{"title":"Portal: A High-Performance Language and Compiler for Parallel N-Body Problems","authors":"Laleh Aghababaie Beni, S. Ramanan, Aparna Chandramowlishwaran","doi":"10.1109/IPDPS.2019.00106","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00106","url":null,"abstract":"There is a big gap between the algorithm one designs on paper and the code that runs efficiently on a parallel system. Our goal is to combine the body of work in compilers, performance optimization, and the domain of N-body problems to build a system where domain scientists can write programs at a high level while attaining performance of code written by experts at the low level. This paper presents Portal, a domain-specific language and compiler designed to enable high-performance implementations of N-body problems on modern multicore systems. Our goal in the development of Portal is three-fold, (a) to implement scalable, fast algorithms that have O(n log n) and O (n) complexity, (b) to design an intuitive language to enable rapid implementations of a variety of problems, and (c) to enable parallel large-scale problems to run on multicore systems. We target N-body problems in various domains from machine learning to scientific computing that can be expressed in Portal to obtain an out-of-the-box optimized parallel implementation. Experimental results on 6 N-body problems show that Portal is within a factor of 5% on average of expert hand-optimized C++ code on a dual-socket AMD EPYC processor. To our knowledge, there are no known libraries or frameworks that implement parallel asymptotically optimal algorithms for the class of generalized N-body problems and Portal aims to fill this gap. Moreover, the Portal language and intermediate algorithm representation are portable and easily extensible to different platforms.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128921454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CuSP: A Customizable Streaming Edge Partitioner for Distributed Graph Analytics
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00054
Loc Hoang, Roshan Dathathri, G. Gill, K. Pingali
Graph analytics systems must analyze graphs with billions of vertices and edges, which require several terabytes of storage. Distributed-memory clusters are often used for analyzing such large graphs since the main memory of a single machine is usually restricted to a few hundred gigabytes. This requires partitioning the graph among the machines in the cluster. Existing graph analytics systems usually come with a built-in partitioner that incorporates a particular partitioning policy, but the best partitioning policy depends on the algorithm, input graph, and platform. Therefore, built-in partitioners are not sufficiently flexible. Stand-alone graph partitioners are available, but they too implement only a small number of partitioning policies. This paper presents CuSP, a fast streaming edge partitioning framework that permits users to specify the desired partitioning policy at a high level of abstraction and quickly generates high-quality graph partitions. For example, it can partition wdc12, the largest publicly available web-crawl graph, with 4 billion vertices and 129 billion edges, in under 2 minutes on a cluster of 128 machines. Our experiments show that it can produce quality partitions 6× faster on average than the state-of-the-art stand-alone partitioner in the literature while supporting a wider range of partitioning policies.
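To make the idea of a user-specified streaming policy concrete, here is a small C++ sketch of one possible policy interface: each edge is assigned to a host exactly once as it streams by. The `EdgePolicy` type, `sourceOwnedPolicy` example, and `partitionStream` loop are illustrative assumptions, not CuSP's actual API.

```cpp
#include <cstdint>
#include <functional>

struct Edge { uint64_t src, dst; };

// A partitioning policy maps an incoming edge to one of `numHosts` partitions.
using EdgePolicy = std::function<uint32_t(const Edge&, uint32_t numHosts)>;

// Example policy: place every edge on the host that owns its source vertex,
// where ownership is determined by a simple hash of the vertex id.
uint32_t sourceOwnedPolicy(const Edge& e, uint32_t numHosts) {
    return static_cast<uint32_t>(std::hash<uint64_t>{}(e.src) % numHosts);
}

// Streaming loop: each edge is read once and assigned immediately.
template <typename EdgeStream, typename Assign>
void partitionStream(EdgeStream& stream, uint32_t numHosts,
                     const EdgePolicy& policy, Assign assign) {
    Edge e;
    while (stream.next(e)) {
        assign(e, policy(e, numHosts));  // e.g., buffer edge for the chosen host
    }
}
```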
{"title":"CuSP: A Customizable Streaming Edge Partitioner for Distributed Graph Analytics","authors":"Loc Hoang, Roshan Dathathri, G. Gill, K. Pingali","doi":"10.1109/IPDPS.2019.00054","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00054","url":null,"abstract":"Graph analytics systems must analyze graphs with billions of vertices and edges which require several terabytes of storage. Distributed-memory clusters are often used for analyzing such large graphs since the main memory of a single machine is usually restricted to a few hundreds of gigabytes. This requires partitioning the graph among the machines in the cluster. Existing graph analytics systems usually come with a built-in partitioner that incorporates a particular partitioning policy, but the best partitioning policy is dependent on the algorithm, input graph, and platform. Therefore, built-in partitioners are not sufficiently flexible. Stand-alone graph partitioners are available, but they too implement only a small number of partitioning policies. This paper presents CuSP, a fast streaming edge partitioning framework which permits users to specify the desired partitioning policy at a high level of abstraction and generates high-quality graph partitions fast. For example, it can partition wdc12, the largest publicly available web-crawl graph, with 4 billion vertices and 129 billion edges, in under 2 minutes for clusters with 128 machines. Our experiments show that it can produce quality partitions 6× faster on average than the state-of-the-art stand-alone partitioner in the literature while supporting a wider range of partitioning policies.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129116289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Excavating the Potential of GPU for Accelerating Graph Traversal
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00032
Pengyu Wang, Lu Zhang, Chao Li, M. Guo
Graph traversal is an essential procedure for a growing number of applications today. These algorithms typically iterate over the input graph until convergence, and the logic of each iteration is quite simple. GPUs are used extensively as graph traversal accelerators thanks to their massive parallelism and high-bandwidth memory access. However, existing methods are inefficient in two ways. First, streaming multiprocessors (SMs) remain underutilized due to unbalanced load allocation and uncoalesced memory access. Second, they use space-inefficient data structures or need auxiliary data to assist traversal, which is undesirable given the limited GPU memory capacity. Moreover, existing designs commonly focus on optimizing kernel execution time, yet data-transfer time is also notable in the whole procedure. Thus, both space-efficient data structures and the data-transfer policy must be considered. In this paper, we propose EtaGraph, a novel GPU graph traversal framework optimized for the GPU memory system and execution parallelism. EtaGraph has several features: (1) it uses a frontier-like kernel execution model featuring a lightweight graph transformation procedure, named Unified Degree Cut, that allows GPU threads to process skewed graphs efficiently without modifying the raw data or introducing extra space overhead; (2) it uses on-demand data transfer to overlap computation, optimizing the total time of data transfer and execution; (3) it adopts explicit use of shared memory to enhance memory coalescing and improve effective memory bandwidth. Evaluation of EtaGraph shows significant and consistent speedups over state-of-the-art GPU-based graph processing frameworks on both real-world and synthetic graphs.
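The "frontier-like" execution model mentioned above can be illustrated with a sequential C++ sketch over a CSR graph: each iteration expands the current frontier and produces the next one. GPU frameworks like EtaGraph map this same pattern onto thousands of threads; the code below is only a host-side illustration of the model, not EtaGraph's implementation.

```cpp
#include <cstdint>
#include <vector>

// Compressed Sparse Row graph: row_offsets has n+1 entries, col_indices has m.
struct CSRGraph {
    std::vector<uint32_t> row_offsets;
    std::vector<uint32_t> col_indices;
};

// One BFS-style frontier iteration: visit every neighbor of every frontier
// vertex and collect the newly discovered vertices as the next frontier.
std::vector<uint32_t> advanceFrontier(const CSRGraph& g,
                                      const std::vector<uint32_t>& frontier,
                                      std::vector<bool>& visited) {
    std::vector<uint32_t> next;
    for (uint32_t u : frontier) {
        for (uint32_t e = g.row_offsets[u]; e < g.row_offsets[u + 1]; ++e) {
            uint32_t v = g.col_indices[e];
            if (!visited[v]) { visited[v] = true; next.push_back(v); }
        }
    }
    return next;
}
```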
{"title":"Excavating the Potential of GPU for Accelerating Graph Traversal","authors":"Pengyu Wang, Lu Zhang, Chao Li, M. Guo","doi":"10.1109/IPDPS.2019.00032","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00032","url":null,"abstract":"Graph traversal is an essential procedure for a growing amount of applications today. This type of algorithms typically iterate input graph datasets until convergence and the logic of each iteration is quite simple. GPUs are used extensively as graph traversal accelerators due to the capability of massive parallelism and high-bandwidth memory access. However, existing methods are inefficient in two ways. First, streaming multiprocessors (SMs) are still underutilized due to the unbalanced load allocation and uncoalesced memory access. Second, they use space-inefficient data structures or need auxiliary data to assist traversal. It is undesirable, considering the limited GPU memory capacity. Moreover, existing designs commonly focus on optimizing kernel execution time. Data-transfer time is also notable in the whole procedure. Thus, space-efficient data structure and data-transfer policy should be concerned. In this paper, we propose EtaGraph, a novel GPU graph traversal framework optimized for GPU memory system and execution parallelism. EtaGraph has several features: 1). It uses a frontier-like kernel execution model, featuring a lightweight graph transformation procedure, named Unified Degree Cut, allowing GPU threads to process skewed graph efficiently without modification of raw data or introducing extra space overhead; 2). It uses on-demand data-transfer to overlap computation so that it optimizes the total time of data-transfer and execution; 3). It adopts an explicit utilization of Shared Memory to enhance memory coalescing and to improve effective memory bandwidth. Evaluation of EtaGraph shows significant and consistent speedups over the state-of-the-art GPU-based graph processing frameworks on both real-world and synthetic graphs.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131684352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploiting Adaptive Data Compression to Improve Performance and Energy-Efficiency of Compute Workloads in Multi-GPU Systems
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00075
Mohammad Khavari Tavana, Yifan Sun, Nicolas Bohm Agostini, D. Kaeli
Graphics Processing Unit (GPU) performance has relied heavily on our ability to scale the number of transistors on a chip in order to satisfy the ever-increasing demand for more computation. However, transistor scaling has become extremely challenging, limiting the number of transistors that can be crammed onto a single die. Manufacturing large, fast and energy-efficient monolithic GPUs, while growing the number of stream processing units on-chip, is no longer a viable solution to scale performance. GPU vendors instead aim to exploit multi-GPU solutions, interconnecting multiple GPUs in a single node with a high-bandwidth network (such as NVLink), or exploiting Multi-Chip-Module (MCM) packaging, where multiple GPU modules are integrated in a single package. Inter-GPU bandwidth is an expensive and critical resource for designing multi-GPU systems, and the design of the inter-GPU network can impact performance significantly. To address this challenge, in this paper we explore the potential of hardware-based memory compression algorithms to save bandwidth and improve energy efficiency in multi-GPU systems. Specifically, we propose an adaptive inter-GPU data compression scheme to efficiently improve both performance and energy efficiency. Our evaluation shows that the proposed optimization on multi-GPU architectures can reduce inter-GPU traffic by up to 62%, improve system performance by up to 33%, and save energy spent powering the communication fabric by 45% on average.
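The word "adaptive" here means compressing only when it actually pays off. As a toy illustration (not the paper's hardware design, and all names and numbers below are assumptions), a software cost model for that decision might compare the raw transfer time against compression plus the smaller transfer plus decompression:

```cpp
#include <cstddef>

// Toy cost model for deciding whether to compress an inter-GPU transfer.
struct LinkModel {
    double bytes_per_sec;             // inter-GPU link bandwidth
    double compress_bytes_per_sec;    // throughput of the compressor
    double decompress_bytes_per_sec;  // throughput of the decompressor
};

bool shouldCompress(std::size_t message_bytes, double expected_ratio,
                    const LinkModel& link) {
    double raw_time = message_bytes / link.bytes_per_sec;
    double compressed_bytes = message_bytes / expected_ratio;
    double compressed_time = message_bytes / link.compress_bytes_per_sec
                           + compressed_bytes / link.bytes_per_sec
                           + compressed_bytes / link.decompress_bytes_per_sec;
    return compressed_time < raw_time;  // compress only when it saves time
}
```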
{"title":"Exploiting Adaptive Data Compression to Improve Performance and Energy-Efficiency of Compute Workloads in Multi-GPU Systems","authors":"Mohammad Khavari Tavana, Yifan Sun, Nicolas Bohm Agostini, D. Kaeli","doi":"10.1109/IPDPS.2019.00075","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00075","url":null,"abstract":"Graphics Processing Unit (GPU) performance has relied heavily on our ability to scale of number of transistors on chip, in order to satisfy the ever-increasing demands for more computation. However, transistor scaling has become extremely challenging, limiting the number of transistors that can be crammed onto a single die. Manufacturing large, fast and energy-efficient monolithic GPUs, while growing the number of stream processing units on-chip, is no longer a viable solution to scale performance. GPU vendors are aiming to exploit multi-GPU solutions, interconnecting multiple GPUs in the single node with a high bandwidth network (such as NVLink), or exploiting Multi-Chip-Module (MCM) packaging, where multiple GPU modules are integrated in a single package. The inter-GPU bandwidth is an expensive and critical resource for designing multi-GPU systems. The design of the inter-GPU network can impact performance significantly. To address this challenge, in this paper we explore the potential of hardware-based memory compression algorithms to save bandwidth and improve energy efficiency in multi-GPU systems. Specifically, we propose an adaptive inter-GPU data compression scheme to efficiently improve both performance and energy efficiency. Our evaluation shows that the proposed optimization on multi-GPU architectures can reduce the inter-GPU traffic up to 62%, improve system performance by up to 33%, and save energy spent powering the communication fabric by 45%, on average.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127660120","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Distributed Dominating Set and Connected Dominating Set Construction Under the Dynamic SINR Model
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00092
Dongxiao Yu, Yifei Zou, Yong Zhang, Feng Li, Jiguo Yu, Yu Wu, Xiuzhen Cheng, F. Lau
This paper investigates distributed Dominating Set (DS) and Connected Dominating Set (CDS) construction in dynamic wireless networks under the SINR interference model. Specifically, we present a new model for dynamic networks that admits both churn (due to node arrivals and departures) and node mobility. Under this dynamic model, we propose efficient algorithms to construct a DS and a CDS with constant approximation ratios w.r.t. the corresponding minimum ones in O(log n) time with a high probability guarantee. To the best of our knowledge, these are the first known algorithms for DS and CDS construction in dynamic networks under the SINR interference model. We believe our dynamic network model can greatly facilitate the study of distributed algorithms in mobile and dynamic wireless networks.
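For readers unfamiliar with the interference model, the standard SINR reception condition is shown below; the paper may use slightly different parameter conventions, so treat this as the generic formulation rather than its exact model.

```latex
% A transmission from u is received by v iff the signal-to-interference-plus-
% noise ratio at v exceeds a hardware-dependent threshold \beta.
\[
  \mathrm{SINR}(u,v) \;=\;
  \frac{P_u \cdot d(u,v)^{-\alpha}}
       {N + \sum_{w \in I \setminus \{u\}} P_w \cdot d(w,v)^{-\alpha}}
  \;\ge\; \beta
\]
% P_u: transmission power of u, d(u,v): distance between u and v,
% \alpha: path-loss exponent, N: ambient noise,
% I: set of nodes transmitting concurrently with u.
```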
{"title":"Distributed Dominating Set and Connected Dominating Set Construction Under the Dynamic SINR Model","authors":"Dongxiao Yu, Yifei Zou, Yong Zhang, Feng Li, Jiguo Yu, Yu Wu, Xiuzhen Cheng, F. Lau","doi":"10.1109/IPDPS.2019.00092","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00092","url":null,"abstract":"This paper investigates distributed Dominating Set (DS) and Connected Dominating Set (CDS) construction in dynamic wireless networks under the SINR interference model. Specifically, we present a new model for dynamic networks that admits both churns (due to node arrivals/departures) and node mobility. Under this dynamic model, we propose efficient algorithms to construct a DS and a CDS with constant approximation ratios w.r.t. the corresponding minimum ones in O(log n) time with a high probability guarantee. To the best of our knowledge, these algorithms are the first known ones for DS and CDS construction in dynamic networks assuming the SINR interference model. We believe our dynamic network model can greatly facilitate distributed algorithm studies in mobile and dynamic wireless networks.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"118 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134540606","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast Batched Matrix Multiplication for Small Sizes Using Half-Precision Arithmetic on GPUs
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00022
A. Abdelfattah, S. Tomov, J. Dongarra
Matrix multiplication (GEMM) is the most important operation in dense linear algebra. Because it is a compute-bound operation that is rich in data reuse, many applications from different scientific domains cast their most performance-critical stages as GEMM. With the rise of batch linear algebra, batched GEMM operations have become increasingly popular in domains other than dense linear solvers, such as tensor contractions, sparse direct solvers, and machine learning. In particular for the latter, batched GEMM in reduced precision (i.e., FP16) has been the core operation of many deep learning frameworks. This paper introduces an optimized batched GEMM for FP16 arithmetic (HGEMM) on graphics processing units (GPUs). We provide a detailed design strategy that takes advantage of the Tensor Core technology recently introduced in CUDA-enabled GPUs. The developed solution uses low-level APIs provided by the vendor in an optimized design that overcomes the limitations imposed by the hardware (in the form of discrete configurations). The outcome is a highly flexible GPU kernel that provides a lot of control to the developer despite the aforementioned restrictions. The paper also pays particular attention to multiplications of very small matrices that cannot fully occupy the Tensor Core units. Our results show that the proposed design can outperform the highly optimized vendor routine for sizes up to 100 by factors between 1.2x and 10x on a Tesla V100 GPU. For extremely small matrices, the observed speedups range between 1.8x and 26x.
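For orientation, the host-side sketch below shows how a batched FP16 GEMM is typically invoked through cuBLAS (cublasHgemmBatched), assuming that routine is representative of the "vendor routine" the abstract compares against; the matrix sizes and zero-initialized contents are arbitrary placeholders, and error checking is omitted.

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
    const int m = 16, n = 16, k = 16, batch = 1000;
    cublasHandle_t handle;
    cublasCreate(&handle);

    // One contiguous allocation per operand; each batch entry is a slice of it.
    __half *A, *B, *C;
    cudaMalloc(&A, sizeof(__half) * m * k * batch);
    cudaMalloc(&B, sizeof(__half) * k * n * batch);
    cudaMalloc(&C, sizeof(__half) * m * n * batch);
    cudaMemset(A, 0, sizeof(__half) * m * k * batch);
    cudaMemset(B, 0, sizeof(__half) * k * n * batch);
    cudaMemset(C, 0, sizeof(__half) * m * n * batch);

    // cublasHgemmBatched takes device arrays of per-matrix pointers.
    std::vector<__half*> hA(batch), hB(batch), hC(batch);
    for (int i = 0; i < batch; ++i) {
        hA[i] = A + i * m * k;
        hB[i] = B + i * k * n;
        hC[i] = C + i * m * n;
    }
    __half **dA, **dB, **dC;
    cudaMalloc(&dA, sizeof(__half*) * batch);
    cudaMalloc(&dB, sizeof(__half*) * batch);
    cudaMalloc(&dC, sizeof(__half*) * batch);
    cudaMemcpy(dA, hA.data(), sizeof(__half*) * batch, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), sizeof(__half*) * batch, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), sizeof(__half*) * batch, cudaMemcpyHostToDevice);

    // C[i] = alpha * A[i] * B[i] + beta * C[i] for every matrix in the batch.
    __half alpha = __float2half(1.0f), beta = __float2half(0.0f);
    cublasHgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                       &alpha, dA, m, dB, k, &beta, dC, m, batch);

    cudaDeviceSynchronize();
    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```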
{"title":"Fast Batched Matrix Multiplication for Small Sizes Using Half-Precision Arithmetic on GPUs","authors":"A. Abdelfattah, S. Tomov, J. Dongarra","doi":"10.1109/IPDPS.2019.00022","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00022","url":null,"abstract":"Matrix multiplication (GEMM) is the most important operation in dense linear algebra. Because it is a compute-bound operation that is rich in data reuse, many applications from different scientific domains cast their most performance-critical stages to use GEMM. With the rise of batch linear algebra, batched GEMM operations have become increasingly popular in domains other than dense linear solvers, such as tensor contractions, sparse direct solvers, and machine learning. In particular for the latter, batched GEMM in reduced precision (i.e., FP16) has been the core operation of many deep learning frameworks. This paper introduces an optimized batched GEMM for FP16 arithmetic (HGEMM) using graphics processing units (GPUs). We provide a detailed design strategy that takes advantage of the Tensor Core technology that was recently introduced in CUDA-enabled GPUs. The developed solution uses low-level APIs provided by the vendor in an optimized design that overcomes the limitations imposed by the hardware (in the form of discrete configurations). The outcome is a highly flexible GPU kernel that provides a lot of controls to the developer, despite the aforementioned restrictions. The paper also pays particular attention to multiplications of very small matrices that cannot fully occupy the Tensor Core units. Our results show that the proposed design can outperform the highly optimized vendor routine for sizes up to 100 by factors between 1.2x and 10x using a Tesla V100 GPU. For extremely small matrices, the observed speedups range between 1.8x and 26x.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124720763","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Semantics-Aware Virtual Machine Image Management in IaaS Clouds
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00052
Nishant Saurabh, Julian Remmers, Dragi Kimovski, R. Prodan, Jorge G. Barbosa
Infrastructure-as-a-Service (IaaS) clouds concurrently accommodate diverse sets of user requests, requiring an efficient strategy for storing and retrieving virtual machine images (VMIs) at a large scale. VMI storage management requires dealing with multiple VMIs, typically gigabytes in size, which entails VMI sprawl issues that hinder elastic resource management and provisioning. Nevertheless, existing techniques for VMI management overlook VMI semantics (i.e., at the level of the base image and software packages), offering either a restricted ability to identify and extract reusable functionalities or high VMI publish and retrieval overheads. In this paper, we design, implement and evaluate Expelliarmus, a novel VMI management system that helps to minimize storage, publish and retrieval overheads. To achieve this goal, Expelliarmus incorporates three complementary features. First, it models VMIs as semantic graphs to expedite the similarity computation between multiple VMIs. Second, Expelliarmus provides semantics-aware VMI decomposition and base image selection to extract and store non-redundant base images and software packages. Third, Expelliarmus can assemble VMIs based on the required software packages upon user request. We evaluate Expelliarmus through a representative set of synthetic cloud VMIs on a real test-bed. Experimental results show that our semantics-centric approach is able to reduce repository size by 2.2-16 times compared to state-of-the-art systems (e.g., IBM's Mirage and Hemera) with significant improvement in VMI publish and retrieval performance.
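One simple way to think about package-level similarity between two VMIs is the Jaccard similarity of their installed-package sets; the paper's semantic-graph comparison is richer than this, so the C++ sketch below is only an illustrative stand-in, not Expelliarmus's algorithm.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <set>
#include <string>

// Jaccard similarity of two package sets: |A ∩ B| / |A ∪ B|.
double packageSimilarity(const std::set<std::string>& a,
                         const std::set<std::string>& b) {
    std::set<std::string> common;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(common, common.begin()));
    std::size_t unionSize = a.size() + b.size() - common.size();
    return unionSize == 0 ? 1.0
                          : static_cast<double>(common.size()) / unionSize;
}
```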
{"title":"Semantics-Aware Virtual Machine Image Management in IaaS Clouds","authors":"Nishant Saurabh, Julian Remmers, Dragi Kimovski, R. Prodan, Jorge G. Barbosa","doi":"10.1109/IPDPS.2019.00052","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00052","url":null,"abstract":"Infrastructure-as-a-service (IaaS) Clouds concurrently accommodate diverse sets of user requests, requiring an efficient strategy for storing and retrieving virtual machine images (VMIs) at a large scale. The VMI storage management require dealing with multiple VMIs, typically in the magnitude of gigabytes, which entails VMI sprawl issues hindering the elastic resource management and provisioning. Nevertheless, existing techniques to facilitate VMI management overlook VMI semantics (i.e at the level of base image and software packages) with either restricted possibility to identify and extract reusable functionalities or with higher VMI publish and retrieval overheads. In this paper, we design, implement and evaluate Expelliarmus, a novel VMI management system that helps to minimize storage, publish and retrieval overheads. To achieve this goal, Expelliarmus incorporates three complementary features. First, it makes use of VMIs modelled as semantic graphs to expedite the similarity computation between multiple VMIs. Second, Expelliarmus provides a semantic aware VMI decomposition and base image selection to extract and store non-redundant base image and software packages. Third, Expelliarmus can also assemble VMIs based on the required software packages upon user request. We evaluate Expelliarmus through a representative set of synthetic Cloud VMIs on the real test-bed. Experimental results show that our semantic-centric approach is able to optimize repository size by 2.2-16 times compared to state-of-the-art systems (e.g. IBM's Mirage and Hemera) with significant VMI publish and retrieval performance improvement.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125131537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
iez: Resource Contention Aware Load Balancing for Large-Scale Parallel File Systems
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00070
Bharti Wadhwa, A. Paul, Sarah Neuwirth, Feiyi Wang, S. Oral, A. Butt, Jon Bernard, K. Cameron
Parallel I/O performance is crucial to sustaining scientific applications on large-scale High-Performance Computing (HPC) systems. However, I/O load imbalance in the underlying distributed and shared storage systems can significantly reduce overall application performance. There are two conflicting challenges to mitigate this load imbalance: (i) optimizing system-wide data placement to maximize the bandwidth advantages of distributed storage servers, i.e., allocating I/O resources efficiently across applications and job runs; and (ii) optimizing client-centric data movement to minimize I/O request latency between clients and servers, i.e., allocating I/O resources efficiently in service of a single application and job run. Moreover, existing approaches that require application changes limit widespread adoption in commercial or proprietary deployments. We propose iez, an "end-to-end control plane" where clients transparently and adaptively write to a set of selected I/O servers to achieve balanced data placement. Our control plane leverages real-time load information for global data placement across distributed storage servers, while our design model leverages trace-based optimization techniques to minimize I/O request latency between clients and servers. We evaluate our proposed system on an experimental cluster for two common use cases: the synthetic I/O benchmark IOR for large sequential writes and a scientific application I/O kernel, HACC-I/O. Results show read and write performance improvements of up to 34% and 32%, respectively, compared to the state of the art.
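The core contention-aware idea (route the next write to the servers that are currently least loaded) can be sketched in a few lines of C++; iez's actual placement combines global, trace-driven decisions with client-side adaptation, so the struct and function below are illustrative assumptions only.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct ServerLoad { int id; double pendingBytes; };  // real-time load report

// Pick the k least-loaded I/O servers for a client's next write.
std::vector<int> pickServers(std::vector<ServerLoad> servers, std::size_t k) {
    std::size_t count = std::min(k, servers.size());
    std::partial_sort(servers.begin(), servers.begin() + count, servers.end(),
                      [](const ServerLoad& a, const ServerLoad& b) {
                          return a.pendingBytes < b.pendingBytes;
                      });
    std::vector<int> chosen;
    for (std::size_t i = 0; i < count; ++i)
        chosen.push_back(servers[i].id);
    return chosen;
}
```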
{"title":"iez: Resource Contention Aware Load Balancing for Large-Scale Parallel File Systems","authors":"Bharti Wadhwa, A. Paul, Sarah Neuwirth, Feiyi Wang, S. Oral, A. Butt, Jon Bernard, K. Cameron","doi":"10.1109/IPDPS.2019.00070","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00070","url":null,"abstract":"Parallel I/O performance is crucial to sustaining scientific applications on large-scale High-Performance Computing (HPC) systems. However, I/O load imbalance in the underlying distributed and shared storage systems can significantly reduce overall application performance. There are two conflicting challenges to mitigate this load imbalance: (i) optimizing systemwide data placement to maximize the bandwidth advantages of distributed storage servers, i.e., allocating I/O resources efficiently across applications and job runs; and (ii) optimizing client-centric data movement to minimize I/O load request latency between clients and servers, i.e., allocating I/O resources efficiently in service to a single application and job run. Moreover, existing approaches that require application changes limit wide-spread adoption in commercial or proprietary deployments. We propose iez, an \"end-to-end control plane\" where clients transparently and adaptively write to a set of selected I/O servers to achieve balanced data placement. Our control plane leverages realtime load information for distributed storage server global data placement while our design model leverages trace-based optimization techniques to minimize I/O load request latency between clients and servers. We evaluate our proposed system on an experimental cluster for two common use cases: synthetic I/O benchmark IOR for large sequential writes and a scientific application I/O kernel, HACC-I/O. Results show read and write performance improvements of up to 34% and 32%, respectively, compared to the state of the art.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127894168","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sizing and Partitioning Strategies for Burst-Buffers to Reduce IO Contention
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00072
G. Aupy, Olivier Beaumont, Lionel Eyraud-Dubois
Burst-Buffers are high-throughput, small-capacity storage used as intermediate storage between the PFS (Parallel File System) and the computational nodes of modern HPC systems. They can help mitigate contention on the PFS, a shared resource whose read and write performance increases more slowly than processing power in HPC systems. A second usage is to accelerate data transfers and hide the latency to the PFS. In this paper, we concentrate on the first usage. We propose a model for Burst-Buffers and application transfers, and we consider the problem of dimensioning and sharing the Burst-Buffers between several applications. This dimensioning can be done either dynamically or statically. The dynamic allocation considers that any application can use any available portion of the Burst-Buffers. The static allocation considers that when a new application enters the system, it is assigned some portion of the Burst-Buffers, which cannot be used by the other applications until that application leaves the system and its data is purged. We show that the general sharing problem of guaranteeing fair performance for all applications is NP-complete. We propose polynomial-time algorithms for the special case of finding the optimal buffer size such that no application is slowed down due to PFS contention, both in the static and dynamic cases. Finally, we evaluate our algorithms in realistic settings and use these evaluations to discuss how to minimize the overhead of static allocation of buffers compared to dynamic allocation.
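The intuition behind the static/dynamic trade-off can be shown with a toy calculation (this is not the paper's model, and `demand[a][t]` is a made-up input meaning application a's buffered volume at time step t): static partitioning must reserve each application's own peak, while dynamic sharing only needs the peak of the aggregate demand, which is never larger.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Buffer capacity needed if each application gets its own static partition:
// the sum of per-application peaks.
double staticSize(const std::vector<std::vector<double>>& demand) {
    double total = 0.0;
    for (const auto& app : demand)
        total += *std::max_element(app.begin(), app.end());
    return total;
}

// Buffer capacity needed if the whole buffer is shared dynamically:
// the peak of the aggregate demand over time.
double dynamicSize(const std::vector<std::vector<double>>& demand) {
    std::size_t steps = demand.empty() ? 0 : demand[0].size();
    double peak = 0.0;
    for (std::size_t t = 0; t < steps; ++t) {
        double sum = 0.0;
        for (const auto& app : demand) sum += app[t];
        peak = std::max(peak, sum);
    }
    return peak;
}
```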
{"title":"Sizing and Partitioning Strategies for Burst-Buffers to Reduce IO Contention","authors":"G. Aupy, Olivier Beaumont, Lionel Eyraud-Dubois","doi":"10.1109/IPDPS.2019.00072","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00072","url":null,"abstract":"Burst-Buffers are high throughput and small size storage which are being used as an intermediate storage between the PFS (Parallel File System) and the computational nodes of modern HPC systems. They can allow to hinder to contention to the PFS, a shared resource whose read and write performance increase slower than processing power in HPC systems. A second usage is to accelerate data transfers and to hide the latency to the PFS. In this paper, we concentrate on the first usage. We propose a model for Burst-Buffers and application transfers. We consider the problem of dimensioning and sharing the Burst-Buffers between several applications. This dimensioning can be done either dynamically or statically. The dynamic allocation considers that any application can use any available portion of the Burst-Buffers. The static allocation considers that when a new application enters the system, it is assigned some portion of the Burst-Buffers, which cannot be used by the other applications until that application leaves the system and its data is purged from it. We show that the general sharing problem to guarantee fair performance for all applications is an NP-Complete problem. We propose a polynomial time algorithms for the special case of finding the optimal buffer size such that no application is slowed down due to PFS contention, both in the static and dynamic cases. Finally, we provide evaluations of our algorithms in realistic settings. We use those to discuss how to minimize the overhead of the static allocation of buffers compared to the dynamic allocation.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128985039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cpp-Taskflow: Fast Task-Based Parallel Programming Using Modern C++
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00105
Tsung-Wei Huang, Chun-Xun Lin, Guannan Guo, Martin D. F. Wong
In this paper we introduce Cpp-Taskflow, a new C++ tasking library that helps developers quickly write parallel programs using task dependency graphs. Cpp-Taskflow leverages the power of modern C++ and task-based approaches to enable efficient implementations of parallel decomposition strategies. Our programming model can handle not only traditional loop-level parallelism but also irregular patterns such as graph algorithms, incremental flows, and dynamic data structures. Compared with existing libraries, Cpp-Taskflow is more cost-efficient in performance scaling and software integration. We have evaluated Cpp-Taskflow on both micro-benchmarks and real-world applications with million-scale tasking. In a machine learning example, Cpp-Taskflow achieved 1.5–2.7× less coding complexity and a 14–38% speed-up over two industrial-strength libraries, OpenMP Tasking and Intel Threading Building Blocks (TBB).
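The canonical usage pattern is to declare tasks, wire up their dependencies, and hand the resulting graph to an executor. The example below follows the API of current Taskflow releases, which may differ slightly in naming from the 2019 version described in the paper.

```cpp
#include <taskflow/taskflow.hpp>

int main() {
    tf::Executor executor;   // thread pool that runs task graphs
    tf::Taskflow taskflow;   // the task dependency graph being built

    // Four tasks; the bodies are placeholders for real work.
    auto [A, B, C, D] = taskflow.emplace(
        [] { /* load data     */ },
        [] { /* preprocess    */ },
        [] { /* build model   */ },
        [] { /* write results */ });

    A.precede(B, C);   // A runs before B and C (B and C may run in parallel)
    D.succeed(B, C);   // D runs only after both B and C complete

    executor.run(taskflow).wait();   // submit the graph and wait for completion
    return 0;
}
```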
{"title":"Cpp-Taskflow: Fast Task-Based Parallel Programming Using Modern C++","authors":"Tsung-Wei Huang, Chun-Xun Lin, Guannan Guo, Martin D. F. Wong","doi":"10.1109/IPDPS.2019.00105","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00105","url":null,"abstract":"In this paper we introduce Cpp-Taskflow, a new C++ tasking library to help developers quickly write parallel programs using task dependency graphs. Cpp-Taskflow leverages the power of modern C++ and task-based approaches to enable efficient implementations of parallel decomposition strategies. Our programming model can quickly handle not only traditional loop-level parallelism, but also irregular patterns such as graph algorithms, incremental flows, and dynamic data structures. Compared with existing libraries, Cpp-Taskflow is more cost efficient in performance scaling and software integration. We have evaluated Cpp-Taskflow on both micro-benchmarks and real-world applications with million-scale tasking. In a machine learning example, Cpp-Taskflow achieved 1.5–2.7× less coding complexity and 14–38% speed-up over two industrial-strength libraries OpenMP Tasking and Intel Threading Building Blocks (TBB).","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"212 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124158852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}