Bandwidth-Optimal Random Shuffling for GPUs
Rory Mitchell, Daniel Stokes, E. Frank, G. Holmes
Linear-time algorithms traditionally used to shuffle data on CPUs, such as the Fisher-Yates method, are not well suited to implementation on GPUs due to their inherent sequential dependencies, and existing parallel shuffling algorithms are unsuitable for GPU architectures because they incur a large number of read/write operations to high-latency global memory. To address this, we provide a method of generating pseudo-random permutations in parallel by fusing suitable pseudo-random bijective functions with stream compaction operations. Our algorithm, termed "bijective shuffle," trades increased per-thread arithmetic operations for reduced global memory transactions. It is work-efficient, deterministic, and requires only a single global memory read and write per shuffle input, thus maximising use of global memory bandwidth. To empirically demonstrate the correctness of the algorithm, we develop a statistical test for the quality of pseudo-random permutations based on kernel space embeddings. Experimental results show that the bijective shuffle algorithm outperforms competing algorithms on GPUs, yielding improvements of between one and two orders of magnitude and approaching peak device bandwidth.
ACM Transactions on Parallel Computing, published 2021-06-11. DOI: https://doi.org/10.1145/3505287
Engineering In-place (Shared-memory) Sorting Algorithms
Michael Axtmann, Sascha Witt, Daniel Ferizovic, P. Sanders
We present new sequential and parallel sorting algorithms that now represent the fastest known techniques for a wide range of input sizes, input distributions, data types, and machines. Somewhat surprisingly, part of the speed advantage is due to the algorithms' additional ability to work in-place, i.e., they do not need a significant amount of space beyond the input array. Previously, the in-place feature often implied performance penalties. Our main algorithmic contribution is a blockwise approach to in-place data distribution that is provably cache-efficient. We also parallelize this approach, taking dynamic load balancing and memory locality into account. Our new comparison-based algorithm, In-place Parallel Super Scalar Samplesort (IPS4o), combines this technique with branchless decision trees. By taking cases with many equal elements into account and by adapting the distribution degree dynamically, we obtain a highly robust algorithm that outperforms the best previous in-place parallel comparison-based sorting algorithms by almost a factor of three. It also outperforms the best comparison-based competitors regardless of whether we consider in-place or not in-place, parallel or sequential settings. Another surprising result is that IPS4o even outperforms the best (in-place or not in-place) integer sorting algorithms in a wide range of situations. In many of the remaining cases (often involving near-uniform input distributions, small keys, or a sequential setting), our new In-place Parallel Super Scalar Radix Sort (IPS2Ra) turns out to be the best algorithm. Claims to have the – in some sense – "best" sorting algorithm can be found in many papers, and they cannot all be true. Therefore, we base our conclusions on an extensive experimental study involving a large part of the cross product of 21 state-of-the-art sorting codes, 6 data types, 10 input distributions, 4 machines, 4 memory allocation strategies, and input sizes varying over 7 orders of magnitude. This confirms the claims made about the robust performance of our algorithms while revealing major performance problems in many competitors outside the concrete set of measurements reported in the associated publications. This is particularly true for integer sorting algorithms, giving one reason to prefer comparison-based algorithms for robust general-purpose sorting.
{"title":"Engineering In-place (Shared-memory) Sorting Algorithms","authors":"Michael Axtmann, Sascha Witt, Daniel Ferizovic, P. Sanders","doi":"10.1145/3505286","DOIUrl":"https://doi.org/10.1145/3505286","url":null,"abstract":"We present new sequential and parallel sorting algorithms that now represent the fastest known techniques for a wide range of input sizes, input distributions, data types, and machines. Somewhat surprisingly, part of the speed advantage is due to the additional feature of the algorithms to work in-place, i.e., they do not need a significant amount of space beyond the input array. Previously, the in-place feature often implied performance penalties. Our main algorithmic contribution is a blockwise approach to in-place data distribution that is provably cache-efficient. We also parallelize this approach taking dynamic load balancing and memory locality into account. Our new comparison-based algorithm In-place Parallel Super Scalar Samplesort (IPS4o), combines this technique with branchless decision trees. By taking cases with many equal elements into account and by adapting the distribution degree dynamically, we obtain a highly robust algorithm that outperforms the best previous in-place parallel comparison-based sorting algorithms by almost a factor of three. That algorithm also outperforms the best comparison-based competitors regardless of whether we consider in-place or not in-place, parallel or sequential settings. Another surprising result is that IPS4o even outperforms the best (in-place or not in-place) integer sorting algorithms in a wide range of situations. In many of the remaining cases (often involving near-uniform input distributions, small keys, or a sequential setting), our new In-place Parallel Super Scalar Radix Sort (IPS2Ra) turns out to be the best algorithm. Claims to have the – in some sense – “best” sorting algorithm can be found in many papers which cannot all be true. Therefore, we base our conclusions on an extensive experimental study involving a large part of the cross product of 21 state-of-the-art sorting codes, 6 data types, 10 input distributions, 4 machines, 4 memory allocation strategies, and input sizes varying over 7 orders of magnitude. This confirms the claims made about the robust performance of our algorithms while revealing major performance problems in many competitors outside the concrete set of measurements reported in the associated publications. This is particularly true for integer sorting algorithms giving one reason to prefer comparison-based algorithms for robust general-purpose sorting.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2020-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43709384","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FEAST: A Lightweight Lock-free Concurrent Binary Search Tree
Aravind Natarajan, Arunmoezhi Ramachandran, N. Mittal
We present a lock-free algorithm for concurrent manipulation of a binary search tree (BST) in an asynchronous shared-memory system that supports search, insert, and delete operations. In addition to read and write instructions, our algorithm uses (single-word) compare-and-swap (CAS) and bit-test-and-set (BTS) read-modify-write (RMW) instructions, both of which are commonly supported by many modern processors, including Intel 64 and AMD64. In contrast to most existing concurrent algorithms for a binary search tree, our algorithm is edge-based rather than node-based. When compared to other concurrent algorithms for a binary search tree, modify (insert and delete) operations in our algorithm (a) work on a smaller section of the tree, (b) execute fewer RMW instructions, or (c) use fewer dynamically allocated objects. In our experiments, our lock-free algorithm significantly outperformed all other algorithms for a concurrent binary search tree, especially when contention was high. We also describe modifications to our basic lock-free algorithm so that the amortized complexity of any operation in the modified algorithm can be bounded by the sum of the tree height and the point contention to within a constant factor, while preserving the other desirable features of our algorithm.
{"title":"FEAST: A Lightweight Lock-free Concurrent Binary Search Tree","authors":"Aravind Natarajan, Arunmoezhi Ramachandran, N. Mittal","doi":"10.1145/3391438","DOIUrl":"https://doi.org/10.1145/3391438","url":null,"abstract":"We present a lock-free algorithm for concurrent manipulation of a binary search tree (BST) in an asynchronous shared memory system that supports search, insert, and delete operations. In addition to read and write instructions, our algorithm uses (single-word) compare-and-swap (CAS) and bit-test-and-set (BTS) read-modify-write (RMW) instructions, both of which are commonly supported by many modern processors including Intel 64 and AMD64. In contrast to most of the existing concurrent algorithms for a binary search tree, our algorithm is edge-based rather than node-based. When compared to other concurrent algorithms for a binary search tree, modify (insert and delete) operations in our algorithm (a) work on a smaller section of the tree, (b) execute fewer RMW instructions, or (c) use fewer dynamically allocated objects. In our experiments, our lock-free algorithm significantly outperformed all other algorithms for a concurrent binary search tree especially when the contention was high. We also describe modifications to our basic lock-free algorithm so that the amortized complexity of any operation in the modified algorithm can be bounded by the sum of the tree height and the point contention to within a constant factor while preserving the other desirable features of our algorithm.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2020-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82801233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ROC: A Reconfigurable Optical Computer for Simulating Physical Processes
Jeff Anderson, Engin Kayraklioglu, Shuai Sun, Joseph Crandall, Y. Alkabani, Vikram K. Narayana, V. Sorger, T. El-Ghazawi
Due to the end of Moore's law and Dennard scaling, we are entering a new era of processors. Computing systems increasingly face power and performance challenges due to both device- and circuit-related limitations associated with resistive and capacitive charging. Non-von Neumann architectures are needed to support future computations in the post-Moore's-law era. To enable such emerging architectures with high performance at ultra-low power, both parallel computation and inter-node communication on the chip can be supported using photons. To this end, we introduce ROC, a reconfigurable optical computer that can solve partial differential equations (PDEs). PDE solvers form the basis for many traditional simulation problems in science and engineering that are currently performed on supercomputers. Instead of solving problems iteratively, the proposed engine uses a resistive mesh architecture to solve a PDE in a single iteration (one shot). Instead of using actual electrical circuits, the underlying physical hardware emulates such structures using a silicon-photonic mesh that splits light into separate pathways, allowing it to add or subtract optical power analogously to programmable resistors. The time to obtain the PDE solution then depends only on the time-of-flight of a photon through the programmed mesh, which can be on the order of tens of picoseconds given the millimeter-compact integrated photonic circuit. Numerically validated experimental results show that, over multiple configurations, ROC can achieve several orders of magnitude improvement over state-of-the-art GPUs when speed, power, and size are taken into account. Further, it achieves approximately 90% of the precision of current numerical solvers. As such, ROC can be a viable reconfigurable, approximate computer, with the potential for more precise results when its silicon-photonic building blocks are replaced with nanoscale photonic lumped elements.
{"title":"ROC: A Reconfigurable Optical Computer for Simulating Physical Processes","authors":"Jeff Anderson, Engin Kayraklioglu, Shuai Sun, Joseph Crandall, Y. Alkabani, Vikram K. Narayana, V. Sorger, T. El-Ghazawi","doi":"10.1145/3380944","DOIUrl":"https://doi.org/10.1145/3380944","url":null,"abstract":"Due to the end of Moore’s law and Dennard scaling, we are entering a new era of processors. Computing systems are increasingly facing power and performance challenges due to both deviceand circuit-related challenges with resistive and capacitive charging. Non-von Neumann architectures are needed to support future computations through innovative post-Moore’s law architectures. To enable these emerging architectures with high-performance and at ultra-low power, both parallel computation and inter-node communication on-the-chip can be supported using photons. To this end, we introduce ROC, a reconfigurable optical computer that can solve partial differential equations (PDEs). PDE solvers form the basis for many traditional simulation problems in science and engineering that are currently performed on supercomputers. Instead of solving problems iteratively, the proposed engine uses a resistive mesh architecture to solve a PDE in a single iteration (one-shot). Instead of using actual electrical circuits, the physical underlying hardware emulates such structures using a silicon-photonics mesh that splits light into separate pathways, allowing it to add or subtract optical power analogous to programmable resistors. The time to obtain the PDE solution then only depends on the time-of-flight of a photon through the programmed mesh, which can be on the order of 10’s of picoseconds given the millimeter-compact integrated photonic circuit. Numerically validated experimental results show that, over multiple configurations, ROC can achieve several orders of magnitude improvement over state-of-the-art GPUs when speed, power, and size are taken into account. Further, it comes within approximately 90% precision of current numerical solvers. As such, ROC can be a viable reconfigurable, approximate computer with the potential for more precise results when replacing silicon-photonics building blocks with nanoscale photonic lumped-elements.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2020-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78470620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scheduling Mutual Exclusion Accesses in Equal-Length Jobs
D. Kagaris, S. Dutta
A fundamental problem in parallel and distributed processing is the partial serialization that is imposed due to the need for mutually exclusive access to common resources. In this article, we investigate the problem of optimally scheduling (in terms of makespan) a set of jobs, where each job consists of the same number $L$ of unit-duration tasks, and each task either accesses exclusively one resource from a given set of resources or accesses a fully shareable resource. We develop and establish the optimality of a fast polynomial-time algorithm to find a schedule with the shortest makespan for any number of jobs and for any number of resources for the case $L = 2$. In the notation commonly used for job-shop scheduling problems, this result means that the problem $J \mid d_{ij} = 1, n_j = 2 \mid C_{\max}$ is polynomially solvable, adding to the polynomial solutions known for the problems $J2 \mid n_j \le 2 \mid C_{\max}$ and $J2 \mid d_{ij} = 1 \mid C_{\max}$ (whereas other closely related versions such as $J2 \mid n_j \le 3 \mid C_{\max}$, $J2 \mid d_{ij} \in \{1,2\} \mid C_{\max}$, $J3 \mid n_j \le 2 \mid C_{\max}$, $J3 \mid d_{ij} = 1 \mid C_{\max}$, and $J \mid d_{ij} = 1, n_j \le 3 \mid C_{\max}$ are all known to be NP-complete). For the general case $L > 2$ (i.e., for the job-shop problem $J \mid d_{ij} = 1, n_j = L > 2 \mid C_{\max}$), we present a competitive heuristic and provide experimental comparisons with other heuristic versions and, when possible, with the ideal integer linear programming formulation.
ACM Transactions on Parallel Computing, published 2019-09-10. DOI: https://doi.org/10.1145/3342562
I/O Scheduling Strategy for Periodic Applications
G. Aupy, Ana Gainaru, Valentin Le Fèvre
With the ever-growing need for data in HPC applications, congestion at the I/O level becomes critical in supercomputers. Architectural enhancements such as burst buffers and prefetching have been added to machines but are not sufficient to prevent congestion. Recent online I/O scheduling strategies have been put in place, but they add an additional congestion point and incur overheads in the computation of applications. In this work, we show how to take advantage of the periodic nature of HPC applications to develop efficient periodic scheduling strategies for their I/O transfers. Our strategy computes, once during the job-scheduling phase, a pattern that defines the I/O behavior of each application, after which the applications run independently, performing their I/O at the specified times. Our strategy limits the amount of congestion at the I/O-node level and can be easily integrated into current job schedulers. We validate this model through extensive simulations and experiments on an HPC cluster, comparing it to state-of-the-art online solutions. The results show not only that our scheduler has the advantage of being decentralized, thus avoiding the overhead of online schedulers, but also that it performs better than the other solutions, improving application dilation by up to 16% and maximum system efficiency by up to 18%.
ACM Transactions on Parallel Computing, published 2019-09-10. DOI: https://doi.org/10.1145/3338510
Modeling Universal Globally Adaptive Load-Balanced Routing
Md Atiqul Mollah, Wenqi Wang, Peyman Faizian, Md. Shafayat Rahman, Xin Yuan, S. Pakin, M. Lang
Universal globally adaptive load-balanced (UGAL) routing has been proposed for various interconnection networks and has been deployed in a number of current-generation supercomputers. Although UGAL-based schemes have been extensively studied, most existing results are based on either simulation or measurement. Without a theoretical understanding of UGAL, several questions remain open: for which traffic patterns is UGAL best suited, and what determines the performance of a UGAL-based scheme on a particular network configuration? In this work, we develop a set of throughput models for UGAL based on linear programming. We show that the throughput models are valid across the torus, Dragonfly, and Slim Fly network topologies. Finally, we identify a robust model that can accurately and efficiently predict UGAL throughput for a set of representative traffic patterns across different topologies. Our models not only provide a mechanism to predict UGAL performance on large-scale interconnection networks but also reveal the inner workings of UGAL and further our understanding of this type of routing.
{"title":"Modeling Universal Globally Adaptive Load-Balanced Routing","authors":"Md Atiqul Mollah, Wenqi Wang, Peyman Faizian, Md. Shafayat Rahman, Xin Yuan, S. Pakin, M. Lang","doi":"10.1145/3349620","DOIUrl":"https://doi.org/10.1145/3349620","url":null,"abstract":"Universal globally adaptive load-balanced (UGAL) routing has been proposed for various interconnection networks and has been deployed in a number of current-generation supercomputers. Although UGAL-based schemes have been extensively studied, most existing results are based on either simulation or measurement. Without a theoretical understanding of UGAL, multiple questions remain: For which traffic patterns is UGAL most suited? In addition, what determines the performance of the UGAL-based scheme on a particular network configuration? In this work, we develop a set of throughput models for UGALbased on linear programming. We show that the throughput models are valid across the torus, Dragonfly, and Slim Fly network topologies. Finally, we identify a robust model that can accurately and efficiently predict UGAL throughput for a set of representative traffic patterns across different topologies. Our models not only provide a mechanism to predict UGAL performance on large-scale interconnection networks but also reveal the inner working of UGAL and further our understanding of this type of routing.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2019-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84908923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scalable Deep Learning via I/O Analysis and Optimization
S. Pumma, Min Si, W. Feng, P. Balaji
Scalable deep neural network training has been gaining prominence because of the increasing importance of deep learning in a multitude of scientific and commercial domains. Consequently, a number of researchers have investigated techniques to optimize deep learning systems. Much of the prior work has focused on runtime and algorithmic enhancements to optimize the computation and communication. Despite these enhancements, however, deep learning systems still suffer from scalability limitations, particularly with respect to data I/O. This situation is especially true for training models where the computation can be effectively parallelized, leaving I/O as the major bottleneck. In fact, our analysis shows that I/O can take up to 90% of the total training time. Thus, in this article, we first analyze LMDB, the most widely used I/O subsystem of deep learning frameworks, to understand the causes of this I/O inefficiency. Based on our analysis, we propose LMDBIO, an optimized I/O plugin for scalable deep learning. LMDBIO includes six novel optimizations that together address the various shortcomings in existing I/O for deep learning. Our experimental results show that LMDBIO significantly outperforms LMDB in all cases and improves overall application performance by up to 65-fold on a 9,216-core system.
ACM Transactions on Parallel Computing, published 2019-09-10. DOI: https://doi.org/10.1145/3331526