BiQGEMM: Matrix Multiplication with Lookup Table for Binary-Coding-Based Quantized DNNs
Pub Date : 2020-05-20  DOI: 10.1109/SC41405.2020.00099
Yongkweon Jeon, Baeseong Park, Se Jung Kwon, Byeongwook Kim, Jeongin Yun, Dongsoo Lee
The number of parameters in deep neural networks (DNNs) is rapidly increasing to support complicated tasks and to improve model accuracy. Correspondingly, the amount of computation and the required memory footprint increase as well. Quantization is an efficient way to address these concerns: it compresses DNNs so that computations are simplified while the required storage footprint is significantly reduced. Unfortunately, commercial CPUs and GPUs do not fully support quantization because only fixed-width data transfers (such as 32 bits) are allowed. As a result, even if weights are quantized into a few bits (by a non-uniform quantization scheme), CPUs and GPUs cannot access multiple quantized weights without wasting memory bandwidth. The practical success of quantization therefore relies on an efficient computation engine, especially for matrix multiplication, the basic computational kernel of most DNNs. In this paper, we propose a novel matrix multiplication method, called BiQGEMM, dedicated to quantized DNNs. BiQGEMM accesses multiple quantized weights simultaneously in a single instruction. In addition, BiQGEMM pre-computes intermediate results that are highly redundant when quantization restricts the space of possible values. Since the pre-computed values are stored in lookup tables and reused, BiQGEMM reduces the overall amount of computation. Our extensive experimental results show that BiQGEMM delivers higher performance than conventional schemes when DNNs are quantized.
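To make the lookup-table idea concrete, here is a minimal NumPy sketch of one way such a kernel can work, assuming 1-bit binary-coding quantization (W ≈ αB with B ∈ {−1, +1}) and a sub-vector width of 8; the chunk width, packing format, and all names are illustrative assumptions, not details taken from the paper.

```python
# Minimal lookup-table GEMV sketch in the spirit of BiQGEMM (assumptions:
# 1-bit binary coding W ~ alpha * B with B in {-1,+1}, sub-vector width MU = 8).
import numpy as np

MU = 8  # sub-vector length: one LUT with 2**MU entries per activation chunk

def build_lut(x_chunk):
    """Pre-compute all 2**MU signed partial sums of an activation chunk."""
    keys = np.arange(2 ** MU, dtype=np.uint32)
    # bit j of the key selects +x[j] (bit = 0) or -x[j] (bit = 1)
    signs = 1.0 - 2.0 * ((keys[:, None] >> np.arange(MU)) & 1)
    return signs @ x_chunk                        # shape: (2**MU,)

def biqgemm_1bit(packed_bits, alpha, x):
    """y = alpha * (B @ x), with B's {-1,+1} rows stored as packed bit keys.
    packed_bits: (out_features, num_chunks) uint32 keys, one per MU-wide chunk."""
    num_chunks = x.size // MU
    y = np.zeros(packed_bits.shape[0])
    for c in range(num_chunks):
        lut = build_lut(x[c * MU:(c + 1) * MU])   # shared by every output row
        y += lut[packed_bits[:, c]]               # one table lookup per row/chunk
    return alpha * y

# Tiny round-trip check against a dense reference.
rng = np.random.default_rng(0)
out_f, in_f = 16, 64
B = rng.choice([-1.0, 1.0], size=(out_f, in_f))
alpha, x = 0.05, rng.standard_normal(in_f)
bits = ((B.reshape(out_f, -1, MU) < 0) * (1 << np.arange(MU))).sum(-1).astype(np.uint32)
assert np.allclose(biqgemm_1bit(bits, alpha, x), alpha * (B @ x))
```

The point of the table is that every output row reuses the same 2^8 pre-computed partial sums per chunk, so redundant multiply-adds are replaced by indexed loads.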
{"title":"BiQGEMM: Matrix Multiplication with Lookup Table for Binary-Coding-Based Quantized DNNs","authors":"Yongkweon Jeon, Baeseong Park, Se Jung Kwon, Byeongwook Kim, Jeongin Yun, Dongsoo Lee","doi":"10.1109/SC41405.2020.00099","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00099","url":null,"abstract":"The number of parameters in deep neural networks (DNNs) is rapidly increasing to support complicated tasks and to improve model accuracy. Correspondingly, the amount of computations and required memory footprint increase as well. Quantization is an efficient method to address such concerns by compressing DNNs such that computations can be simplified while required storage footprint is significantly reduced. Unfortunately, commercial CPUs and GPUs do not fully support quantization because only fixed data transfers (such as 32 bits) are allowed. As a result, even if weights are quantized (by a non-uniform quantization scheme) into a few bits, CPUs and GPUs may not access multiple quantized weights without memory bandwidth waste. Success of quantization in practice, hence, relies on an efficient computation engine design, especially for matrix multiplication that is a basic computation engine in most DNNs. In this paper, we propose a novel matrix multiplication method, called BiQGEMM, dedicated to quantized DNNs. BiQGEMM can access multiple quantized weights simultaneously in one instruction. In addition, BiQGEMM pre-computes intermediate results that are highly redundant when quantization leads to limited available computation space. Since pre-computed values are stored in lookup tables and reused, BiQGEMM achieves lower amount of overall computations. Our extensive experimental results show that BiQGEMM presents higher performance than conventional schemes when DNNs are quantized.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116641822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimizing Deep Learning Recommender Systems Training on CPU Cluster Architectures
Pub Date : 2020-05-10  DOI: 10.1109/SC41405.2020.00047
Dhiraj D. Kalamkar, E. Georganas, S. Srinivasan, Jianping Chen, Mikhail Shiryaev, A. Heinecke
During the last two years, the goal of many researchers has been to squeeze the last bit of performance out of HPC systems for AI tasks. Often this discussion is held in the context of how fast ResNet50 can be trained. Unfortunately, ResNet50 is no longer a representative workload in 2020. Thus, we focus on recommender systems, which account for most of the AI cycles in cloud computing centers. More specifically, we focus on Facebook’s DLRM benchmark. By enabling it to run on the latest CPU hardware and software tailored for HPC, we achieve up to a two-orders-of-magnitude performance improvement on a single socket compared to the reference CPU implementation, and high scaling efficiency up to 64 sockets, while fitting ultra-large datasets that cannot be held in a single node’s memory. This paper therefore discusses and analyzes novel optimization and parallelization techniques for the various operators in DLRM. Several optimizations (e.g., tensor-contraction-accelerated MLPs, framework MPI progression, and BFLOAT16 training with up to 1.8× speed-up) are general and transferable to many other deep learning topologies.
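As a point of reference for the BFLOAT16 result mentioned above, the following is a small, generic NumPy illustration of the bfloat16 number format (FP32 with the mantissa truncated to 7 bits); the rounding helper is our own and is not the paper's optimized kernels.

```python
# Generic bfloat16 rounding emulation: round-to-nearest-even by truncating the
# FP32 mantissa to 7 bits. Illustrative only; not the paper's training kernels.
import numpy as np

def to_bfloat16(x):
    """Round an FP32 array to the nearest BFLOAT16 value (returned as FP32)."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    # add the round-to-nearest-even bias, then drop the low 16 mantissa bits
    rounded = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return rounded.view(np.float32)

w = np.float32([1.0009765625, 3.141592653589793, 65504.0])
print(to_bfloat16(w))   # bfloat16 keeps FP32's range but only ~8 bits of precision
```

Because bfloat16 keeps the full FP32 exponent range, weights and activations rarely need loss scaling, which is part of why it is attractive for CPU training.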
{"title":"Optimizing Deep Learning Recommender Systems Training on CPU Cluster Architectures","authors":"Dhiraj D. Kalamkar, E. Georganas, S. Srinivasan, Jianping Chen, Mikhail Shiryaev, A. Heinecke","doi":"10.1109/SC41405.2020.00047","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00047","url":null,"abstract":"During the last two years, the goal of many researchers has been to squeeze the last bit of performance out of HPC system for AI tasks. Often this discussion is held in the context of how fast ResNet50 can be trained. Unfortunately, ResNet50 is no longer a representative workload in 2020. Thus, we focus on Recommender Systems which account for most of the AI cycles in cloud computing centers. More specifically, we focus on Facebook’s DLRM benchmark. By enabling it to run on latest CPU hardware and software tailored for HPC, we are able to achieve up to two-orders of magnitude improvement in performance on a single socket compared to the reference CPU implementation, and high scaling efficiency up to 64 sockets, while fitting ultra-large datasets which cannot be held in single node’s memory. Therefore, this paper discusses and analyzes novel optimization and parallelization techniques for the various operators in DLRM. Several optimizations (e.g. tensor-contraction accelerated MLPs, framework MPI progression, BFLOAT16 training with up to $1.8 times $ speed-up) are general and transferable to many other deep learning topologies.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"104 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115162403","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reducing Communication in Graph Neural Network Training
Pub Date : 2020-05-07  DOI: 10.1109/SC41405.2020.00074
Alok Tripathy, K. Yelick, A. Buluç
Graph Neural Networks (GNNs) are powerful and flexible neural networks that use the naturally sparse connectivity information of the data. GNNs represent this connectivity as sparse matrices, which have lower arithmetic intensity and thus higher communication costs compared to dense matrices, making GNNs harder to scale to high concurrencies than convolutional or fully-connected neural networks. We introduce a family of parallel algorithms for training GNNs and show that they can asymptotically reduce communication compared to previous parallel GNN training methods. We implement these algorithms, which are based on 1D, 1.5D, 2D, and 3D sparse-dense matrix multiplication, using torch.distributed on GPU-equipped clusters. Our algorithms optimize communication across the full GNN training pipeline. We train GNNs on over a hundred GPUs on multiple datasets, including a protein network with over a billion edges.
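To illustrate the kind of communication/computation split these algorithms manage, below is a serial NumPy mock-up of a 1D row-partitioned sparse-dense multiply A·H (the aggregation step in GNN training): each simulated rank owns a block of adjacency rows, all-gathers the dense feature matrix, and multiplies locally. The loop over simulated ranks stands in for torch.distributed collectives; the sizes and names are illustrative, and the paper's 1.5D/2D/3D variants partition differently precisely to reduce this communication.

```python
# Serial mock-up of 1D-partitioned sparse-dense matrix multiplication (A @ H).
import numpy as np
from scipy.sparse import random as sparse_random

P = 4                                   # number of simulated ranks
n, f = 64, 16                           # graph nodes, feature width
A = sparse_random(n, n, density=0.05, format="csr", random_state=0)
H = np.random.default_rng(0).standard_normal((n, f))

# 1D partition: rank r owns a contiguous block of rows of A and of H.
rows = np.array_split(np.arange(n), P)
H_local = [H[r] for r in rows]

# Communication step (all-gather of H): every rank ends up with the full H.
H_full = np.vstack(H_local)             # stand-in for dist.all_gather

# Local computation: each rank multiplies its row block of A with H_full.
Z_local = [A[r] @ H_full for r in rows]
assert np.allclose(np.vstack(Z_local), A @ H)
```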
{"title":"Reducing Communication in Graph Neural Network Training","authors":"Alok Tripathy, K. Yelick, A. Buluç","doi":"10.1109/SC41405.2020.00074","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00074","url":null,"abstract":"Graph Neural Networks (GNNs) are powerful and flexible neural networks that use the naturally sparse connectivity information of the data. GNNs represent this connectivity as sparse matrices, which have lower arithmetic intensity and thus higher communication costs compared to dense matrices, making GNNs harder to scale to high concurrencies than convolutional or fully-connected neural networks. We introduce a family of parallel algorithms for training GNNs and show that they can asymptotically reduce communication compared to previous parallel GNN training methods. We implement these algorithms, which are based on 1D, 1. 5D, 2D, and 3D sparse-dense matrix multiplication, using torch.distributed on GPU-equipped clusters. Our algorithms optimize communication across the full GNN training pipeline. We train GNNs on over a hundred GPUs on multiple datasets, including a protein network with over a billion edges.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116858093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pushing the Limit of Molecular Dynamics with Ab Initio Accuracy to 100 Million Atoms with Machine Learning
Pub Date : 2020-05-01  DOI: 10.1109/SC41405.2020.00009
Weile Jia, Han Wang, Mohan Chen, Denghui Lu, Jiduan Liu, Lin Lin, R. Car, E. Weinan, Linfeng Zhang
For 35 years, ab initio molecular dynamics (AIMD) has been the method of choice for modeling complex atomistic phenomena from first principles. However, most AIMD applications are limited by computational cost to systems with at most thousands of atoms. We report that a machine learning-based simulation protocol (Deep Potential Molecular Dynamics), while retaining ab initio accuracy, can simulate more than a 1-nanosecond trajectory of over 100 million atoms per day, using a highly optimized code (GPU DeePMD-kit) on the Summit supercomputer. Our code scales efficiently up to the entire Summit supercomputer, attaining 91 PFLOPS in double precision (45.5% of the peak) and 162/275 PFLOPS in mixed single/half precision. This work opens the door to simulations of unprecedented size and time scales with ab initio accuracy. It also poses new challenges to next-generation supercomputers for a better integration of machine learning and physical modeling.
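For context, the quoted efficiency implies a double-precision machine peak of roughly 200 PFLOPS; this follows directly from the abstract's own numbers (the peak value is inferred here, not independently sourced):

```latex
\[
  \text{peak}_{\mathrm{FP64}} \approx \frac{91\ \mathrm{PFLOPS}}{0.455}
  = 200\ \mathrm{PFLOPS}.
\]
```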
{"title":"Pushing the Limit of Molecular Dynamics with Ab Initio Accuracy to 100 Million Atoms with Machine Learning","authors":"Weile Jia, Han Wang, Mohan Chen, Denghui Lu, Jiduan Liu, Lin Lin, R. Car, E. Weinan, Linfeng Zhang","doi":"10.1109/SC41405.2020.00009","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00009","url":null,"abstract":"For 35 years, ab initio molecular dynamics (AIMD) has been the method of choice for modeling complex atomistic phenomena from first principles. However, most AIMD applications are limited by computational cost to systems with thousands of atoms at most. We report that a machine learningbased simulation protocol (Deep Potential Molecular Dynamics), while retaining ab initio accuracy, can simulate more than 1 nanosecond-long trajectory of over 100 million atoms per day, using a highly optimized code (GPU DeePMD-kit) on the Summit supercomputer. Our code can efficiently scale up to the entire Summit supercomputer, attaining 91 PFLOPS in double precision (45.5% of the peak) and 162/275 PFLOPS in mixed-single/half precision. The great accomplishment of this work is that it opens the door to simulating unprecedented size and time scales with ab initio accuracy. It also poses new challenges to the next-generation supercomputer for a better integration of machine learning and physical modeling.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121101448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MESHFREEFLOWNET: A Physics-Constrained Deep Continuous Space-Time Super-Resolution Framework
Pub Date : 2020-05-01  DOI: 10.1109/SC41405.2020.00013
C. Jiang, S. Esmaeilzadeh, K. Azizzadenesheli, K. Kashinath, Mustafa A. Mustafa, H. Tchelepi, P. Marcus, Prabhat, Anima Anandkumar
We propose MESHFREEFLOWNET, a novel deep learning-based super-resolution framework that generates continuous (grid-free) spatio-temporal solutions from low-resolution inputs. While being computationally efficient, MESHFREEFLOWNET accurately recovers the fine-scale quantities of interest. MESHFREEFLOWNET allows for: (i) the output to be sampled at any spatio-temporal resolution, (ii) a set of Partial Differential Equation (PDE) constraints to be imposed, and (iii) training on fixed-size inputs on arbitrarily sized spatio-temporal domains, owing to its fully convolutional encoder. We empirically study the performance of MESHFREEFLOWNET on the task of super-resolution of turbulent flows in the Rayleigh-Bénard convection problem. Across a diverse set of evaluation metrics, we show that MESHFREEFLOWNET significantly outperforms existing baselines. Furthermore, we provide a large-scale implementation of MESHFREEFLOWNET and show that it scales efficiently across large clusters, achieving 96.80% scaling efficiency on up to 128 GPUs and a training time of less than 4 minutes. We provide an open-source implementation of our method that supports arbitrary combinations of PDE constraints.
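As an illustration of what imposing PDE constraints on a continuous decoder can look like, here is a generic PyTorch sketch that combines a data-fitting term with a PDE residual evaluated by automatic differentiation at freely sampled space-time points. The simple heat-equation residual, the tiny network, and the loss weights are stand-ins of our own choosing, not the paper's Rayleigh-Bénard formulation or architecture.

```python
# Generic physics-constrained loss: data term + autograd PDE residual.
import torch

decoder = torch.nn.Sequential(       # continuous decoder: (t, x, y) -> u
    torch.nn.Linear(3, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))

def pde_residual(coords, nu=0.01):
    """Residual of u_t - nu * (u_xx + u_yy) at freely sampled space-time points."""
    coords = coords.requires_grad_(True)
    u = decoder(coords)
    grads = torch.autograd.grad(u.sum(), coords, create_graph=True)[0]
    u_t, u_x, u_y = grads[:, 0], grads[:, 1], grads[:, 2]
    u_xx = torch.autograd.grad(u_x.sum(), coords, create_graph=True)[0][:, 1]
    u_yy = torch.autograd.grad(u_y.sum(), coords, create_graph=True)[0][:, 2]
    return u_t - nu * (u_xx + u_yy)

hires_pts = torch.rand(128, 3)                  # where high-res labels exist
hires_vals = torch.rand(128, 1)                 # synthetic labels for the sketch
free_pts = torch.rand(512, 3)                   # arbitrary collocation points

data_loss = torch.mean((decoder(hires_pts) - hires_vals) ** 2)
pde_loss = torch.mean(pde_residual(free_pts) ** 2)
loss = data_loss + 0.1 * pde_loss               # weighted physics constraint
loss.backward()
```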
{"title":"MESHFREEFLOWNET: A Physics-Constrained Deep Continuous Space-Time Super-Resolution Framework","authors":"C. Jiang, S. Esmaeilzadeh, K. Azizzadenesheli, K. Kashinath, Mustafa A. Mustafa, H. Tchelepi, P. Marcus, Prabhat, Anima Anandkumar","doi":"10.1109/SC41405.2020.00013","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00013","url":null,"abstract":"We propose MESHFREEFLOWNET, a novel deep learning-based super-resolution framework to generate continuous (grid-free) spatio-temporal solutions from the lowresolution inputs. While being computationally efficient, MESHFREEFLOWNET accurately recovers the fine-scale quantities of interest. MESHFREEFLOWNET allows for: (i) the output to be sampled at all spatio-temporal resolutions, (ii) a set of Partial Differential Equation (PDE) constraints to be imposed, and (iii) training on fixed-size inputs on arbitrarily sized spatio-temporal domains owing to its fully convolutional encoder. We empirically study the performance of MESHFREEFLOWNET on the task of super-resolution of turbulent flows in the Rayleigh-Bénard convection problem. Across a diverse set of evaluation metrics, we show that MESHFREEFLOWNET significantly outperforms existing baselines. Furthermore, we provide a large scale implementation of MESHFREEFLOWNET and show that it efficiently scales across large clusters, achieving 96.80% scaling efficiency on up to 128 GPUs and a training time of less than 4 minutes. We provide an opensource implementation of our method that supports arbitrary combinations of PDE constraints1lsource code available:","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133307565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recurrent Neural Network Architecture Search for Geophysical Emulation
Pub Date : 2020-04-23  DOI: 10.1109/SC41405.2020.00012
R. Maulik, Romain Egele, Bethany Lusch, Prasanna Balaprakash
Developing surrogate geophysical models from data is a key research topic in atmospheric and oceanic modeling because of the large computational costs associated with numerical simulation methods. Researchers have started applying a wide range of machine learning models, in particular neural networks, to geophysical data for forecasting without these constraints. Constructing neural networks for forecasting such data is nontrivial, however, and often requires trial and error. To address these limitations, we focus on developing proper-orthogonal-decomposition-based long short-term memory networks (POD-LSTMs). We develop a scalable neural architecture search for generating stacked LSTMs to forecast temperature in the NOAA Optimum Interpolation Sea-Surface Temperature data set. Our approach identifies POD-LSTMs that are superior to manually designed variants and baseline time-series prediction methods. We also assess the scalability of different architecture search strategies on up to 512 Intel Knights Landing nodes of the Theta supercomputer at the Argonne Leadership Computing Facility.
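The POD-LSTM idea itself is compact: compress spatial snapshots onto a few proper-orthogonal-decomposition modes (a truncated SVD) and forecast the resulting modal coefficients with an LSTM. The sketch below shows that pipeline on synthetic data; the dimensions, mode count, and single fixed LSTM are illustrative and do not reproduce the paper's NOAA SST preprocessing or searched architectures.

```python
# POD compression + LSTM forecast of the retained modal coefficients.
import numpy as np
import torch

# Snapshot matrix: rows = time steps, columns = spatial grid points (synthetic).
T, G, K = 200, 1024, 5                       # time steps, grid size, POD modes
snapshots = np.random.default_rng(0).standard_normal((T, G)).astype(np.float32)

mean = snapshots.mean(axis=0)
U, S, Vt = np.linalg.svd(snapshots - mean, full_matrices=False)
modes = Vt[:K]                               # (K, G) spatial POD basis
coeffs = (snapshots - mean) @ modes.T        # (T, K) time series to forecast

# LSTM forecasts the next coefficient vector from a window of previous ones.
lstm = torch.nn.LSTM(input_size=K, hidden_size=32, batch_first=True)
head = torch.nn.Linear(32, K)
window = torch.from_numpy(coeffs[None, :64, :])       # (batch, seq, K)
hidden, _ = lstm(window)
next_coeffs = head(hidden[:, -1, :])                  # predicted coefficients
reconstruction = next_coeffs.detach().numpy() @ modes + mean  # back to the grid
```

The architecture search in the paper then varies the stacked-LSTM configuration that sits between `window` and `next_coeffs`.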
{"title":"Recurrent Neural Network Architecture Search for Geophysical Emulation","authors":"R. Maulik, Romain Egele, Bethany Lusch, Prasanna Balaprakash","doi":"10.1109/SC41405.2020.00012","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00012","url":null,"abstract":"Developing surrogate geophysical models from data is a key research topic in atmospheric and oceanic modeling because of the large computational costs associated with numerical simulation methods. Researchers have started applying a wide range of machine learning models, in particular neural networks, to geophysical data for forecasting without these constraints. Constructing neural networks for forecasting such data is nontrivial, however, and often requires trial and error. To address these limitations, we focus on developing proper-orthogonal-decomposition-based long short-term memory networks (PODLSTMs). We develop a scalable neural architecture search for generating stacked LSTMs to forecast temperature in the NOAA Optimum Interpolation Sea-Surface Temperature data set. Our approach identifies POD-LSTMs that are superior to manually designed variants and baseline time-series prediction methods. We also assess the scalability of different architecture search strategies on up to 512 Intel Knights Landing nodes of the Theta supercomputer at the Argonne Leadership Computing Facility.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125957037","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Submatrix-Based Method for Approximate Matrix Function Evaluation in the Quantum Chemistry Code CP2K
Pub Date : 2020-04-22  DOI: 10.1109/SC41405.2020.00084
Michael Lass, Robert Schade, T. Kuhne, Christian Plessl
Electronic structure calculations based on density-functional theory (DFT) represent a significant part of today’s HPC workloads and place high demands on high-performance computing resources. To perform these quantum-mechanical DFT calculations on complex large-scale systems, so-called linear-scaling methods are required instead of conventional cubic-scaling methods. In this work, we take up the idea of the submatrix method and apply it to the DFT computations in the software package CP2K. For that purpose, we transform the underlying numeric operations on distributed, large, sparse matrices into computations on local, much smaller and nearly dense matrices. This allows us to exploit the full floating-point performance of modern CPUs and to make use of dedicated accelerator hardware, where performance was previously limited by memory bandwidth. We demonstrate both the functionality and the performance of our implementation and show how it can be accelerated with GPUs and FPGAs.
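To give a feel for the submatrix idea, the toy NumPy/SciPy sketch below approximates a matrix function f(A) of a sparse symmetric A column by column: it builds the small dense principal submatrix induced by each column's nonzero pattern, evaluates f exactly on that block, and copies back only the corresponding column. The test matrix, the use of expm as f, and the serial loop are illustrative assumptions; CP2K's density-matrix kernels and the distributed implementation are not reproduced here.

```python
# Toy column-wise submatrix approximation of a matrix function f(A).
import numpy as np
from scipy import sparse
from scipy.linalg import expm

def submatrix_apply(A_sparse, f):
    """Approximate f(A) for sparse symmetric A with nonzero diagonal
    (so column i always contains row i)."""
    A_csc = A_sparse.tocsc()
    n = A_csc.shape[0]
    result = np.zeros((n, n))
    for i in range(n):
        idx = np.sort(A_csc[:, i].nonzero()[0])     # nonzero pattern of column i
        block = A_csc[idx, :][:, idx].toarray()     # small, nearly dense block
        f_block = f(block)                          # exact f on the block
        local_i = int(np.where(idx == i)[0][0])     # position of column i in block
        result[idx, i] = f_block[:, local_i]        # copy back only column i
    return result

A = sparse.random(40, 40, density=0.08, random_state=0)
A = (A + A.T) * 0.5 + sparse.eye(40)                # symmetric, nonzero diagonal
approx = submatrix_apply(A, expm)
exact = expm(A.toarray())
print("max abs error vs dense expm:", np.abs(approx - exact).max())
```

Each block is small and dense, so the heavy work becomes ordinary dense linear algebra, which is exactly what maps well onto CPU vector units, GPUs, and FPGAs.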
{"title":"A Submatrix-Based Method for Approximate Matrix Function Evaluation in the Quantum Chemistry Code CP2K","authors":"Michael Lass, Robert Schade, T. Kuhne, Christian Plessl","doi":"10.1109/SC41405.2020.00084","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00084","url":null,"abstract":"Electronic structure calculations based on density-functional theory (DFT) represent a significant part of today’s HPC workloads and pose high demands on high-performance computing resources. To perform these quantum-mechanical DFT calculations on complex large-scale systems, so-called linear scaling methods instead of conventional cubic scaling methods are required. In this work, we take up the idea of the submatrix method and apply it to the DFT computations in the software package CP2K. For that purpose, we transform the underlying numeric operations on distributed, large, sparse matrices into computations on local, much smaller and nearly dense matrices. This allows us to exploit the full floating-point performance of modern CPUs and to make use of dedicated accelerator hardware, where performance has been limited by memory bandwidth before. We demonstrate both functionality and performance of our implementation and show how it can be accelerated with GPUs and FPGAs.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116492266","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alias-Free, Matrix-Free, and Quadrature-Free Discontinuous Galerkin Algorithms for (Plasma) Kinetic Equations
Pub Date : 2020-04-20  DOI: 10.1109/SC41405.2020.00077
A. Hakim, J. Juno
Understanding fundamental kinetic processes is important for many problems, from plasma physics to gas dynamics. A first-principles approach to these problems requires a statistical description via the Boltzmann equation, coupled to appropriate field equations. In this paper we present a novel version of the discontinuous Galerkin (DG) algorithm to solve such kinetic equations. Unlike Monte Carlo methods, we use a continuum scheme in which we directly discretize the 6D phase space using discontinuous basis functions. Our DG scheme eliminates counting noise and aliasing errors that would otherwise contaminate the delicate field-particle interactions. We use modal basis functions with reduced degrees of freedom to improve efficiency while retaining a high formal order of convergence. Our implementation incorporates a number of software innovations: use of a JIT-compiled top-level language, automatically generated computational kernels, and a sophisticated shared-memory MPI implementation to handle velocity-space parallelization.
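As a small, purely pedagogical illustration of the modal representation such DG schemes use, the snippet below stores a function on a 1D reference cell as a handful of Legendre coefficients obtained by L2 projection. The quadrature here is only for the projection; the actual scheme described above is quadrature-free and works in 6D phase space, and nothing below comes from the paper's implementation.

```python
# Modal (Legendre) representation of a function on the reference cell [-1, 1].
import numpy as np
from numpy.polynomial import legendre

ORDER = 3                                    # polynomial order on the cell
nodes, weights = legendre.leggauss(ORDER + 1)

def project_to_modal(f):
    """Return Legendre coefficients c_k of f on [-1, 1]."""
    coeffs = []
    for k in range(ORDER + 1):
        P_k = legendre.Legendre.basis(k)(nodes)
        norm = 2.0 / (2 * k + 1)             # integral of P_k^2 over [-1, 1]
        coeffs.append(np.sum(weights * f(nodes) * P_k) / norm)
    return np.array(coeffs)

def eval_modal(coeffs, x):
    """Evaluate the modal expansion sum_k c_k P_k(x)."""
    return legendre.Legendre(coeffs)(x)

c = project_to_modal(np.sin)
x = np.linspace(-1, 1, 5)
print(np.max(np.abs(eval_modal(c, x) - np.sin(x))))   # small projection error
```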
{"title":"Alias-Free, Matrix-Free, and Quadrature-Free Discontinuous Galerkin Algorithms for (Plasma) Kinetic Equations","authors":"A. Hakim, J. Juno","doi":"10.1109/SC41405.2020.00077","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00077","url":null,"abstract":"Understanding fundamental kinetic processes is important for many problems, from plasma physics to gas dynamics. A first-principles approach to these problems requires a statistical description via the Boltzmann equation, coupled to appropriate field equations. In this paper we present a novel version of the discontinuous Galerkin (DG) algorithm to solve such kinetic equations. Unlike Monte-Carlo methods, we use a continuum scheme in which we directly discretize the 6D phase-space using discontinuous basis functions. Our DG scheme eliminates counting noise and aliasing errors that would otherwise contaminate the delicate field-particle interactions. We use modal basis functions with reduced degrees of freedom to improve efficiency while retaining a high formal order of convergence. Our implementation incorporates a number of software innovations: use of JIT compiled top-level language, automatically generated computational kernels and a sophisticated shared-memory MPI implementation to handle velocity space parallelization.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128973001","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ZeRO: Memory optimizations Toward Training Trillion Parameter Models
Pub Date : 2019-10-04  DOI: 10.1109/SC41405.2020.00024
Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He
Large deep learning models offer significant accuracy gains, but training billions to trillions of parameters is challenging. Existing solutions such as data and model parallelism exhibit fundamental limitations in fitting these models into limited device memory while retaining computation, communication, and development efficiency. We develop a novel solution, the Zero Redundancy Optimizer (ZeRO), to optimize memory, vastly improving training speed while increasing the model size that can be efficiently trained. ZeRO eliminates memory redundancies in data- and model-parallel training while retaining low communication volume and high computational granularity, allowing us to scale the model size in proportion to the number of devices with sustained high efficiency. Our analysis of memory requirements and communication volume demonstrates that ZeRO has the potential to scale beyond 1 trillion parameters using today’s hardware. We implement and evaluate ZeRO: it trains large models of over 100B parameters with super-linear speedup on 400 GPUs, achieving a throughput of 15 petaflops. This represents an 8x increase in model size and a 10x increase in achievable performance over the state of the art. In terms of usability, ZeRO can train large models of up to 13B parameters (e.g., larger than Megatron GPT 8.3B and T5 11B) without requiring model parallelism, which is harder for scientists to apply. Last but not least, researchers have used the system breakthroughs of ZeRO to create Turing-NLG, the world’s largest language model at the time (17B parameters), with record-breaking accuracy.
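The memory arithmetic behind optimizer-state partitioning is easy to sketch. The helper below models per-GPU model-state memory for mixed-precision Adam training, assuming the commonly used per-parameter budget of 2 bytes of fp16 weights, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer states; these constants and the function are back-of-the-envelope assumptions of ours, not figures reproduced from the paper.

```python
# Rough per-GPU model-state memory with and without optimizer-state partitioning.
def per_gpu_memory_gb(num_params, data_parallel_degree, partition_states=True):
    """Model-state memory per GPU, in GB, under a mixed-precision Adam budget."""
    fp16_model = 4 * num_params                      # fp16 weights + fp16 grads
    optimizer_states = 12 * num_params               # fp32 copy, momentum, variance
    if partition_states:
        optimizer_states /= data_parallel_degree     # ZeRO-style partitioning
    return (fp16_model + optimizer_states) / 1e9

for n in (1.5e9, 100e9):
    print(f"{n / 1e9:>6.1f}B params: "
          f"baseline {per_gpu_memory_gb(n, 64, False):8.1f} GB/GPU, "
          f"states partitioned over 64 GPUs {per_gpu_memory_gb(n, 64):8.1f} GB/GPU")
```

Even this crude model shows why replicating optimizer states on every data-parallel rank dominates memory at scale, and why partitioning them (and, in further stages, gradients and parameters) lets the trainable model size grow with the device count.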
{"title":"ZeRO: Memory optimizations Toward Training Trillion Parameter Models","authors":"Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He","doi":"10.1109/SC41405.2020.00024","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00024","url":null,"abstract":"Large deep learning models offer significant accuracy gains, but training billions to trillions of parameters is challenging. Existing solutions such as data and model parallelisms exhibit fundamental limitations to fit these models into limited device memory, while obtaining computation, communication and development efficiency. We develop a novel solution, Zero Redundancy Optimizer (ZeRO), to optimize memory, vastly improving training speed while increasing the model size that can be efficiently trained. ZeRO eliminates memory redundancies in data- and model-parallel training while retaining low communication volume and high computational granularity, allowing us to scale the model size proportional to the number of devices with sustained high efficiency. Our analysis on memory requirements and communication volume demonstrates: ZeRO has the potential to scale beyond 1 Trillion parameters using today’s hardware. We implement and evaluate ZeRO: it trains large models of over 100B parameter with super-linear speedup on 400 GPUs, achieving throughput of 15 Petaflops. This represents an 8x increase in model size and 10x increase in achievable performance over state-of-the-art. In terms of usability, ZeRO can train large models of up to 13B parameters (e.g., larger than Megatron GPT 8. 3B and T5 11B) without requiring model parallelism which is harder for scientists to apply. Last but not the least, researchers have used the system breakthroughs of ZeRO to create Turing-NLG, the world’s largest language model at the time (17B parameters) with record breaking accuracy.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"125 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114219361","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Task Bench: A Parameterized Benchmark for Evaluating Parallel Runtime Performance
Pub Date : 2019-08-15  DOI: 10.1109/SC41405.2020.00066
Elliott Slaughter, Wei Wu, Yuankun Fu, Legend Brandenburg, N. Garcia, Wilhem Kautz, Emily Marx, Kaleb S. Morris, Wonchan Lee, Qinglei Cao, G. Bosilca, S. Mirchandaney, Sean Treichler, P. McCormick, A. Aiken
We present Task Bench, a parameterized benchmark designed to explore the performance of distributed programming systems under a variety of application scenarios. Task Bench dramatically lowers the barrier to benchmarking and comparing multiple programming systems by making the implementation for a given system orthogonal to the benchmarks themselves: every benchmark constructed with Task Bench runs on every Task Bench implementation. Furthermore, Task Bench’s parameterization enables a wide variety of benchmark scenarios that distill the key characteristics of larger applications. To assess the effectiveness and overheads of the tested systems, we introduce a novel metric, minimum effective task granularity (METG). We conduct a comprehensive study with 15 programming systems on up to 256 Haswell nodes of the Cori supercomputer. Running at scale, 100 µs-long tasks are the finest granularity that any system runs efficiently with current technologies. We also study each system’s scalability and its ability to hide communication and mitigate load imbalance.
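To make the METG-style measurement concrete, here is a small sketch of reading such a metric off efficiency-versus-granularity data: find the smallest task granularity at which a system still reaches a target efficiency (50% here). The sample numbers are made up, and the linear interpolation is our own choice rather than necessarily Task Bench's exact procedure.

```python
# Read a minimum-effective-task-granularity style metric off measured data.
import numpy as np

def metg(task_granularity_s, efficiency, target=0.5):
    """Smallest task granularity (seconds) whose efficiency reaches `target`."""
    order = np.argsort(task_granularity_s)
    g = np.asarray(task_granularity_s, dtype=float)[order]
    e = np.asarray(efficiency, dtype=float)[order]
    above = np.nonzero(e >= target)[0]
    if len(above) == 0:
        return float("inf")                    # never reaches the target
    i = above[0]
    if i == 0:
        return g[0]
    # linear interpolation between the last point below and first point above
    frac = (target - e[i - 1]) / (e[i] - e[i - 1])
    return g[i - 1] + frac * (g[i] - g[i - 1])

granularity = [1e-5, 3e-5, 1e-4, 3e-4, 1e-3]   # seconds per task (synthetic)
efficiency = [0.05, 0.20, 0.55, 0.85, 0.97]    # fraction of ideal throughput
print(f"METG(50%) ~ {metg(granularity, efficiency) * 1e6:.0f} us")
```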
{"title":"Task Bench: A Parameterized Benchmark for Evaluating Parallel Runtime Performance","authors":"Elliott Slaughter, Wei Wu, Yuankun Fu, Legend Brandenburg, N. Garcia, Wilhem Kautz, Emily Marx, Kaleb S. Morris, Wonchan Lee, Qinglei Cao, G. Bosilca, S. Mirchandaney, Sean Treichler, P. McCormick, A. Aiken","doi":"10.1109/SC41405.2020.00066","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00066","url":null,"abstract":"We present Task Bench, a parameterized benchmark designed to explore the performance of distributed programming systems under a variety of application scenarios. Task Bench dramatically lowers the barrier to benchmarking and comparing multiple programming systems by making the implementation for a given system orthogonal to the benchmarks themselves: every benchmark constructed with Task Bench runs on every Task Bench implementation. Furthermore, Task Bench’s parameterization enables a wide variety of benchmark scenarios that distill the key characteristics of larger applications.To assess the effectiveness and overheads of the tested systems, we introduce a novel metric, minimum effective task granularity (METG). We conduct a comprehensive study with 15 programming systems on up to 256 Haswell nodes of the Cori supercomputer. Running at scale, 100$mu$s-long tasks are the finest granularity that any system runs efficiently with current technologies. We also study each system’s scalability, ability to hide communication and mitigate load imbalance.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128512676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}