BiQGEMM: Matrix Multiplication with Lookup Table for Binary-Coding-Based Quantized DNNs
Pub Date : 2020-05-20  DOI: 10.1109/SC41405.2020.00099
Yongkweon Jeon, Baeseong Park, Se Jung Kwon, Byeongwook Kim, Jeongin Yun, Dongsoo Lee
The number of parameters in deep neural networks (DNNs) is rapidly increasing to support complicated tasks and to improve model accuracy. Correspondingly, the amount of computation and the required memory footprint increase as well. Quantization is an efficient way to address these concerns: it compresses DNNs so that computations are simplified while the required storage footprint is significantly reduced. Unfortunately, commercial CPUs and GPUs do not fully support quantization because only fixed-width data transfers (such as 32 bits) are allowed. As a result, even if weights are quantized into a few bits (by a non-uniform quantization scheme), CPUs and GPUs cannot access multiple quantized weights without wasting memory bandwidth. The practical success of quantization therefore relies on an efficient computation engine, especially for matrix multiplication, the basic computational kernel of most DNNs. In this paper, we propose a novel matrix multiplication method, called BiQGEMM, dedicated to quantized DNNs. BiQGEMM accesses multiple quantized weights simultaneously in a single instruction. In addition, BiQGEMM pre-computes intermediate results that are highly redundant when quantization restricts the space of possible values. Since the pre-computed values are stored in lookup tables and reused, BiQGEMM reduces the overall amount of computation. Our extensive experimental results show that BiQGEMM delivers higher performance than conventional schemes when DNNs are quantized.
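To make the lookup-table idea concrete, here is a minimal NumPy sketch of one way such a kernel can work, assuming 1-bit binary-coding quantization (W ≈ αB with B ∈ {−1, +1}) and a sub-vector width of 8; the chunk width, packing format, and all names are illustrative assumptions, not details taken from the paper.

```python
# Minimal lookup-table GEMV sketch in the spirit of BiQGEMM (assumptions:
# 1-bit binary coding W ~ alpha * B with B in {-1,+1}, sub-vector width MU = 8).
import numpy as np

MU = 8  # sub-vector length: one LUT with 2**MU entries per activation chunk

def build_lut(x_chunk):
    """Pre-compute all 2**MU signed partial sums of an activation chunk."""
    keys = np.arange(2 ** MU, dtype=np.uint32)
    # bit j of the key selects +x[j] (bit = 0) or -x[j] (bit = 1)
    signs = 1.0 - 2.0 * ((keys[:, None] >> np.arange(MU)) & 1)
    return signs @ x_chunk                        # shape: (2**MU,)

def biqgemm_1bit(packed_bits, alpha, x):
    """y = alpha * (B @ x), with B's {-1,+1} rows stored as packed bit keys.
    packed_bits: (out_features, num_chunks) uint32 keys, one per MU-wide chunk."""
    num_chunks = x.size // MU
    y = np.zeros(packed_bits.shape[0])
    for c in range(num_chunks):
        lut = build_lut(x[c * MU:(c + 1) * MU])   # shared by every output row
        y += lut[packed_bits[:, c]]               # one table lookup per row/chunk
    return alpha * y

# Tiny round-trip check against a dense reference.
rng = np.random.default_rng(0)
out_f, in_f = 16, 64
B = rng.choice([-1.0, 1.0], size=(out_f, in_f))
alpha, x = 0.05, rng.standard_normal(in_f)
bits = ((B.reshape(out_f, -1, MU) < 0) * (1 << np.arange(MU))).sum(-1).astype(np.uint32)
assert np.allclose(biqgemm_1bit(bits, alpha, x), alpha * (B @ x))
```

The point of the table is that every output row reuses the same 2^8 pre-computed partial sums per chunk, so redundant multiply-adds are replaced by indexed loads.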
{"title":"BiQGEMM: Matrix Multiplication with Lookup Table for Binary-Coding-Based Quantized DNNs","authors":"Yongkweon Jeon, Baeseong Park, Se Jung Kwon, Byeongwook Kim, Jeongin Yun, Dongsoo Lee","doi":"10.1109/SC41405.2020.00099","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00099","url":null,"abstract":"The number of parameters in deep neural networks (DNNs) is rapidly increasing to support complicated tasks and to improve model accuracy. Correspondingly, the amount of computations and required memory footprint increase as well. Quantization is an efficient method to address such concerns by compressing DNNs such that computations can be simplified while required storage footprint is significantly reduced. Unfortunately, commercial CPUs and GPUs do not fully support quantization because only fixed data transfers (such as 32 bits) are allowed. As a result, even if weights are quantized (by a non-uniform quantization scheme) into a few bits, CPUs and GPUs may not access multiple quantized weights without memory bandwidth waste. Success of quantization in practice, hence, relies on an efficient computation engine design, especially for matrix multiplication that is a basic computation engine in most DNNs. In this paper, we propose a novel matrix multiplication method, called BiQGEMM, dedicated to quantized DNNs. BiQGEMM can access multiple quantized weights simultaneously in one instruction. In addition, BiQGEMM pre-computes intermediate results that are highly redundant when quantization leads to limited available computation space. Since pre-computed values are stored in lookup tables and reused, BiQGEMM achieves lower amount of overall computations. Our extensive experimental results show that BiQGEMM presents higher performance than conventional schemes when DNNs are quantized.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116641822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimizing Deep Learning Recommender Systems Training on CPU Cluster Architectures
Pub Date : 2020-05-10  DOI: 10.1109/SC41405.2020.00047
Dhiraj D. Kalamkar, E. Georganas, S. Srinivasan, Jianping Chen, Mikhail Shiryaev, A. Heinecke
During the last two years, the goal of many researchers has been to squeeze the last bit of performance out of HPC systems for AI tasks. Often this discussion is held in the context of how fast ResNet50 can be trained. Unfortunately, ResNet50 is no longer a representative workload in 2020. Thus, we focus on recommender systems, which account for most of the AI cycles in cloud computing centers. More specifically, we focus on Facebook’s DLRM benchmark. By enabling it to run on the latest CPU hardware and software tailored for HPC, we achieve up to a two-orders-of-magnitude performance improvement on a single socket compared to the reference CPU implementation, and high scaling efficiency up to 64 sockets, while fitting ultra-large datasets that cannot be held in a single node’s memory. This paper therefore discusses and analyzes novel optimization and parallelization techniques for the various operators in DLRM. Several optimizations (e.g., tensor-contraction-accelerated MLPs, framework MPI progression, and BFLOAT16 training with up to 1.8× speed-up) are general and transferable to many other deep learning topologies.
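As a point of reference for the BFLOAT16 result mentioned above, the following is a small, generic NumPy illustration of the bfloat16 number format (FP32 with the mantissa truncated to 7 bits); the rounding helper is our own and is not the paper's optimized kernels.

```python
# Generic bfloat16 rounding emulation: round-to-nearest-even by truncating the
# FP32 mantissa to 7 bits. Illustrative only; not the paper's training kernels.
import numpy as np

def to_bfloat16(x):
    """Round an FP32 array to the nearest BFLOAT16 value (returned as FP32)."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    # add the round-to-nearest-even bias, then drop the low 16 mantissa bits
    rounded = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return rounded.view(np.float32)

w = np.float32([1.0009765625, 3.141592653589793, 65504.0])
print(to_bfloat16(w))   # bfloat16 keeps FP32's range but only ~8 bits of precision
```

Because bfloat16 keeps the full FP32 exponent range, weights and activations rarely need loss scaling, which is part of why it is attractive for CPU training.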
{"title":"Optimizing Deep Learning Recommender Systems Training on CPU Cluster Architectures","authors":"Dhiraj D. Kalamkar, E. Georganas, S. Srinivasan, Jianping Chen, Mikhail Shiryaev, A. Heinecke","doi":"10.1109/SC41405.2020.00047","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00047","url":null,"abstract":"During the last two years, the goal of many researchers has been to squeeze the last bit of performance out of HPC system for AI tasks. Often this discussion is held in the context of how fast ResNet50 can be trained. Unfortunately, ResNet50 is no longer a representative workload in 2020. Thus, we focus on Recommender Systems which account for most of the AI cycles in cloud computing centers. More specifically, we focus on Facebook’s DLRM benchmark. By enabling it to run on latest CPU hardware and software tailored for HPC, we are able to achieve up to two-orders of magnitude improvement in performance on a single socket compared to the reference CPU implementation, and high scaling efficiency up to 64 sockets, while fitting ultra-large datasets which cannot be held in single node’s memory. Therefore, this paper discusses and analyzes novel optimization and parallelization techniques for the various operators in DLRM. Several optimizations (e.g. tensor-contraction accelerated MLPs, framework MPI progression, BFLOAT16 training with up to $1.8 times $ speed-up) are general and transferable to many other deep learning topologies.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"104 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115162403","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reducing Communication in Graph Neural Network Training
Pub Date : 2020-05-07  DOI: 10.1109/SC41405.2020.00074
Alok Tripathy, K. Yelick, A. Buluç
Graph Neural Networks (GNNs) are powerful and flexible neural networks that use the naturally sparse connectivity information of the data. GNNs represent this connectivity as sparse matrices, which have lower arithmetic intensity and thus higher communication costs compared to dense matrices, making GNNs harder to scale to high concurrencies than convolutional or fully-connected neural networks. We introduce a family of parallel algorithms for training GNNs and show that they can asymptotically reduce communication compared to previous parallel GNN training methods. We implement these algorithms, which are based on 1D, 1.5D, 2D, and 3D sparse-dense matrix multiplication, using torch.distributed on GPU-equipped clusters. Our algorithms optimize communication across the full GNN training pipeline. We train GNNs on over a hundred GPUs on multiple datasets, including a protein network with over a billion edges.
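To illustrate the kind of communication/computation split these algorithms manage, below is a serial NumPy mock-up of a 1D row-partitioned sparse-dense multiply A·H (the aggregation step in GNN training): each simulated rank owns a block of adjacency rows, all-gathers the dense feature matrix, and multiplies locally. The loop over simulated ranks stands in for torch.distributed collectives; the sizes and names are illustrative, and the paper's 1.5D/2D/3D variants partition differently precisely to reduce this communication.

```python
# Serial mock-up of 1D-partitioned sparse-dense matrix multiplication (A @ H).
import numpy as np
from scipy.sparse import random as sparse_random

P = 4                                   # number of simulated ranks
n, f = 64, 16                           # graph nodes, feature width
A = sparse_random(n, n, density=0.05, format="csr", random_state=0)
H = np.random.default_rng(0).standard_normal((n, f))

# 1D partition: rank r owns a contiguous block of rows of A and of H.
rows = np.array_split(np.arange(n), P)
H_local = [H[r] for r in rows]

# Communication step (all-gather of H): every rank ends up with the full H.
H_full = np.vstack(H_local)             # stand-in for dist.all_gather

# Local computation: each rank multiplies its row block of A with H_full.
Z_local = [A[r] @ H_full for r in rows]
assert np.allclose(np.vstack(Z_local), A @ H)
```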
{"title":"Reducing Communication in Graph Neural Network Training","authors":"Alok Tripathy, K. Yelick, A. Buluç","doi":"10.1109/SC41405.2020.00074","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00074","url":null,"abstract":"Graph Neural Networks (GNNs) are powerful and flexible neural networks that use the naturally sparse connectivity information of the data. GNNs represent this connectivity as sparse matrices, which have lower arithmetic intensity and thus higher communication costs compared to dense matrices, making GNNs harder to scale to high concurrencies than convolutional or fully-connected neural networks. We introduce a family of parallel algorithms for training GNNs and show that they can asymptotically reduce communication compared to previous parallel GNN training methods. We implement these algorithms, which are based on 1D, 1. 5D, 2D, and 3D sparse-dense matrix multiplication, using torch.distributed on GPU-equipped clusters. Our algorithms optimize communication across the full GNN training pipeline. We train GNNs on over a hundred GPUs on multiple datasets, including a protein network with over a billion edges.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116858093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pushing the Limit of Molecular Dynamics with Ab Initio Accuracy to 100 Million Atoms with Machine Learning
Pub Date : 2020-05-01  DOI: 10.1109/SC41405.2020.00009
Weile Jia, Han Wang, Mohan Chen, Denghui Lu, Jiduan Liu, Lin Lin, R. Car, E. Weinan, Linfeng Zhang
For 35 years, ab initio molecular dynamics (AIMD) has been the method of choice for modeling complex atomistic phenomena from first principles. However, most AIMD applications are limited by computational cost to systems with at most thousands of atoms. We report that a machine learning-based simulation protocol (Deep Potential Molecular Dynamics), while retaining ab initio accuracy, can simulate more than a 1-nanosecond trajectory of over 100 million atoms per day, using a highly optimized code (GPU DeePMD-kit) on the Summit supercomputer. Our code scales efficiently up to the entire Summit supercomputer, attaining 91 PFLOPS in double precision (45.5% of the peak) and 162/275 PFLOPS in mixed single/half precision. This work opens the door to simulations of unprecedented size and time scales with ab initio accuracy. It also poses new challenges to next-generation supercomputers for a better integration of machine learning and physical modeling.
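For context, the quoted efficiency implies a double-precision machine peak of roughly 200 PFLOPS; this follows directly from the abstract's own numbers (the peak value is inferred here, not independently sourced):

```latex
\[
  \text{peak}_{\mathrm{FP64}} \approx \frac{91\ \mathrm{PFLOPS}}{0.455}
  = 200\ \mathrm{PFLOPS}.
\]
```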
{"title":"Pushing the Limit of Molecular Dynamics with Ab Initio Accuracy to 100 Million Atoms with Machine Learning","authors":"Weile Jia, Han Wang, Mohan Chen, Denghui Lu, Jiduan Liu, Lin Lin, R. Car, E. Weinan, Linfeng Zhang","doi":"10.1109/SC41405.2020.00009","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00009","url":null,"abstract":"For 35 years, ab initio molecular dynamics (AIMD) has been the method of choice for modeling complex atomistic phenomena from first principles. However, most AIMD applications are limited by computational cost to systems with thousands of atoms at most. We report that a machine learningbased simulation protocol (Deep Potential Molecular Dynamics), while retaining ab initio accuracy, can simulate more than 1 nanosecond-long trajectory of over 100 million atoms per day, using a highly optimized code (GPU DeePMD-kit) on the Summit supercomputer. Our code can efficiently scale up to the entire Summit supercomputer, attaining 91 PFLOPS in double precision (45.5% of the peak) and 162/275 PFLOPS in mixed-single/half precision. The great accomplishment of this work is that it opens the door to simulating unprecedented size and time scales with ab initio accuracy. It also poses new challenges to the next-generation supercomputer for a better integration of machine learning and physical modeling.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121101448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MESHFREEFLOWNET: A Physics-Constrained Deep Continuous Space-Time Super-Resolution Framework
Pub Date : 2020-05-01  DOI: 10.1109/SC41405.2020.00013
C. Jiang, S. Esmaeilzadeh, K. Azizzadenesheli, K. Kashinath, Mustafa A. Mustafa, H. Tchelepi, P. Marcus, Prabhat, Anima Anandkumar
We propose MESHFREEFLOWNET, a novel deep learning-based super-resolution framework that generates continuous (grid-free) spatio-temporal solutions from low-resolution inputs. While being computationally efficient, MESHFREEFLOWNET accurately recovers the fine-scale quantities of interest. MESHFREEFLOWNET allows for: (i) the output to be sampled at any spatio-temporal resolution, (ii) a set of Partial Differential Equation (PDE) constraints to be imposed, and (iii) training on fixed-size inputs on arbitrarily sized spatio-temporal domains, owing to its fully convolutional encoder. We empirically study the performance of MESHFREEFLOWNET on the task of super-resolution of turbulent flows in the Rayleigh-Bénard convection problem. Across a diverse set of evaluation metrics, we show that MESHFREEFLOWNET significantly outperforms existing baselines. Furthermore, we provide a large-scale implementation of MESHFREEFLOWNET and show that it scales efficiently across large clusters, achieving 96.80% scaling efficiency on up to 128 GPUs and a training time of less than 4 minutes. We provide an open-source implementation of our method that supports arbitrary combinations of PDE constraints.
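As an illustration of what imposing PDE constraints on a continuous decoder can look like, here is a generic PyTorch sketch that combines a data-fitting term with a PDE residual evaluated by automatic differentiation at freely sampled space-time points. The simple heat-equation residual, the tiny network, and the loss weights are stand-ins of our own choosing, not the paper's Rayleigh-Bénard formulation or architecture.

```python
# Generic physics-constrained loss: data term + autograd PDE residual.
import torch

decoder = torch.nn.Sequential(       # continuous decoder: (t, x, y) -> u
    torch.nn.Linear(3, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))

def pde_residual(coords, nu=0.01):
    """Residual of u_t - nu * (u_xx + u_yy) at freely sampled space-time points."""
    coords = coords.requires_grad_(True)
    u = decoder(coords)
    grads = torch.autograd.grad(u.sum(), coords, create_graph=True)[0]
    u_t, u_x, u_y = grads[:, 0], grads[:, 1], grads[:, 2]
    u_xx = torch.autograd.grad(u_x.sum(), coords, create_graph=True)[0][:, 1]
    u_yy = torch.autograd.grad(u_y.sum(), coords, create_graph=True)[0][:, 2]
    return u_t - nu * (u_xx + u_yy)

hires_pts = torch.rand(128, 3)                  # where high-res labels exist
hires_vals = torch.rand(128, 1)                 # synthetic labels for the sketch
free_pts = torch.rand(512, 3)                   # arbitrary collocation points

data_loss = torch.mean((decoder(hires_pts) - hires_vals) ** 2)
pde_loss = torch.mean(pde_residual(free_pts) ** 2)
loss = data_loss + 0.1 * pde_loss               # weighted physics constraint
loss.backward()
```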
{"title":"MESHFREEFLOWNET: A Physics-Constrained Deep Continuous Space-Time Super-Resolution Framework","authors":"C. Jiang, S. Esmaeilzadeh, K. Azizzadenesheli, K. Kashinath, Mustafa A. Mustafa, H. Tchelepi, P. Marcus, Prabhat, Anima Anandkumar","doi":"10.1109/SC41405.2020.00013","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00013","url":null,"abstract":"We propose MESHFREEFLOWNET, a novel deep learning-based super-resolution framework to generate continuous (grid-free) spatio-temporal solutions from the lowresolution inputs. While being computationally efficient, MESHFREEFLOWNET accurately recovers the fine-scale quantities of interest. MESHFREEFLOWNET allows for: (i) the output to be sampled at all spatio-temporal resolutions, (ii) a set of Partial Differential Equation (PDE) constraints to be imposed, and (iii) training on fixed-size inputs on arbitrarily sized spatio-temporal domains owing to its fully convolutional encoder. We empirically study the performance of MESHFREEFLOWNET on the task of super-resolution of turbulent flows in the Rayleigh-Bénard convection problem. Across a diverse set of evaluation metrics, we show that MESHFREEFLOWNET significantly outperforms existing baselines. Furthermore, we provide a large scale implementation of MESHFREEFLOWNET and show that it efficiently scales across large clusters, achieving 96.80% scaling efficiency on up to 128 GPUs and a training time of less than 4 minutes. We provide an opensource implementation of our method that supports arbitrary combinations of PDE constraints1lsource code available:","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133307565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recurrent Neural Network Architecture Search for Geophysical Emulation
Pub Date : 2020-04-23  DOI: 10.1109/SC41405.2020.00012
R. Maulik, Romain Egele, Bethany Lusch, Prasanna Balaprakash
Developing surrogate geophysical models from data is a key research topic in atmospheric and oceanic modeling because of the large computational costs associated with numerical simulation methods. Researchers have started applying a wide range of machine learning models, in particular neural networks, to geophysical data for forecasting without these constraints. Constructing neural networks for forecasting such data is nontrivial, however, and often requires trial and error. To address these limitations, we focus on developing proper-orthogonal-decomposition-based long short-term memory networks (POD-LSTMs). We develop a scalable neural architecture search for generating stacked LSTMs to forecast temperature in the NOAA Optimum Interpolation Sea-Surface Temperature data set. Our approach identifies POD-LSTMs that are superior to manually designed variants and baseline time-series prediction methods. We also assess the scalability of different architecture search strategies on up to 512 Intel Knights Landing nodes of the Theta supercomputer at the Argonne Leadership Computing Facility.
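The POD-LSTM idea itself is compact: compress spatial snapshots onto a few proper-orthogonal-decomposition modes (a truncated SVD) and forecast the resulting modal coefficients with an LSTM. The sketch below shows that pipeline on synthetic data; the dimensions, mode count, and single fixed LSTM are illustrative and do not reproduce the paper's NOAA SST preprocessing or searched architectures.

```python
# POD compression + LSTM forecast of the retained modal coefficients.
import numpy as np
import torch

# Snapshot matrix: rows = time steps, columns = spatial grid points (synthetic).
T, G, K = 200, 1024, 5                       # time steps, grid size, POD modes
snapshots = np.random.default_rng(0).standard_normal((T, G)).astype(np.float32)

mean = snapshots.mean(axis=0)
U, S, Vt = np.linalg.svd(snapshots - mean, full_matrices=False)
modes = Vt[:K]                               # (K, G) spatial POD basis
coeffs = (snapshots - mean) @ modes.T        # (T, K) time series to forecast

# LSTM forecasts the next coefficient vector from a window of previous ones.
lstm = torch.nn.LSTM(input_size=K, hidden_size=32, batch_first=True)
head = torch.nn.Linear(32, K)
window = torch.from_numpy(coeffs[None, :64, :])       # (batch, seq, K)
hidden, _ = lstm(window)
next_coeffs = head(hidden[:, -1, :])                  # predicted coefficients
reconstruction = next_coeffs.detach().numpy() @ modes + mean  # back to the grid
```

The architecture search in the paper then varies the stacked-LSTM configuration that sits between `window` and `next_coeffs`.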
{"title":"Recurrent Neural Network Architecture Search for Geophysical Emulation","authors":"R. Maulik, Romain Egele, Bethany Lusch, Prasanna Balaprakash","doi":"10.1109/SC41405.2020.00012","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00012","url":null,"abstract":"Developing surrogate geophysical models from data is a key research topic in atmospheric and oceanic modeling because of the large computational costs associated with numerical simulation methods. Researchers have started applying a wide range of machine learning models, in particular neural networks, to geophysical data for forecasting without these constraints. Constructing neural networks for forecasting such data is nontrivial, however, and often requires trial and error. To address these limitations, we focus on developing proper-orthogonal-decomposition-based long short-term memory networks (PODLSTMs). We develop a scalable neural architecture search for generating stacked LSTMs to forecast temperature in the NOAA Optimum Interpolation Sea-Surface Temperature data set. Our approach identifies POD-LSTMs that are superior to manually designed variants and baseline time-series prediction methods. We also assess the scalability of different architecture search strategies on up to 512 Intel Knights Landing nodes of the Theta supercomputer at the Argonne Leadership Computing Facility.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125957037","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Submatrix-Based Method for Approximate Matrix Function Evaluation in the Quantum Chemistry Code CP2K
Pub Date : 2020-04-22  DOI: 10.1109/SC41405.2020.00084
Michael Lass, Robert Schade, T. Kuhne, Christian Plessl
Electronic structure calculations based on density-functional theory (DFT) represent a significant part of today’s HPC workloads and place high demands on high-performance computing resources. To perform these quantum-mechanical DFT calculations on complex large-scale systems, so-called linear-scaling methods are required instead of conventional cubic-scaling methods. In this work, we take up the idea of the submatrix method and apply it to the DFT computations in the software package CP2K. For that purpose, we transform the underlying numeric operations on distributed, large, sparse matrices into computations on local, much smaller and nearly dense matrices. This allows us to exploit the full floating-point performance of modern CPUs and to make use of dedicated accelerator hardware, where performance was previously limited by memory bandwidth. We demonstrate both the functionality and the performance of our implementation and show how it can be accelerated with GPUs and FPGAs.
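To give a feel for the submatrix idea, the toy NumPy/SciPy sketch below approximates a matrix function f(A) of a sparse symmetric A column by column: it builds the small dense principal submatrix induced by each column's nonzero pattern, evaluates f exactly on that block, and copies back only the corresponding column. The test matrix, the use of expm as f, and the serial loop are illustrative assumptions; CP2K's density-matrix kernels and the distributed implementation are not reproduced here.

```python
# Toy column-wise submatrix approximation of a matrix function f(A).
import numpy as np
from scipy import sparse
from scipy.linalg import expm

def submatrix_apply(A_sparse, f):
    """Approximate f(A) for sparse symmetric A with nonzero diagonal
    (so column i always contains row i)."""
    A_csc = A_sparse.tocsc()
    n = A_csc.shape[0]
    result = np.zeros((n, n))
    for i in range(n):
        idx = np.sort(A_csc[:, i].nonzero()[0])     # nonzero pattern of column i
        block = A_csc[idx, :][:, idx].toarray()     # small, nearly dense block
        f_block = f(block)                          # exact f on the block
        local_i = int(np.where(idx == i)[0][0])     # position of column i in block
        result[idx, i] = f_block[:, local_i]        # copy back only column i
    return result

A = sparse.random(40, 40, density=0.08, random_state=0)
A = (A + A.T) * 0.5 + sparse.eye(40)                # symmetric, nonzero diagonal
approx = submatrix_apply(A, expm)
exact = expm(A.toarray())
print("max abs error vs dense expm:", np.abs(approx - exact).max())
```

Each block is small and dense, so the heavy work becomes ordinary dense linear algebra, which is exactly what maps well onto CPU vector units, GPUs, and FPGAs.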
{"title":"A Submatrix-Based Method for Approximate Matrix Function Evaluation in the Quantum Chemistry Code CP2K","authors":"Michael Lass, Robert Schade, T. Kuhne, Christian Plessl","doi":"10.1109/SC41405.2020.00084","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00084","url":null,"abstract":"Electronic structure calculations based on density-functional theory (DFT) represent a significant part of today’s HPC workloads and pose high demands on high-performance computing resources. To perform these quantum-mechanical DFT calculations on complex large-scale systems, so-called linear scaling methods instead of conventional cubic scaling methods are required. In this work, we take up the idea of the submatrix method and apply it to the DFT computations in the software package CP2K. For that purpose, we transform the underlying numeric operations on distributed, large, sparse matrices into computations on local, much smaller and nearly dense matrices. This allows us to exploit the full floating-point performance of modern CPUs and to make use of dedicated accelerator hardware, where performance has been limited by memory bandwidth before. We demonstrate both functionality and performance of our implementation and show how it can be accelerated with GPUs and FPGAs.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116492266","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alias-Free, Matrix-Free, and Quadrature-Free Discontinuous Galerkin Algorithms for (Plasma) Kinetic Equations
Pub Date : 2020-04-20  DOI: 10.1109/SC41405.2020.00077
A. Hakim, J. Juno
Understanding fundamental kinetic processes is important for many problems, from plasma physics to gas dynamics. A first-principles approach to these problems requires a statistical description via the Boltzmann equation, coupled to appropriate field equations. In this paper we present a novel version of the discontinuous Galerkin (DG) algorithm to solve such kinetic equations. Unlike Monte Carlo methods, we use a continuum scheme in which we directly discretize the 6D phase space using discontinuous basis functions. Our DG scheme eliminates counting noise and aliasing errors that would otherwise contaminate the delicate field-particle interactions. We use modal basis functions with reduced degrees of freedom to improve efficiency while retaining a high formal order of convergence. Our implementation incorporates a number of software innovations: use of a JIT-compiled top-level language, automatically generated computational kernels, and a sophisticated shared-memory MPI implementation to handle velocity-space parallelization.
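As a small, purely pedagogical illustration of the modal representation such DG schemes use, the snippet below stores a function on a 1D reference cell as a handful of Legendre coefficients obtained by L2 projection. The quadrature here is only for the projection; the actual scheme described above is quadrature-free and works in 6D phase space, and nothing below comes from the paper's implementation.

```python
# Modal (Legendre) representation of a function on the reference cell [-1, 1].
import numpy as np
from numpy.polynomial import legendre

ORDER = 3                                    # polynomial order on the cell
nodes, weights = legendre.leggauss(ORDER + 1)

def project_to_modal(f):
    """Return Legendre coefficients c_k of f on [-1, 1]."""
    coeffs = []
    for k in range(ORDER + 1):
        P_k = legendre.Legendre.basis(k)(nodes)
        norm = 2.0 / (2 * k + 1)             # integral of P_k^2 over [-1, 1]
        coeffs.append(np.sum(weights * f(nodes) * P_k) / norm)
    return np.array(coeffs)

def eval_modal(coeffs, x):
    """Evaluate the modal expansion sum_k c_k P_k(x)."""
    return legendre.Legendre(coeffs)(x)

c = project_to_modal(np.sin)
x = np.linspace(-1, 1, 5)
print(np.max(np.abs(eval_modal(c, x) - np.sin(x))))   # small projection error
```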
{"title":"Alias-Free, Matrix-Free, and Quadrature-Free Discontinuous Galerkin Algorithms for (Plasma) Kinetic Equations","authors":"A. Hakim, J. Juno","doi":"10.1109/SC41405.2020.00077","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00077","url":null,"abstract":"Understanding fundamental kinetic processes is important for many problems, from plasma physics to gas dynamics. A first-principles approach to these problems requires a statistical description via the Boltzmann equation, coupled to appropriate field equations. In this paper we present a novel version of the discontinuous Galerkin (DG) algorithm to solve such kinetic equations. Unlike Monte-Carlo methods, we use a continuum scheme in which we directly discretize the 6D phase-space using discontinuous basis functions. Our DG scheme eliminates counting noise and aliasing errors that would otherwise contaminate the delicate field-particle interactions. We use modal basis functions with reduced degrees of freedom to improve efficiency while retaining a high formal order of convergence. Our implementation incorporates a number of software innovations: use of JIT compiled top-level language, automatically generated computational kernels and a sophisticated shared-memory MPI implementation to handle velocity space parallelization.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128973001","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ZeRO: Memory optimizations Toward Training Trillion Parameter Models
Pub Date : 2019-10-04  DOI: 10.1109/SC41405.2020.00024
Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He
Large deep learning models offer significant accuracy gains, but training billions to trillions of parameters is challenging. Existing solutions such as data and model parallelism exhibit fundamental limitations in fitting these models into limited device memory while retaining computation, communication, and development efficiency. We develop a novel solution, the Zero Redundancy Optimizer (ZeRO), to optimize memory, vastly improving training speed while increasing the model size that can be efficiently trained. ZeRO eliminates memory redundancies in data- and model-parallel training while retaining low communication volume and high computational granularity, allowing us to scale the model size in proportion to the number of devices with sustained high efficiency. Our analysis of memory requirements and communication volume demonstrates that ZeRO has the potential to scale beyond 1 trillion parameters using today’s hardware. We implement and evaluate ZeRO: it trains large models of over 100B parameters with super-linear speedup on 400 GPUs, achieving a throughput of 15 petaflops. This represents an 8x increase in model size and a 10x increase in achievable performance over the state of the art. In terms of usability, ZeRO can train large models of up to 13B parameters (e.g., larger than Megatron GPT 8.3B and T5 11B) without requiring model parallelism, which is harder for scientists to apply. Last but not least, researchers have used the system breakthroughs of ZeRO to create Turing-NLG, the world’s largest language model at the time (17B parameters), with record-breaking accuracy.
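The memory arithmetic behind optimizer-state partitioning is easy to sketch. The helper below models per-GPU model-state memory for mixed-precision Adam training, assuming the commonly used per-parameter budget of 2 bytes of fp16 weights, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer states; these constants and the function are back-of-the-envelope assumptions of ours, not figures reproduced from the paper.

```python
# Rough per-GPU model-state memory with and without optimizer-state partitioning.
def per_gpu_memory_gb(num_params, data_parallel_degree, partition_states=True):
    """Model-state memory per GPU, in GB, under a mixed-precision Adam budget."""
    fp16_model = 4 * num_params                      # fp16 weights + fp16 grads
    optimizer_states = 12 * num_params               # fp32 copy, momentum, variance
    if partition_states:
        optimizer_states /= data_parallel_degree     # ZeRO-style partitioning
    return (fp16_model + optimizer_states) / 1e9

for n in (1.5e9, 100e9):
    print(f"{n / 1e9:>6.1f}B params: "
          f"baseline {per_gpu_memory_gb(n, 64, False):8.1f} GB/GPU, "
          f"states partitioned over 64 GPUs {per_gpu_memory_gb(n, 64):8.1f} GB/GPU")
```

Even this crude model shows why replicating optimizer states on every data-parallel rank dominates memory at scale, and why partitioning them (and, in further stages, gradients and parameters) lets the trainable model size grow with the device count.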
{"title":"ZeRO: Memory optimizations Toward Training Trillion Parameter Models","authors":"Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He","doi":"10.1109/SC41405.2020.00024","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00024","url":null,"abstract":"Large deep learning models offer significant accuracy gains, but training billions to trillions of parameters is challenging. Existing solutions such as data and model parallelisms exhibit fundamental limitations to fit these models into limited device memory, while obtaining computation, communication and development efficiency. We develop a novel solution, Zero Redundancy Optimizer (ZeRO), to optimize memory, vastly improving training speed while increasing the model size that can be efficiently trained. ZeRO eliminates memory redundancies in data- and model-parallel training while retaining low communication volume and high computational granularity, allowing us to scale the model size proportional to the number of devices with sustained high efficiency. Our analysis on memory requirements and communication volume demonstrates: ZeRO has the potential to scale beyond 1 Trillion parameters using today’s hardware. We implement and evaluate ZeRO: it trains large models of over 100B parameter with super-linear speedup on 400 GPUs, achieving throughput of 15 Petaflops. This represents an 8x increase in model size and 10x increase in achievable performance over state-of-the-art. In terms of usability, ZeRO can train large models of up to 13B parameters (e.g., larger than Megatron GPT 8. 3B and T5 11B) without requiring model parallelism which is harder for scientists to apply. Last but not the least, researchers have used the system breakthroughs of ZeRO to create Turing-NLG, the world’s largest language model at the time (17B parameters) with record breaking accuracy.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"125 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114219361","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Task Bench: A Parameterized Benchmark for Evaluating Parallel Runtime Performance
Pub Date : 2019-08-15  DOI: 10.1109/SC41405.2020.00066
Elliott Slaughter, Wei Wu, Yuankun Fu, Legend Brandenburg, N. Garcia, Wilhem Kautz, Emily Marx, Kaleb S. Morris, Wonchan Lee, Qinglei Cao, G. Bosilca, S. Mirchandaney, Sean Treichler, P. McCormick, A. Aiken
We present Task Bench, a parameterized benchmark designed to explore the performance of distributed programming systems under a variety of application scenarios. Task Bench dramatically lowers the barrier to benchmarking and comparing multiple programming systems by making the implementation for a given system orthogonal to the benchmarks themselves: every benchmark constructed with Task Bench runs on every Task Bench implementation. Furthermore, Task Bench’s parameterization enables a wide variety of benchmark scenarios that distill the key characteristics of larger applications. To assess the effectiveness and overheads of the tested systems, we introduce a novel metric, minimum effective task granularity (METG). We conduct a comprehensive study with 15 programming systems on up to 256 Haswell nodes of the Cori supercomputer. Running at scale, 100 µs-long tasks are the finest granularity that any system runs efficiently with current technologies. We also study each system’s scalability and its ability to hide communication and mitigate load imbalance.
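To make the METG-style measurement concrete, here is a small sketch of reading such a metric off efficiency-versus-granularity data: find the smallest task granularity at which a system still reaches a target efficiency (50% here). The sample numbers are made up, and the linear interpolation is our own choice rather than necessarily Task Bench's exact procedure.

```python
# Read a minimum-effective-task-granularity style metric off measured data.
import numpy as np

def metg(task_granularity_s, efficiency, target=0.5):
    """Smallest task granularity (seconds) whose efficiency reaches `target`."""
    order = np.argsort(task_granularity_s)
    g = np.asarray(task_granularity_s, dtype=float)[order]
    e = np.asarray(efficiency, dtype=float)[order]
    above = np.nonzero(e >= target)[0]
    if len(above) == 0:
        return float("inf")                    # never reaches the target
    i = above[0]
    if i == 0:
        return g[0]
    # linear interpolation between the last point below and first point above
    frac = (target - e[i - 1]) / (e[i] - e[i - 1])
    return g[i - 1] + frac * (g[i] - g[i - 1])

granularity = [1e-5, 3e-5, 1e-4, 3e-4, 1e-3]   # seconds per task (synthetic)
efficiency = [0.05, 0.20, 0.55, 0.85, 0.97]    # fraction of ideal throughput
print(f"METG(50%) ~ {metg(granularity, efficiency) * 1e6:.0f} us")
```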
{"title":"Task Bench: A Parameterized Benchmark for Evaluating Parallel Runtime Performance","authors":"Elliott Slaughter, Wei Wu, Yuankun Fu, Legend Brandenburg, N. Garcia, Wilhem Kautz, Emily Marx, Kaleb S. Morris, Wonchan Lee, Qinglei Cao, G. Bosilca, S. Mirchandaney, Sean Treichler, P. McCormick, A. Aiken","doi":"10.1109/SC41405.2020.00066","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00066","url":null,"abstract":"We present Task Bench, a parameterized benchmark designed to explore the performance of distributed programming systems under a variety of application scenarios. Task Bench dramatically lowers the barrier to benchmarking and comparing multiple programming systems by making the implementation for a given system orthogonal to the benchmarks themselves: every benchmark constructed with Task Bench runs on every Task Bench implementation. Furthermore, Task Bench’s parameterization enables a wide variety of benchmark scenarios that distill the key characteristics of larger applications.To assess the effectiveness and overheads of the tested systems, we introduce a novel metric, minimum effective task granularity (METG). We conduct a comprehensive study with 15 programming systems on up to 256 Haswell nodes of the Cori supercomputer. Running at scale, 100$mu$s-long tasks are the finest granularity that any system runs efficiently with current technologies. We also study each system’s scalability, ability to hide communication and mitigate load imbalance.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128512676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}