GEMS: GPU-Enabled Memory-Aware Model-Parallelism System for Distributed DNN Training

SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. Pub Date: 2020-11-01. DOI: 10.1109/SC41405.2020.00049

Arpan Jain, A. Awan, Asmaa Aljuhani, J. Hashmi, Quentin G. Anthony, H. Subramoni, D. Panda, R. Machiraju, A. Parwani


Citations: 28

Abstract

Data-parallelism has become an established paradigm to train DNNs that fit inside GPU memory on large-scale HPC systems. However, model-parallelism is required to train out-of-core DNNs. In this paper, we deal with emerging requirements brought forward by very large DNNs being trained using high-resolution images common in digital pathology. To address these, we propose, design, and implement GEMS, a GPU-Enabled Memory-Aware Model-Parallelism System. We present several design schemes like GEMS-MAST, GEMS-MASTER, and GEMS-Hybrid that offer excellent speedups over state-of-the-art systems like Mesh-TensorFlow and FlexFlow. Furthermore, we combine model-parallelism and data-parallelism to train a 1000-layer ResNet-1k model using 1,024 Volta V100 GPUs with 97.32% scaling-efficiency. For the real-world histopathology whole-slide-image (WSI) of 100,000 x 100,000 pixels, we train custom ResNet-110-v2 on image tiles of size 1024 x 1024 and reduce the training time from seven hours to 28 minutes.
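The headline figures in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the tile count assumes non-overlapping tiles with the slide edge padded up to a whole tile (the abstract does not state the tiling scheme), and `scaling_efficiency` uses the common definition of measured speedup divided by the factor by which resources grew, which may differ from the baseline the authors used.

```python
import math


def speedup(baseline_minutes: float, improved_minutes: float) -> float:
    """Ratio of baseline training time to improved training time."""
    return baseline_minutes / improved_minutes


def tiles_per_slide(slide_px: int, tile_px: int) -> int:
    """Tiles needed to cover a square slide.

    Assumption (not from the paper): tiles do not overlap and the
    last row and column are padded up to a full tile.
    """
    per_side = math.ceil(slide_px / tile_px)
    return per_side * per_side


def scaling_efficiency(measured_speedup: float, resource_ratio: float) -> float:
    """Common definition: achieved speedup over the ideal (linear) speedup."""
    return measured_speedup / resource_ratio


if __name__ == "__main__":
    # Seven hours reduced to 28 minutes, as reported in the abstract.
    print(speedup(7 * 60, 28))
    # 100,000 x 100,000 pixel slide cut into 1024 x 1024 tiles.
    print(tiles_per_slide(100_000, 1024))
```

Under these assumptions, the reported reduction from seven hours to 28 minutes corresponds to a 15x speedup, and a single whole-slide image yields 98 tiles per side.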
