A. Awan, Ching-Hsiang Chu, H. Subramoni, Xiaoyi Lu, D. Panda
Title: OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training
Venue: 2018 IEEE 25th International Conference on High Performance Computing (HiPC)
DOI: 10.1109/HiPC.2018.00024 (https://doi.org/10.1109/HiPC.2018.00024)
Publication date: 2018-12-01
Citations: 27
Abstract
Existing frameworks cannot train large DNNs that do not fit in GPU memory without explicit memory management schemes. In this paper, we propose OC-DNN, a novel Out-of-Core DNN training framework that exploits new Unified Memory (UM) features along with new hardware mechanisms in Pascal and Volta GPUs. OC-DNN has two major design components: 1) OC-Caffe, an enhanced version of Caffe that exploits innovative UM features like asynchronous prefetching, managed page migration, GPU-based page faults, and the cudaMemAdvise interface to enable efficient out-of-core training for very large DNNs, and 2) an interception library to transparently leverage these cutting-edge features for other frameworks. We provide a comprehensive performance characterization of our designs. OC-Caffe provides performance comparable to Caffe for regular DNNs. OC-Caffe-Opt is up to 1.9X faster than OC-Caffe-Naive and up to 5X faster than optimized CPU-based training for out-of-core workloads. OC-Caffe also allows scale-up (DGX-1) and scale-out on multi-GPU clusters.
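The Unified Memory features named in the abstract (managed allocation, the cudaMemAdvise interface, and asynchronous prefetching) can be illustrated with a minimal CUDA sketch. This is not the OC-Caffe implementation: the `scale` kernel, the buffer size, and the `CHECK` macro are hypothetical names chosen for illustration, and the advice flag is just one plausible choice. The calls use the CUDA 9-era signatures, where the target device is a plain `int`; later toolkits changed `cudaMemAdvise` and `cudaMemPrefetchAsync` to take a `cudaMemLocation`, so check the version you build against.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical stand-in for a layer computation: scale every element.
__global__ void scale(float* data, size_t n, float factor) {
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int device = 0;
    CHECK(cudaSetDevice(device));

    const size_t n = static_cast<size_t>(1) << 24;  // illustrative size only
    const size_t bytes = n * sizeof(float);

    // One managed allocation, visible to both CPU and GPU.
    float* buf = nullptr;
    CHECK(cudaMallocManaged(&buf, bytes));
    for (size_t i = 0; i < n; ++i) buf[i] = 1.0f;  // first touched on the CPU

    // Hints and prefetches need concurrent managed access (Pascal or newer
    // on a supported OS); without it, rely on plain managed migration.
    int concurrent = 0;
    CHECK(cudaDeviceGetAttribute(&concurrent,
                                 cudaDevAttrConcurrentManagedAccess, device));

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    if (concurrent) {
        // Hint: these pages should preferably live on the GPU.
        CHECK(cudaMemAdvise(buf, bytes, cudaMemAdviseSetPreferredLocation,
                            device));
        // Migrate pages ahead of the kernel instead of taking GPU page faults.
        CHECK(cudaMemPrefetchAsync(buf, bytes, device, stream));
    }

    const int threads = 256;
    const unsigned int blocks =
        static_cast<unsigned int>((n + threads - 1) / threads);
    scale<<<blocks, threads, 0, stream>>>(buf, n, 2.0f);
    CHECK(cudaGetLastError());

    if (concurrent) {
        // Bring the results back before the CPU reads them.
        CHECK(cudaMemPrefetchAsync(buf, bytes, cudaCpuDeviceId, stream));
    }
    CHECK(cudaStreamSynchronize(stream));

    // Self-check: every element should have been doubled.
    size_t bad = 0;
    for (size_t i = 0; i < n; ++i) {
        if (buf[i] != 2.0f) ++bad;
    }
    std::printf("mismatches: %zu\n", bad);

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(buf));
    return bad == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

Without the prefetch calls the kernel still produces the same result on Pascal and Volta GPUs, because pages migrate on demand through GPU page faults. The prefetch only moves that migration off the kernel's critical path, which is the kind of difference the naive and optimized variants in the abstract suggest.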