Scalable Inference for Sparse Deep Neural Networks using Kokkos Kernels

2019 IEEE High Performance Extreme Computing Conference (HPEC). Pub Date: 2019-09-01. DOI: 10.1109/HPEC.2019.8916378

J. Ellis, S. Rajamanickam


Citations: 17

Abstract

Over the last decade, hardware advances have made training and inference feasible for very large deep neural networks. Sparsified deep neural networks (DNNs) can greatly reduce memory costs and increase throughput relative to standard DNNs, provided the loss of accuracy can be controlled. The IEEE HPEC Sparse Deep Neural Network Graph Challenge serves as a testbed for algorithmic and implementation advances that maximize the computational performance of sparse deep neural networks. We base our sparse network for DNNs, KK-SpDNN, on the sparse linear algebra kernels within the Kokkos Kernels library. Using the sparse matrix-matrix multiplication in Kokkos Kernels allows us to reuse a highly optimized kernel. We focus on reducing the single-node and multi-node runtimes for 12 sparse networks. We test KK-SpDNN on Intel Skylake and Knights Landing architectures and observe a 120-500x improvement in single-node performance over the serial reference implementation. We run in data-parallel mode with MPI to further speed up network inference, ultimately obtaining an edge processing rate of 1.16e+12 on 20 Skylake nodes. This translates to a 13x speedup on 20 nodes compared to our highly optimized multithreaded implementation on a single Skylake node.
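The abstract describes inference built on sparse matrix-matrix multiplication and run in data-parallel mode. The sketch below illustrates that general idea in plain Python; it is not the KK-SpDNN code and does not call Kokkos Kernels or MPI. Everything in it is an assumption made for illustration: the dict-of-dicts sparse format, the function names (`spgemm`, `relu_with_bias`, `infer`, `data_parallel_infer`), the single scalar bias added only to stored entries, the optional `cap` clipping threshold, and the round-robin row partition that stands in for MPI ranks.

```python
from collections import defaultdict


def spgemm(A, B):
    """Multiply sparse matrices stored as {row: {col: value}} (illustrative format)."""
    C = {}
    for i, row in A.items():
        acc = defaultdict(float)
        for k, a_ik in row.items():
            # Only rows of B that match a stored column of A contribute.
            for j, b_kj in B.get(k, {}).items():
                acc[j] += a_ik * b_kj
        if acc:
            C[i] = dict(acc)
    return C


def relu_with_bias(Y, bias, cap=None):
    """Add a scalar bias to stored entries, apply ReLU, and drop non-positive results."""
    out = {}
    for i, row in Y.items():
        kept = {}
        for j, v in row.items():
            v += bias
            if v > 0:
                kept[j] = v if cap is None else min(v, cap)
        if kept:
            out[i] = kept
    return out


def infer(Y0, weights, bias, cap=None):
    """Propagate input features Y0 through the layers: Y <- ReLU(Y @ W + bias)."""
    Y = Y0
    for W in weights:
        Y = relu_with_bias(spgemm(Y, W), bias, cap)
    return Y


def partition_rows(Y0, n_ranks):
    """Split input rows round-robin into n_ranks disjoint blocks (stand-in for MPI ranks)."""
    parts = [{} for _ in range(n_ranks)]
    for idx, (i, row) in enumerate(sorted(Y0.items())):
        parts[idx % n_ranks][i] = row
    return parts


def data_parallel_infer(Y0, weights, bias, n_ranks, cap=None):
    """Run inference independently on each row block, then merge the results."""
    merged = {}
    for part in partition_rows(Y0, n_ranks):
        # Each block needs the full set of weights but no data from other blocks.
        merged.update(infer(part, weights, bias, cap))
    return merged


# Tiny made-up example: 3 input rows, 2 features, 2 identical layers.
Y0 = {0: {0: 1.0, 1: 2.0}, 1: {1: 1.0}, 2: {0: 3.0}}
W = {0: {0: 1.0}, 1: {0: -1.0, 1: 0.5}}
serial = infer(Y0, [W, W], bias=-0.25)
parallel = data_parallel_infer(Y0, [W, W], bias=-0.25, n_ranks=2)
print(serial)  # {0: {1: 0.125}, 2: {0: 2.5}}
print(serial == parallel)  # True
```

Because each input row passes through the layers independently of the others, splitting the rows across workers needs no communication during the layer loop in this sketch; every worker only has to hold its own copy of the weights.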


