Harnessing Deep Learning via a Single Building Block

E. Georganas, K. Banerjee, Dhiraj D. Kalamkar, Sasikanth Avancha, Anand Venkat, Michael J. Anderson, G. Henry, Hans Pabst, A. Heinecke
{"title":"Harnessing Deep Learning via a Single Building Block","authors":"E. Georganas, K. Banerjee, Dhiraj D. Kalamkar, Sasikanth Avancha, Anand Venkat, Michael J. Anderson, G. Henry, Hans Pabst, A. Heinecke","doi":"10.1109/IPDPS47924.2020.00032","DOIUrl":null,"url":null,"abstract":"Deep learning (DL) is one of the most prominent branches of machine learning. Due to the immense computational cost of DL workloads, industry and academia have developed DL libraries with highly-specialized kernels for each workload/architecture, leading to numerous, complex code-bases that strive for performance, yet they are hard to maintain and do not generalize. In this work, we introduce the batch-reduce GEMM kernel and show how the most popular DL algorithms can be formulated with this kernel as the basic building-block. Consequently, the DL library-development degenerates to mere (potentially automatic) tuning of loops around this sole optimized kernel. By exploiting our new kernel we implement Recurrent Neural Networks, Convolution Neural Networks and Multilayer Perceptron training and inference primitives in just 3K lines of high-level code. Our primitives outperform vendor-optimized libraries on multi-node CPU clusters, and we also provide proof-of-concept CNN kernels targeting GPUs. Finally, we demonstrate that the batch-reduce GEMM kernel within a tensor compiler yields high-performance CNN primitives, further amplifying the viability of our approach.","PeriodicalId":6805,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"258 1","pages":"222-233"},"PeriodicalIF":0.0000,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"15","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS47924.2020.00032","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 15

Abstract

Deep learning (DL) is one of the most prominent branches of machine learning. Due to the immense computational cost of DL workloads, industry and academia have developed DL libraries with highly specialized kernels for each workload/architecture, leading to numerous, complex codebases that strive for performance yet are hard to maintain and do not generalize. In this work, we introduce the batch-reduce GEMM kernel and show how the most popular DL algorithms can be formulated with this kernel as the basic building block. Consequently, DL library development reduces to mere (potentially automatic) tuning of the loops around this sole optimized kernel. By exploiting our new kernel, we implement Recurrent Neural Network, Convolutional Neural Network, and Multilayer Perceptron training and inference primitives in just 3K lines of high-level code. Our primitives outperform vendor-optimized libraries on multi-node CPU clusters, and we also provide proof-of-concept CNN kernels targeting GPUs. Finally, we demonstrate that using the batch-reduce GEMM kernel within a tensor compiler yields high-performance CNN primitives, further underscoring the viability of our approach.
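To make the central idea concrete, the following is a minimal NumPy sketch of a batch-reduce GEMM and of how a DL primitive can be expressed as plain loops around it. All names, shapes, and the 1x1-convolution example are illustrative assumptions for exposition; they are not the paper's actual kernel API or implementation. The kernel simply accumulates a sum of small matrix products, C += sum_i A_i * B_i, and the surrounding loops only decide which blocks of the operands feed each call.

```python
# Illustrative sketch only: a naive "batch-reduce GEMM" and loops around it.
# The function name, block sizes, and layouts are assumptions, not the
# paper's actual interface.
import numpy as np

def batch_reduce_gemm(a_blocks, b_blocks, c):
    """Accumulate C += sum_i A_i @ B_i over a batch of operand blocks."""
    for a_i, b_i in zip(a_blocks, b_blocks):
        c += a_i @ b_i
    return c

if __name__ == "__main__":
    # A 1x1 convolution (a per-pixel fully connected layer) written as
    # loops that merely gather blocks and invoke the single kernel.
    N, C_in, C_out, HW = 4, 64, 128, 196   # minibatch, channels, spatial size
    blk = 16                               # blocking along input channels
    x = np.random.rand(N, C_in, HW).astype(np.float32)
    w = np.random.rand(C_out, C_in).astype(np.float32)
    y = np.zeros((N, C_out, HW), dtype=np.float32)

    for n in range(N):                     # "loops around the kernel"
        a_blocks = [w[:, k:k + blk] for k in range(0, C_in, blk)]
        b_blocks = [x[n, k:k + blk, :] for k in range(0, C_in, blk)]
        batch_reduce_gemm(a_blocks, b_blocks, y[n])

    # Sanity check against a plain matrix multiplication.
    assert np.allclose(y, np.einsum("oc,nch->noh", w, x), atol=1e-3)
```

In this reading, performance tuning concerns only the outer loops (blocking, ordering, parallelization), while the single inner kernel stays fixed, which is the library-design simplification the abstract describes.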