Optimizing Deep Learning Recommender Systems Training on CPU Cluster Architectures

Dhiraj D. Kalamkar, E. Georganas, S. Srinivasan, Jianping Chen, Mikhail Shiryaev, A. Heinecke
{"title":"在CPU集群架构上优化深度学习推荐系统训练","authors":"Dhiraj D. Kalamkar, E. Georganas, S. Srinivasan, Jianping Chen, Mikhail Shiryaev, A. Heinecke","doi":"10.1109/SC41405.2020.00047","DOIUrl":null,"url":null,"abstract":"During the last two years, the goal of many researchers has been to squeeze the last bit of performance out of HPC system for AI tasks. Often this discussion is held in the context of how fast ResNet50 can be trained. Unfortunately, ResNet50 is no longer a representative workload in 2020. Thus, we focus on Recommender Systems which account for most of the AI cycles in cloud computing centers. More specifically, we focus on Facebook’s DLRM benchmark. By enabling it to run on latest CPU hardware and software tailored for HPC, we are able to achieve up to two-orders of magnitude improvement in performance on a single socket compared to the reference CPU implementation, and high scaling efficiency up to 64 sockets, while fitting ultra-large datasets which cannot be held in single node’s memory. Therefore, this paper discusses and analyzes novel optimization and parallelization techniques for the various operators in DLRM. Several optimizations (e.g. tensor-contraction accelerated MLPs, framework MPI progression, BFLOAT16 training with up to $1.8 \\times $ speed-up) are general and transferable to many other deep learning topologies.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"104 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"32","resultStr":"{\"title\":\"Optimizing Deep Learning Recommender Systems Training on CPU Cluster Architectures\",\"authors\":\"Dhiraj D. Kalamkar, E. Georganas, S. Srinivasan, Jianping Chen, Mikhail Shiryaev, A. Heinecke\",\"doi\":\"10.1109/SC41405.2020.00047\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"During the last two years, the goal of many researchers has been to squeeze the last bit of performance out of HPC system for AI tasks. Often this discussion is held in the context of how fast ResNet50 can be trained. Unfortunately, ResNet50 is no longer a representative workload in 2020. Thus, we focus on Recommender Systems which account for most of the AI cycles in cloud computing centers. More specifically, we focus on Facebook’s DLRM benchmark. By enabling it to run on latest CPU hardware and software tailored for HPC, we are able to achieve up to two-orders of magnitude improvement in performance on a single socket compared to the reference CPU implementation, and high scaling efficiency up to 64 sockets, while fitting ultra-large datasets which cannot be held in single node’s memory. Therefore, this paper discusses and analyzes novel optimization and parallelization techniques for the various operators in DLRM. Several optimizations (e.g. 
tensor-contraction accelerated MLPs, framework MPI progression, BFLOAT16 training with up to $1.8 \\\\times $ speed-up) are general and transferable to many other deep learning topologies.\",\"PeriodicalId\":424429,\"journal\":{\"name\":\"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis\",\"volume\":\"104 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-05-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"32\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SC41405.2020.00047\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SC41405.2020.00047","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 32

Abstract

During the last two years, the goal of many researchers has been to squeeze the last bit of performance out of HPC systems for AI tasks. This discussion is often held in the context of how fast ResNet50 can be trained. Unfortunately, ResNet50 is no longer a representative workload in 2020. We therefore focus on recommender systems, which account for most of the AI cycles in cloud computing centers, and more specifically on Facebook's DLRM benchmark. By enabling it to run on the latest CPU hardware and software tailored for HPC, we achieve up to a two-order-of-magnitude performance improvement on a single socket compared to the reference CPU implementation, and high scaling efficiency up to 64 sockets, while fitting ultra-large datasets that cannot be held in a single node's memory. This paper therefore discusses and analyzes novel optimization and parallelization techniques for the various operators in DLRM. Several of the optimizations (e.g., tensor-contraction-accelerated MLPs, framework MPI progression, and BFLOAT16 training with up to 1.8× speed-up) are general and transferable to many other deep learning topologies.
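For readers unfamiliar with the workload, the sketch below illustrates the DLRM-style operator mix the paper optimizes: EmbeddingBag lookups for the sparse categorical features, small MLPs for the dense features, and a pairwise dot-product feature interaction. It is a minimal PyTorch illustration written for this summary, not the authors' optimized implementation or the official benchmark code (facebookresearch/dlrm); all layer and table sizes are arbitrary placeholders.

```python
import torch
import torch.nn as nn

class TinyDLRM(nn.Module):
    """Minimal DLRM-style model: bottom MLP for dense features, EmbeddingBag
    lookups for sparse features, pairwise dot-product feature interaction,
    and a top MLP producing a click probability. Sizes are placeholders."""

    def __init__(self, num_dense=13, table_sizes=(1000, 1000, 1000), dim=16):
        super().__init__()
        self.bottom_mlp = nn.Sequential(
            nn.Linear(num_dense, 64), nn.ReLU(), nn.Linear(64, dim), nn.ReLU()
        )
        self.embeddings = nn.ModuleList(
            [nn.EmbeddingBag(n, dim, mode="sum") for n in table_sizes]
        )
        num_feat = 1 + len(table_sizes)            # dense vector + one per table
        num_pairs = num_feat * (num_feat - 1) // 2
        self.top_mlp = nn.Sequential(
            nn.Linear(dim + num_pairs, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, dense, sparse_indices, sparse_offsets):
        x = self.bottom_mlp(dense)                                    # (B, dim)
        emb = [e(i, o) for e, i, o in
               zip(self.embeddings, sparse_indices, sparse_offsets)]  # each (B, dim)
        feats = torch.stack([x] + emb, dim=1)                         # (B, F, dim)
        inter = torch.bmm(feats, feats.transpose(1, 2))               # (B, F, F)
        li, lj = torch.tril_indices(feats.shape[1], feats.shape[1], offset=-1)
        pairs = inter[:, li, lj]                                      # (B, F*(F-1)/2)
        return torch.sigmoid(self.top_mlp(torch.cat([x, pairs], dim=1)))

# Toy batch: 13 dense features, 3 embedding tables, 2 lookups per sample.
B = 4
dense = torch.randn(B, 13)
indices = [torch.randint(0, 1000, (B * 2,)) for _ in range(3)]
offsets = [torch.arange(0, B * 2, 2) for _ in range(3)]
print(TinyDLRM()(dense, indices, offsets).shape)  # torch.Size([4, 1])
```

The embedding lookups are memory-bandwidth bound while the MLPs are compute bound, which is why the paper treats the two operator classes with separate optimization and parallelization strategies.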
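One of the transferable optimizations highlighted in the abstract is BFLOAT16 training (up to 1.8× speed-up reported). As a point of reference only, and not the paper's implementation (which relies on its own optimized kernels), the following shows how BFLOAT16 mixed precision can be expressed on CPU with stock PyTorch's torch.autocast; the model, data, and optimizer here are placeholders.

```python
import torch
import torch.nn as nn

# Placeholder MLP and data standing in for the DLRM top/bottom MLPs.
model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.BCEWithLogitsLoss()

x = torch.randn(128, 512)
y = torch.randint(0, 2, (128, 1)).float()

for step in range(10):
    optimizer.zero_grad()
    # Forward pass and loss run with BFLOAT16 compute on CPU; the model
    # parameters themselves remain FP32 "master" weights.
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        loss = loss_fn(model(x), y)
    # Backward and the optimizer step run outside the autocast region.
    loss.backward()
    optimizer.step()
```

BFLOAT16 keeps FP32's exponent range while halving memory traffic per value, which is why it tends to benefit bandwidth-bound training loops such as the one sketched above.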