Optimizing Deep Learning Recommender Systems Training on CPU Cluster Architectures

Dhiraj D. Kalamkar, E. Georganas, S. Srinivasan, Jianping Chen, Mikhail Shiryaev, A. Heinecke
{"title":"在CPU集群架构上优化深度学习推荐系统训练","authors":"Dhiraj D. Kalamkar, E. Georganas, S. Srinivasan, Jianping Chen, Mikhail Shiryaev, A. Heinecke","doi":"10.1109/SC41405.2020.00047","DOIUrl":null,"url":null,"abstract":"During the last two years, the goal of many researchers has been to squeeze the last bit of performance out of HPC system for AI tasks. Often this discussion is held in the context of how fast ResNet50 can be trained. Unfortunately, ResNet50 is no longer a representative workload in 2020. Thus, we focus on Recommender Systems which account for most of the AI cycles in cloud computing centers. More specifically, we focus on Facebook’s DLRM benchmark. By enabling it to run on latest CPU hardware and software tailored for HPC, we are able to achieve up to two-orders of magnitude improvement in performance on a single socket compared to the reference CPU implementation, and high scaling efficiency up to 64 sockets, while fitting ultra-large datasets which cannot be held in single node’s memory. Therefore, this paper discusses and analyzes novel optimization and parallelization techniques for the various operators in DLRM. Several optimizations (e.g. tensor-contraction accelerated MLPs, framework MPI progression, BFLOAT16 training with up to $1.8 \\times $ speed-up) are general and transferable to many other deep learning topologies.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"104 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"32","resultStr":"{\"title\":\"Optimizing Deep Learning Recommender Systems Training on CPU Cluster Architectures\",\"authors\":\"Dhiraj D. Kalamkar, E. Georganas, S. Srinivasan, Jianping Chen, Mikhail Shiryaev, A. Heinecke\",\"doi\":\"10.1109/SC41405.2020.00047\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"During the last two years, the goal of many researchers has been to squeeze the last bit of performance out of HPC system for AI tasks. Often this discussion is held in the context of how fast ResNet50 can be trained. Unfortunately, ResNet50 is no longer a representative workload in 2020. Thus, we focus on Recommender Systems which account for most of the AI cycles in cloud computing centers. More specifically, we focus on Facebook’s DLRM benchmark. By enabling it to run on latest CPU hardware and software tailored for HPC, we are able to achieve up to two-orders of magnitude improvement in performance on a single socket compared to the reference CPU implementation, and high scaling efficiency up to 64 sockets, while fitting ultra-large datasets which cannot be held in single node’s memory. Therefore, this paper discusses and analyzes novel optimization and parallelization techniques for the various operators in DLRM. Several optimizations (e.g. 
tensor-contraction accelerated MLPs, framework MPI progression, BFLOAT16 training with up to $1.8 \\\\times $ speed-up) are general and transferable to many other deep learning topologies.\",\"PeriodicalId\":424429,\"journal\":{\"name\":\"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis\",\"volume\":\"104 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-05-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"32\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SC41405.2020.00047\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SC41405.2020.00047","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 32

Abstract

During the last two years, the goal of many researchers has been to squeeze the last bit of performance out of HPC systems for AI tasks. This discussion is often held in the context of how fast ResNet50 can be trained. Unfortunately, ResNet50 is no longer a representative workload in 2020. We therefore focus on recommender systems, which account for most of the AI cycles in cloud computing centers, and more specifically on Facebook's DLRM benchmark. By enabling it to run on the latest CPU hardware and software tailored for HPC, we achieve up to a two-order-of-magnitude performance improvement on a single socket compared to the reference CPU implementation, and high scaling efficiency up to 64 sockets, while fitting ultra-large datasets that cannot be held in a single node's memory. This paper therefore discusses and analyzes novel optimization and parallelization techniques for the various operators in DLRM. Several of the optimizations (e.g., tensor-contraction-accelerated MLPs, framework MPI progression, and BFLOAT16 training with up to 1.8× speed-up) are general and transferable to many other deep learning topologies.
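For readers unfamiliar with the workload, the sketch below illustrates the DLRM-style operator mix the paper optimizes: EmbeddingBag lookups for the sparse categorical features, small MLPs for the dense features, and a pairwise dot-product feature interaction. It is a minimal PyTorch illustration written for this summary, not the authors' optimized implementation or the official benchmark code (facebookresearch/dlrm); all layer and table sizes are arbitrary placeholders.

```python
import torch
import torch.nn as nn

class TinyDLRM(nn.Module):
    """Minimal DLRM-style model: bottom MLP for dense features, EmbeddingBag
    lookups for sparse features, pairwise dot-product feature interaction,
    and a top MLP producing a click probability. Sizes are placeholders."""

    def __init__(self, num_dense=13, table_sizes=(1000, 1000, 1000), dim=16):
        super().__init__()
        self.bottom_mlp = nn.Sequential(
            nn.Linear(num_dense, 64), nn.ReLU(), nn.Linear(64, dim), nn.ReLU()
        )
        self.embeddings = nn.ModuleList(
            [nn.EmbeddingBag(n, dim, mode="sum") for n in table_sizes]
        )
        num_feat = 1 + len(table_sizes)            # dense vector + one per table
        num_pairs = num_feat * (num_feat - 1) // 2
        self.top_mlp = nn.Sequential(
            nn.Linear(dim + num_pairs, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, dense, sparse_indices, sparse_offsets):
        x = self.bottom_mlp(dense)                                    # (B, dim)
        emb = [e(i, o) for e, i, o in
               zip(self.embeddings, sparse_indices, sparse_offsets)]  # each (B, dim)
        feats = torch.stack([x] + emb, dim=1)                         # (B, F, dim)
        inter = torch.bmm(feats, feats.transpose(1, 2))               # (B, F, F)
        li, lj = torch.tril_indices(feats.shape[1], feats.shape[1], offset=-1)
        pairs = inter[:, li, lj]                                      # (B, F*(F-1)/2)
        return torch.sigmoid(self.top_mlp(torch.cat([x, pairs], dim=1)))

# Toy batch: 13 dense features, 3 embedding tables, 2 lookups per sample.
B = 4
dense = torch.randn(B, 13)
indices = [torch.randint(0, 1000, (B * 2,)) for _ in range(3)]
offsets = [torch.arange(0, B * 2, 2) for _ in range(3)]
print(TinyDLRM()(dense, indices, offsets).shape)  # torch.Size([4, 1])
```

The embedding lookups are memory-bandwidth bound while the MLPs are compute bound, which is why the paper treats the two operator classes with separate optimization and parallelization strategies.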
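One of the transferable optimizations highlighted in the abstract is BFLOAT16 training (up to 1.8× speed-up reported). As a point of reference only, and not the paper's implementation (which relies on its own optimized kernels), the following shows how BFLOAT16 mixed precision can be expressed on CPU with stock PyTorch's torch.autocast; the model, data, and optimizer here are placeholders.

```python
import torch
import torch.nn as nn

# Placeholder MLP and data standing in for the DLRM top/bottom MLPs.
model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.BCEWithLogitsLoss()

x = torch.randn(128, 512)
y = torch.randint(0, 2, (128, 1)).float()

for step in range(10):
    optimizer.zero_grad()
    # Forward pass and loss run with BFLOAT16 compute on CPU; the model
    # parameters themselves remain FP32 "master" weights.
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        loss = loss_fn(model(x), y)
    # Backward and the optimizer step run outside the autocast region.
    loss.backward()
    optimizer.step()
```

BFLOAT16 keeps FP32's exponent range while halving memory traffic per value, which is why it tends to benefit bandwidth-bound training loops such as the one sketched above.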