Building a Performance Model for Deep Learning Recommendation Model Training on GPUs

Zhongyi Lin, Louis Feng, E. K. Ardestani, Jaewon Lee, J. Lundell, Changkyu Kim, A. Kejariwal, John Douglas Owens
{"title":"Building a Performance Model for Deep Learning Recommendation Model Training on GPUs","authors":"Zhongyi Lin, Louis Feng, E. K. Ardestani, Jaewon Lee, J. Lundell, Changkyu Kim, A. Kejariwal, John Douglas Owens","doi":"10.1109/HiPC56025.2022.00019","DOIUrl":null,"url":null,"abstract":"We devise a performance model for GPU training of Deep Learning Recommendation Models (DLRM), whose GPU utilization is low compared to other well-optimized CV and NLP models. We show that both the device active time (the sum of kernel runtimes) but also the device idle time are important components of the overall device time. We therefore tackle them separately by (1) flexibly adopting heuristic-based and ML-based kernel performance models for operators that dominate the device active time, and (2) categorizing operator overheads into five types to determine quantitatively their contribution to the device active time. Combining these two parts, we propose a critical-path-based algorithm to predict the per-batch training time of DLRM by traversing its execution graph. We achieve less than 10% geometric mean average error (GMAE) in all kernel performance modeling, and 4.61% and 7.96% geomean errors for GPU active time and overall E2E per-batch training time prediction with overheads from individual workloads, respectively. A slight increase of 2.19% incurred in E2E prediction error with shared overheads across workloads suggests the feasibility of using shared overheads in large-scale prediction. We show that our general performance model not only achieves low prediction error on DLRM, which has highly customized configurations and is dominated by multiple factors but also yields comparable accuracy on other compute-bound ML models targeted by most previous methods. Using this performance model and graph-level data and task dependency analysis, we show our system can provide more general model-system co-design than previous methods.","PeriodicalId":119363,"journal":{"name":"2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"129 11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HiPC56025.2022.00019","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6

Abstract

We devise a performance model for GPU training of Deep Learning Recommendation Models (DLRM), whose GPU utilization is low compared to other well-optimized CV and NLP models. We show that both the device active time (the sum of kernel runtimes) and the device idle time are important components of the overall device time. We therefore tackle them separately by (1) flexibly adopting heuristic-based and ML-based kernel performance models for operators that dominate the device active time, and (2) categorizing operator overheads into five types to determine quantitatively their contribution to the device active time. Combining these two parts, we propose a critical-path-based algorithm to predict the per-batch training time of DLRM by traversing its execution graph. We achieve less than 10% geometric mean average error (GMAE) in all kernel performance modeling, and 4.61% and 7.96% geomean errors for GPU active time and overall E2E per-batch training time prediction with overheads from individual workloads, respectively. A slight increase of 2.19% in E2E prediction error when overheads are shared across workloads suggests the feasibility of using shared overheads in large-scale prediction. We show that our general performance model not only achieves low prediction error on DLRM, which has highly customized configurations and is dominated by multiple factors, but also yields comparable accuracy on other compute-bound ML models targeted by most previous methods. Using this performance model and graph-level data and task dependency analysis, we show our system can provide more general model-system co-design than previous methods.
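To make the critical-path idea concrete, the following is a minimal sketch of predicting per-batch time by traversing an operator execution graph, where each operator carries a pre-predicted kernel runtime (device active time) and a launch/idle overhead. The graph representation, operator names, and the way overheads are attached to nodes are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
# Minimal sketch: critical-path time over a DAG of operators whose kernel
# runtimes and overheads have already been predicted by per-op models.
# Node names, timings, and the overhead handling are illustrative only.
from dataclasses import dataclass, field

@dataclass
class OpNode:
    name: str
    kernel_time_us: float                       # predicted device active time
    overhead_us: float                          # predicted overhead before the kernel runs
    deps: list = field(default_factory=list)    # upstream OpNode dependencies

def predict_batch_time_us(ops):
    """Return the critical-path finish time; `ops` must be in topological order."""
    finish = {}
    for op in ops:
        ready = max((finish[d.name] for d in op.deps), default=0.0)
        finish[op.name] = ready + op.overhead_us + op.kernel_time_us
    return max(finish.values(), default=0.0)

# Toy usage: embedding lookup and bottom MLP are independent; the feature
# interaction op must wait for both, so it sits on the critical path.
emb = OpNode("embedding_fwd", kernel_time_us=120.0, overhead_us=15.0)
mlp = OpNode("bottom_mlp_fwd", kernel_time_us=80.0, overhead_us=10.0)
inter = OpNode("interaction_fwd", kernel_time_us=40.0, overhead_us=12.0, deps=[emb, mlp])
print(predict_batch_time_us([emb, mlp, inter]))  # 135 + 12 + 40 = 187.0 us
```

Summing predicted kernel times alone would give only the device active time; carrying the per-operator overheads along the critical path is what lets the end-to-end per-batch estimate also account for device idle time.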