Time-Based Roofline for Deep Learning Performance Analysis

Yunsong Wang, Charlene Yang, S. Farrell, Yan Zhang, T. Kurth, Samuel Williams
{"title":"Time-Based Roofline for Deep Learning Performance Analysis","authors":"Yunsong Wang, Charlene Yang, S. Farrell, Yan Zhang, T. Kurth, Samuel Williams","doi":"10.1109/DLS51937.2020.00007","DOIUrl":null,"url":null,"abstract":"Deep learning applications based on neural networks are generating considerable interest in various fields due to their high accuracy. Such an application is usually very compute-intensive thus requires a long run time. Researchers and engineers are actively exploring new solutions to this issue from both hardware and software/algorithm sides. However, little previous work has focused on providing a practical methodology to characterize deep learning performance bottlenecks and potentially guide the following optimization efforts. In this paper, we introduce an extension of the Roofline model and use it to analyze two representative computation kernels in deep learning, 2D convolution and long short-term memory, on NVIDIA GPUs. This new time-based Roofline model incorporates both compute/bandwidth complexity and run time in its formulae to demonstrate performance issues that cannot be reflected by the classic Roofline. Factors such as arithmetic intensity, data transfer, kernel launch overhead, and the Tensor Core usage will be examined by varying different parameters such as batch size and feature size, etc. This work helped form a more systematic way to understand the performance issue of deep learning applications. Last but not least, this generic performance model can be applied to a wide category of applications besides deep learning as well.","PeriodicalId":185533,"journal":{"name":"2020 IEEE/ACM Fourth Workshop on Deep Learning on Supercomputers (DLS)","volume":"290 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE/ACM Fourth Workshop on Deep Learning on Supercomputers (DLS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DLS51937.2020.00007","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 13

Abstract

Deep learning applications based on neural networks are generating considerable interest in many fields due to their high accuracy. Such applications are usually very compute-intensive and therefore require long run times. Researchers and engineers are actively exploring solutions to this issue on both the hardware and the software/algorithm sides. However, little previous work has focused on providing a practical methodology to characterize deep learning performance bottlenecks and guide subsequent optimization efforts. In this paper, we introduce an extension of the Roofline model and use it to analyze two representative computation kernels in deep learning, 2D convolution and long short-term memory (LSTM), on NVIDIA GPUs. This new time-based Roofline model incorporates both compute/bandwidth complexity and run time in its formulae, exposing performance issues that the classic Roofline cannot capture. Factors such as arithmetic intensity, data transfer, kernel launch overhead, and Tensor Core usage are examined by varying parameters such as batch size and feature size. This work helps form a more systematic way to understand the performance issues of deep learning applications. Finally, this generic performance model can also be applied to a wide range of applications beyond deep learning.
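The abstract summarizes but does not reproduce the model's formulae. As a rough orientation, the following is a minimal sketch of the idea under standard Roofline notation, which is assumed here rather than taken from the paper: W is a kernel's work in FLOPs, Q its data movement in bytes, I = W/Q its arithmetic intensity, P_peak the peak compute throughput, and B the peak memory bandwidth.

```latex
% Minimal sketch, not the paper's exact formulae.
% Classic Roofline: attainable throughput as a function of arithmetic intensity.
\[
  P_{\mathrm{attainable}} = \min\!\left(P_{\mathrm{peak}},\; I \cdot B\right),
  \qquad I = \frac{W}{Q}
\]
% A time-based reading bounds the run time itself, which makes fixed costs
% such as kernel launch overhead (an additive term assumed here purely for
% illustration) visible in a way the throughput form cannot express.
\[
  T \;\ge\; \max\!\left(\frac{W}{P_{\mathrm{peak}}},\; \frac{Q}{B}\right) + T_{\mathrm{launch}}
\]
```

To make the batch-size and feature-size dependence concrete, the sketch below estimates the arithmetic intensity of a direct 2D convolution. The helper conv2d_arithmetic_intensity is hypothetical, not the paper's tooling; it assumes FP16 tensors whose inputs, weights, and outputs each cross device memory exactly once, and real cache behavior will shift the numbers.

```python
# Back-of-the-envelope arithmetic intensity for a direct, same-padded 2D
# convolution (hypothetical illustration; not the paper's methodology).

def conv2d_arithmetic_intensity(n, c, k, h, w, r=3, s=3, bytes_per_elem=2):
    """Return (flops, bytes, flops/byte) for an (N,C,H,W) -> (N,K,H,W) conv."""
    flops = 2 * n * k * c * h * w * r * s          # multiply + add per MAC
    bytes_moved = bytes_per_elem * (
        n * c * h * w      # input activations
        + k * c * r * s    # weights
        + n * k * h * w    # output activations
    )
    return flops, bytes_moved, flops / bytes_moved

if __name__ == "__main__":
    # Larger batches amortize the (batch-independent) weight traffic,
    # so arithmetic intensity rises toward a compute-bound plateau.
    for batch in (1, 8, 64):
        f, b, ai = conv2d_arithmetic_intensity(batch, c=64, k=64, h=56, w=56)
        print(f"batch={batch:3d}  FLOPs={f:.3e}  bytes={b:.3e}  AI={ai:.1f} FLOP/B")
```

Compared against the machine balance P_peak / B, such an intensity estimate indicates whether a given kernel configuration should be expected to be bandwidth-bound or compute-bound, which is the comparison a Roofline analysis builds on.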