Time-Based Roofline for Deep Learning Performance Analysis

Yunsong Wang, Charlene Yang, S. Farrell, Yan Zhang, T. Kurth, Samuel Williams
{"title":"Time-Based Roofline for Deep Learning Performance Analysis","authors":"Yunsong Wang, Charlene Yang, S. Farrell, Yan Zhang, T. Kurth, Samuel Williams","doi":"10.1109/DLS51937.2020.00007","DOIUrl":null,"url":null,"abstract":"Deep learning applications based on neural networks are generating considerable interest in various fields due to their high accuracy. Such an application is usually very compute-intensive thus requires a long run time. Researchers and engineers are actively exploring new solutions to this issue from both hardware and software/algorithm sides. However, little previous work has focused on providing a practical methodology to characterize deep learning performance bottlenecks and potentially guide the following optimization efforts. In this paper, we introduce an extension of the Roofline model and use it to analyze two representative computation kernels in deep learning, 2D convolution and long short-term memory, on NVIDIA GPUs. This new time-based Roofline model incorporates both compute/bandwidth complexity and run time in its formulae to demonstrate performance issues that cannot be reflected by the classic Roofline. Factors such as arithmetic intensity, data transfer, kernel launch overhead, and the Tensor Core usage will be examined by varying different parameters such as batch size and feature size, etc. This work helped form a more systematic way to understand the performance issue of deep learning applications. Last but not least, this generic performance model can be applied to a wide category of applications besides deep learning as well.","PeriodicalId":185533,"journal":{"name":"2020 IEEE/ACM Fourth Workshop on Deep Learning on Supercomputers (DLS)","volume":"290 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE/ACM Fourth Workshop on Deep Learning on Supercomputers (DLS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DLS51937.2020.00007","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 13

Abstract

Deep learning applications based on neural networks are generating considerable interest in many fields due to their high accuracy. Such applications are usually very compute-intensive and therefore require long run times. Researchers and engineers are actively exploring solutions to this issue on both the hardware and the software/algorithm sides. However, little previous work has focused on providing a practical methodology to characterize deep learning performance bottlenecks and guide subsequent optimization efforts. In this paper, we introduce an extension of the Roofline model and use it to analyze two representative computation kernels in deep learning, 2D convolution and long short-term memory (LSTM), on NVIDIA GPUs. This new time-based Roofline model incorporates both compute/bandwidth complexity and run time in its formulae, exposing performance issues that the classic Roofline cannot capture. Factors such as arithmetic intensity, data transfer, kernel launch overhead, and Tensor Core usage are examined by varying parameters such as batch size and feature size. This work helps form a more systematic way to understand the performance issues of deep learning applications. Finally, this generic performance model can also be applied to a wide range of applications beyond deep learning.
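The abstract summarizes but does not reproduce the model's formulae. As a rough orientation, the following is a minimal sketch of the idea under standard Roofline notation, which is assumed here rather than taken from the paper: W is a kernel's work in FLOPs, Q its data movement in bytes, I = W/Q its arithmetic intensity, P_peak the peak compute throughput, and B the peak memory bandwidth.

```latex
% Minimal sketch, not the paper's exact formulae.
% Classic Roofline: attainable throughput as a function of arithmetic intensity.
\[
  P_{\mathrm{attainable}} = \min\!\left(P_{\mathrm{peak}},\; I \cdot B\right),
  \qquad I = \frac{W}{Q}
\]
% A time-based reading bounds the run time itself, which makes fixed costs
% such as kernel launch overhead (an additive term assumed here purely for
% illustration) visible in a way the throughput form cannot express.
\[
  T \;\ge\; \max\!\left(\frac{W}{P_{\mathrm{peak}}},\; \frac{Q}{B}\right) + T_{\mathrm{launch}}
\]
```

To make the batch-size and feature-size dependence concrete, the sketch below estimates the arithmetic intensity of a direct 2D convolution. The helper conv2d_arithmetic_intensity is hypothetical, not the paper's tooling; it assumes FP16 tensors whose inputs, weights, and outputs each cross device memory exactly once, and real cache behavior will shift the numbers.

```python
# Back-of-the-envelope arithmetic intensity for a direct, same-padded 2D
# convolution (hypothetical illustration; not the paper's methodology).

def conv2d_arithmetic_intensity(n, c, k, h, w, r=3, s=3, bytes_per_elem=2):
    """Return (flops, bytes, flops/byte) for an (N,C,H,W) -> (N,K,H,W) conv."""
    flops = 2 * n * k * c * h * w * r * s          # multiply + add per MAC
    bytes_moved = bytes_per_elem * (
        n * c * h * w      # input activations
        + k * c * r * s    # weights
        + n * k * h * w    # output activations
    )
    return flops, bytes_moved, flops / bytes_moved

if __name__ == "__main__":
    # Larger batches amortize the (batch-independent) weight traffic,
    # so arithmetic intensity rises toward a compute-bound plateau.
    for batch in (1, 8, 64):
        f, b, ai = conv2d_arithmetic_intensity(batch, c=64, k=64, h=56, w=56)
        print(f"batch={batch:3d}  FLOPs={f:.3e}  bytes={b:.3e}  AI={ai:.1f} FLOP/B")
```

Compared against the machine balance P_peak / B, such an intensity estimate indicates whether a given kernel configuration should be expected to be bandwidth-bound or compute-bound, which is the comparison a Roofline analysis builds on.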