{"title":"Data-driven Forecasting of Deep Learning Performance on GPUs","authors":"Seonho Lee, Amar Phanishayee, Divya Mahajan","doi":"arxiv-2407.13853","DOIUrl":null,"url":null,"abstract":"Deep learning kernels exhibit predictable memory accesses and compute\npatterns, making GPUs' parallel architecture well-suited for their execution.\nSoftware and runtime systems for GPUs are optimized to better utilize the\nstream multiprocessors, on-chip cache, and off-chip high-bandwidth memory. As\ndeep learning models and GPUs evolve, access to newer GPUs is often limited,\nraising questions about the performance of new model architectures on existing\nGPUs, existing models on new GPUs, and new model architectures on new GPUs. To\naddress these questions, we introduce NeuSight, a framework to predict the\nperformance of various deep learning models, for both training and inference,\non unseen GPUs without requiring actual execution. The framework leverages both\nGPU hardware behavior and software library optimizations to estimate end-to-end\nperformance. Previous work uses regression models that capture linear trends or\nmultilayer perceptrons to predict the overall latency of deep learning kernels\non GPUs. These approaches suffer from higher error percentages when forecasting\nperformance on unseen models and new GPUs. Instead, NeuSight decomposes the\nprediction problem into smaller problems, bounding the prediction through\nfundamental performance laws. NeuSight decomposes a single deep learning kernel\nprediction into smaller working sets called tiles, which are executed\nindependently on the GPU. Tile-granularity predictions are determined using a\nmachine learning approach and aggregated to estimate end-to-end latency.\nNeuSight outperforms prior work across various deep learning workloads and the\nlatest GPUs. It reduces the percentage error from 198% and 19.7% to 3.8% in\npredicting the latency of GPT3 model for training and inference on H100,\ncompared to state-of-the-art prior works, where both GPT3 and H100 were not\nused to train the framework.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"23 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Performance","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2407.13853","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Deep learning kernels exhibit predictable memory accesses and compute
patterns, making GPUs' parallel architecture well-suited for their execution.
Software and runtime systems for GPUs are optimized to better utilize the
streaming multiprocessors, on-chip cache, and off-chip high-bandwidth memory. As
deep learning models and GPUs evolve, access to newer GPUs is often limited,
raising questions about the performance of new model architectures on existing
GPUs, existing models on new GPUs, and new model architectures on new GPUs. To
address these questions, we introduce NeuSight, a framework to predict the
performance of various deep learning models, for both training and inference,
on unseen GPUs without requiring actual execution. The framework leverages both
GPU hardware behavior and software library optimizations to estimate end-to-end
performance. Previous work uses regression models that capture linear trends or
multilayer perceptrons to predict the overall latency of deep learning kernels
on GPUs. These approaches suffer from higher error percentages when forecasting
performance on unseen models and new GPUs. Instead, NeuSight decomposes the
prediction problem into smaller problems, bounding the prediction through
fundamental performance laws. NeuSight decomposes a single deep learning kernel
prediction into smaller working sets called tiles, which are executed
independently on the GPU. Tile-granularity predictions are determined using a
machine learning approach and aggregated to estimate end-to-end latency.
NeuSight outperforms prior work across various deep learning workloads and the
latest GPUs. It reduces the percentage error from 198% and 19.7% to 3.8% when
predicting the latency of the GPT3 model for training and inference on the H100,
compared to state-of-the-art prior work, where neither GPT3 nor H100 was
used to train the framework.
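
The abstract does not spell out NeuSight's exact formulation, but the core idea it describes, per-tile predictions bounded by fundamental performance laws and then aggregated into a kernel-level latency, can be sketched roughly as below. Everything in this sketch (the GPUSpec and Tile types, the roofline-style bound, the wave-based aggregation, and the stand-in utilization callable) is an illustrative assumption, not the paper's implementation.

```python
import math
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class GPUSpec:
    peak_flops: float       # peak compute throughput (FLOP/s)
    peak_bandwidth: float   # off-chip memory bandwidth (bytes/s)
    num_sms: int            # number of streaming multiprocessors

@dataclass
class Tile:
    flops: float   # floating-point operations performed by the tile
    bytes: float   # bytes moved to/from off-chip memory by the tile

def roofline_lower_bound(tile: Tile, gpu: GPUSpec) -> float:
    # A tile can finish no faster than its compute time at peak FLOP/s
    # or its data-movement time at peak bandwidth, whichever dominates.
    return max(tile.flops / gpu.peak_flops, tile.bytes / gpu.peak_bandwidth)

def predict_tile_latency(tile: Tile, gpu: GPUSpec,
                         utilization: Callable[[Tile, GPUSpec], float]) -> float:
    # Scale the bound by a predicted utilization factor in (0, 1]; the
    # callable is a placeholder for a learned per-tile model.
    return roofline_lower_bound(tile, gpu) / utilization(tile, gpu)

def predict_kernel_latency(tiles: Iterable[Tile], gpu: GPUSpec,
                           utilization: Callable[[Tile, GPUSpec], float]) -> float:
    # Tiles execute independently; assume they are scheduled in waves
    # across the SMs and each wave is limited by its slowest tile
    # (a deliberately coarse aggregation model).
    per_tile = [predict_tile_latency(t, gpu, utilization) for t in tiles]
    waves = math.ceil(len(per_tile) / gpu.num_sms)
    return waves * max(per_tile)

# Usage: a GEMM-like kernel of 512 identical tiles on a hypothetical GPU,
# with a fixed 60% utilization standing in for the learned predictor.
gpu = GPUSpec(peak_flops=300e12, peak_bandwidth=2e12, num_sms=108)
tiles = [Tile(flops=2 * 128**3, bytes=3 * 128**2 * 2) for _ in range(512)]
print(predict_kernel_latency(tiles, gpu, lambda t, g: 0.6))
```

The bound keeps each per-tile estimate physically plausible regardless of what the learned component outputs, which is the property the abstract attributes to constraining predictions with fundamental performance laws.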