Data-driven Forecasting of Deep Learning Performance on GPUs

Seonho Lee, Amar Phanishayee, Divya Mahajan
{"title":"GPU 上深度学习性能的数据驱动预测","authors":"Seonho Lee, Amar Phanishayee, Divya Mahajan","doi":"arxiv-2407.13853","DOIUrl":null,"url":null,"abstract":"Deep learning kernels exhibit predictable memory accesses and compute\npatterns, making GPUs' parallel architecture well-suited for their execution.\nSoftware and runtime systems for GPUs are optimized to better utilize the\nstream multiprocessors, on-chip cache, and off-chip high-bandwidth memory. As\ndeep learning models and GPUs evolve, access to newer GPUs is often limited,\nraising questions about the performance of new model architectures on existing\nGPUs, existing models on new GPUs, and new model architectures on new GPUs. To\naddress these questions, we introduce NeuSight, a framework to predict the\nperformance of various deep learning models, for both training and inference,\non unseen GPUs without requiring actual execution. The framework leverages both\nGPU hardware behavior and software library optimizations to estimate end-to-end\nperformance. Previous work uses regression models that capture linear trends or\nmultilayer perceptrons to predict the overall latency of deep learning kernels\non GPUs. These approaches suffer from higher error percentages when forecasting\nperformance on unseen models and new GPUs. Instead, NeuSight decomposes the\nprediction problem into smaller problems, bounding the prediction through\nfundamental performance laws. NeuSight decomposes a single deep learning kernel\nprediction into smaller working sets called tiles, which are executed\nindependently on the GPU. Tile-granularity predictions are determined using a\nmachine learning approach and aggregated to estimate end-to-end latency.\nNeuSight outperforms prior work across various deep learning workloads and the\nlatest GPUs. It reduces the percentage error from 198% and 19.7% to 3.8% in\npredicting the latency of GPT3 model for training and inference on H100,\ncompared to state-of-the-art prior works, where both GPT3 and H100 were not\nused to train the framework.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"23 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Data-driven Forecasting of Deep Learning Performance on GPUs\",\"authors\":\"Seonho Lee, Amar Phanishayee, Divya Mahajan\",\"doi\":\"arxiv-2407.13853\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Deep learning kernels exhibit predictable memory accesses and compute\\npatterns, making GPUs' parallel architecture well-suited for their execution.\\nSoftware and runtime systems for GPUs are optimized to better utilize the\\nstream multiprocessors, on-chip cache, and off-chip high-bandwidth memory. As\\ndeep learning models and GPUs evolve, access to newer GPUs is often limited,\\nraising questions about the performance of new model architectures on existing\\nGPUs, existing models on new GPUs, and new model architectures on new GPUs. To\\naddress these questions, we introduce NeuSight, a framework to predict the\\nperformance of various deep learning models, for both training and inference,\\non unseen GPUs without requiring actual execution. The framework leverages both\\nGPU hardware behavior and software library optimizations to estimate end-to-end\\nperformance. 
Previous work uses regression models that capture linear trends or\\nmultilayer perceptrons to predict the overall latency of deep learning kernels\\non GPUs. These approaches suffer from higher error percentages when forecasting\\nperformance on unseen models and new GPUs. Instead, NeuSight decomposes the\\nprediction problem into smaller problems, bounding the prediction through\\nfundamental performance laws. NeuSight decomposes a single deep learning kernel\\nprediction into smaller working sets called tiles, which are executed\\nindependently on the GPU. Tile-granularity predictions are determined using a\\nmachine learning approach and aggregated to estimate end-to-end latency.\\nNeuSight outperforms prior work across various deep learning workloads and the\\nlatest GPUs. It reduces the percentage error from 198% and 19.7% to 3.8% in\\npredicting the latency of GPT3 model for training and inference on H100,\\ncompared to state-of-the-art prior works, where both GPT3 and H100 were not\\nused to train the framework.\",\"PeriodicalId\":501291,\"journal\":{\"name\":\"arXiv - CS - Performance\",\"volume\":\"23 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-07-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Performance\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2407.13853\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Performance","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2407.13853","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Deep learning kernels exhibit predictable memory accesses and compute patterns, making GPUs' parallel architecture well suited for their execution. Software and runtime systems for GPUs are optimized to better utilize the streaming multiprocessors, on-chip cache, and off-chip high-bandwidth memory. As deep learning models and GPUs evolve, access to newer GPUs is often limited, raising questions about the performance of new model architectures on existing GPUs, existing models on new GPUs, and new model architectures on new GPUs. To address these questions, we introduce NeuSight, a framework that predicts the performance of various deep learning models, for both training and inference, on unseen GPUs without requiring actual execution. The framework leverages both GPU hardware behavior and software library optimizations to estimate end-to-end performance. Previous work uses regression models that capture linear trends, or multilayer perceptrons, to predict the overall latency of deep learning kernels on GPUs. These approaches suffer from higher error percentages when forecasting performance on unseen models and new GPUs. Instead, NeuSight decomposes the prediction problem into smaller problems and bounds the predictions with fundamental performance laws. It splits a single deep learning kernel prediction into smaller working sets, called tiles, which are executed independently on the GPU. Tile-granularity predictions are made with a machine learning approach and aggregated to estimate end-to-end latency. NeuSight outperforms prior work across various deep learning workloads and the latest GPUs. Compared to state-of-the-art prior work, it reduces the percentage error in predicting the latency of the GPT3 model for training and inference on the H100 from 198% and 19.7% to 3.8%, where neither GPT3 nor the H100 was used to train the framework.
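The abstract outlines the core structure of the approach: decompose a kernel prediction into tiles, bound each tile's latency with fundamental performance laws, and aggregate the per-tile estimates into an end-to-end latency. The sketch below illustrates that structure for a single GEMM kernel in Python. Everything in it is an assumption for illustration only: the GPUSpec numbers (roughly H100-class public figures), the 128x128 tile size, the fixed 0.7 utilization constant, and the function names predict_tile_latency_us and predict_gemm_latency_us are not taken from the paper, and the analytical per-tile bound stands in for the learned model that NeuSight actually trains.

```python
"""Illustrative sketch of tile-level latency prediction for a GEMM kernel.
Hardware numbers, tile sizes, and the per-tile model are assumptions for
demonstration only; they do not reproduce the NeuSight implementation."""

from dataclasses import dataclass
import math


@dataclass
class GPUSpec:
    # Rough, publicly quoted H100-class figures; adjust for the target GPU.
    num_sms: int = 132
    peak_tflops: float = 989.0         # dense FP16 tensor-core throughput
    mem_bandwidth_gbs: float = 3350.0  # HBM bandwidth


def predict_tile_latency_us(flops: float, bytes_moved: float, gpu: GPUSpec) -> float:
    """Bound one tile's latency with fundamental performance laws (a roofline):
    it can finish no faster than its compute time or its memory time on the
    single SM it occupies. NeuSight learns the per-tile behavior from data;
    the fixed utilization constant here is purely an illustrative stand-in."""
    per_sm_flops = gpu.peak_tflops * 1e12 / gpu.num_sms
    per_sm_bytes = gpu.mem_bandwidth_gbs * 1e9 / gpu.num_sms
    compute_us = flops / per_sm_flops * 1e6
    memory_us = bytes_moved / per_sm_bytes * 1e6
    utilization = 0.7  # assumed achieved fraction of the roofline bound
    return max(compute_us, memory_us) / utilization


def predict_gemm_latency_us(m: int, n: int, k: int, gpu: GPUSpec,
                            tile_m: int = 128, tile_n: int = 128,
                            dtype_bytes: int = 2) -> float:
    """Decompose an (m x k) @ (k x n) GEMM into independent output tiles,
    predict each tile, and aggregate over waves of SMs. This simple model
    ignores cross-tile data reuse in the L2 cache, so it is pessimistic
    for large matrix multiplications."""
    num_tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    flops_per_tile = 2 * tile_m * tile_n * k
    bytes_per_tile = dtype_bytes * (tile_m * k + k * tile_n + tile_m * tile_n)
    tile_us = predict_tile_latency_us(flops_per_tile, bytes_per_tile, gpu)
    waves = math.ceil(num_tiles / gpu.num_sms)  # tiles run concurrently, one per SM
    return waves * tile_us


if __name__ == "__main__":
    gpu = GPUSpec()
    # A transformer-style GEMM: (batch * seq) x hidden by hidden x (4 * hidden).
    est = predict_gemm_latency_us(m=8192, k=12288, n=49152, gpu=gpu)
    print(f"Estimated GEMM latency on the assumed GPU: {est / 1000:.2f} ms")
```

In the paper's framing, the fixed utilization factor would be replaced by a model trained on measured tile executions, which is what allows extrapolation to unseen models and GPUs, while the roofline-style bound keeps each tile prediction within physically plausible limits.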