{"title":"Data-driven Forecasting of Deep Learning Performance on GPUs","authors":"Seonho Lee, Amar Phanishayee, Divya Mahajan","doi":"arxiv-2407.13853","DOIUrl":null,"url":null,"abstract":"Deep learning kernels exhibit predictable memory accesses and compute\npatterns, making GPUs' parallel architecture well-suited for their execution.\nSoftware and runtime systems for GPUs are optimized to better utilize the\nstream multiprocessors, on-chip cache, and off-chip high-bandwidth memory. As\ndeep learning models and GPUs evolve, access to newer GPUs is often limited,\nraising questions about the performance of new model architectures on existing\nGPUs, existing models on new GPUs, and new model architectures on new GPUs. To\naddress these questions, we introduce NeuSight, a framework to predict the\nperformance of various deep learning models, for both training and inference,\non unseen GPUs without requiring actual execution. The framework leverages both\nGPU hardware behavior and software library optimizations to estimate end-to-end\nperformance. Previous work uses regression models that capture linear trends or\nmultilayer perceptrons to predict the overall latency of deep learning kernels\non GPUs. These approaches suffer from higher error percentages when forecasting\nperformance on unseen models and new GPUs. Instead, NeuSight decomposes the\nprediction problem into smaller problems, bounding the prediction through\nfundamental performance laws. NeuSight decomposes a single deep learning kernel\nprediction into smaller working sets called tiles, which are executed\nindependently on the GPU. Tile-granularity predictions are determined using a\nmachine learning approach and aggregated to estimate end-to-end latency.\nNeuSight outperforms prior work across various deep learning workloads and the\nlatest GPUs. It reduces the percentage error from 198% and 19.7% to 3.8% in\npredicting the latency of GPT3 model for training and inference on H100,\ncompared to state-of-the-art prior works, where both GPT3 and H100 were not\nused to train the framework.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"23 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Performance","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2407.13853","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Deep learning kernels exhibit predictable memory accesses and compute
patterns, making GPUs' parallel architecture well-suited for their execution.
Software and runtime systems for GPUs are optimized to better utilize the
streaming multiprocessors, on-chip cache, and off-chip high-bandwidth memory. As
deep learning models and GPUs evolve, access to newer GPUs is often limited,
raising questions about the performance of new model architectures on existing
GPUs, existing models on new GPUs, and new model architectures on new GPUs. To
address these questions, we introduce NeuSight, a framework to predict the
performance of various deep learning models, for both training and inference,
on unseen GPUs without requiring actual execution. The framework leverages both
GPU hardware behavior and software library optimizations to estimate end-to-end
performance. Previous work uses regression models that capture linear trends or
multilayer perceptrons to predict the overall latency of deep learning kernels
on GPUs. These approaches suffer from higher error percentages when forecasting
performance on unseen models and new GPUs. Instead, NeuSight decomposes the
prediction problem into smaller problems, bounding the prediction through
fundamental performance laws. NeuSight decomposes a single deep learning kernel
prediction into smaller working sets called tiles, which are executed
independently on the GPU. Tile-granularity predictions are determined using a
machine learning approach and aggregated to estimate end-to-end latency.
NeuSight outperforms prior work across various deep learning workloads and the
latest GPUs. It reduces the percentage error from 198% and 19.7% to 3.8% when
predicting the latency of the GPT3 model for training and inference on the H100,
compared to state-of-the-art prior work, where neither GPT3 nor H100 was
used to train the framework.
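
The abstract does not spell out NeuSight's exact formulation, but the core idea it describes, per-tile predictions bounded by fundamental performance laws and then aggregated into a kernel-level latency, can be sketched roughly as below. Everything in this sketch (the GPUSpec and Tile types, the roofline-style bound, the wave-based aggregation, and the stand-in utilization callable) is an illustrative assumption, not the paper's implementation.

```python
import math
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class GPUSpec:
    peak_flops: float       # peak compute throughput (FLOP/s)
    peak_bandwidth: float   # off-chip memory bandwidth (bytes/s)
    num_sms: int            # number of streaming multiprocessors

@dataclass
class Tile:
    flops: float   # floating-point operations performed by the tile
    bytes: float   # bytes moved to/from off-chip memory by the tile

def roofline_lower_bound(tile: Tile, gpu: GPUSpec) -> float:
    # A tile can finish no faster than its compute time at peak FLOP/s
    # or its data-movement time at peak bandwidth, whichever dominates.
    return max(tile.flops / gpu.peak_flops, tile.bytes / gpu.peak_bandwidth)

def predict_tile_latency(tile: Tile, gpu: GPUSpec,
                         utilization: Callable[[Tile, GPUSpec], float]) -> float:
    # Scale the bound by a predicted utilization factor in (0, 1]; the
    # callable is a placeholder for a learned per-tile model.
    return roofline_lower_bound(tile, gpu) / utilization(tile, gpu)

def predict_kernel_latency(tiles: Iterable[Tile], gpu: GPUSpec,
                           utilization: Callable[[Tile, GPUSpec], float]) -> float:
    # Tiles execute independently; assume they are scheduled in waves
    # across the SMs and each wave is limited by its slowest tile
    # (a deliberately coarse aggregation model).
    per_tile = [predict_tile_latency(t, gpu, utilization) for t in tiles]
    waves = math.ceil(len(per_tile) / gpu.num_sms)
    return waves * max(per_tile)

# Usage: a GEMM-like kernel of 512 identical tiles on a hypothetical GPU,
# with a fixed 60% utilization standing in for the learned predictor.
gpu = GPUSpec(peak_flops=300e12, peak_bandwidth=2e12, num_sms=108)
tiles = [Tile(flops=2 * 128**3, bytes=3 * 128**2 * 2) for _ in range(512)]
print(predict_kernel_latency(tiles, gpu, lambda t, g: 0.6))
```

The bound keeps each per-tile estimate physically plausible regardless of what the learned component outputs, which is the property the abstract attributes to constraining predictions with fundamental performance laws.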