PowerTrain: Fast, generalizable time and power prediction models to optimize DNN training on accelerated edges

Future Generation Computer Systems-The International Journal of Escience · IF 6.2 · Zone 2 (Computer Science) · JCR Q1 (Computer Science, Theory & Methods) · Pub Date: 2024-12-01 · Epub Date: 2024-07-06 · DOI: 10.1016/j.future.2024.07.001

Prashanthi S.K., Saisamarth Taluri, Beautlin S, Lakshya Karwa, Yogesh Simmhan

Volume 161, Pages 329-344 · Journal Article · https://www.sciencedirect.com/science/article/pii/S0167739X24003649


Abstract

Accelerated edge devices, like Nvidia’s Jetson with 1000+ CUDA cores, are increasingly used for DNN training and federated learning, rather than just for inferencing workloads. A distinctive feature of these compact devices is fine-grained control over CPU, GPU, and memory frequencies and the number of active CPU cores, which can limit their power envelope in a constrained setting at the cost of throttled compute performance. Given this vast 10k+ parameter space, selecting a power mode for dynamically arriving training workloads to exploit power–performance trade-offs either requires costly profiling for each new workload or is done ad hoc. We propose PowerTrain, a transfer-learning approach to accurately predict the power and time that will be consumed when training a given DNN workload (model + dataset) using any specified power mode (CPU/GPU/memory frequencies, core count). It requires a one-time offline profiling of thousands of power modes for a reference DNN workload on a single Jetson device (Orin AGX) to build Neural Network (NN) based prediction models for time and power. These NN models are subsequently transferred (retrained) for a new DNN workload, or even a different Jetson device, with minimal additional profiling of just 50 power modes to make accurate time and power predictions. The predictions are then used to rapidly construct the Pareto front and select the optimal power mode for the new workload, e.g., to minimize training time while meeting a power limit. PowerTrain’s predictions are robust to new workloads, exhibiting a low MAPE of <6% for power and <15% for time on six new training workloads (MobileNet, YOLO, BERT, LSTM, etc.) for up to 4400 power modes, when transferred from a ResNet reference workload on Orin AGX. It is also resilient when transferred to two entirely new Jetson devices (Xavier AGX and Jetson Orin Nano), with prediction errors of <14.5% and <11%. These outperform baseline predictions by more than 10% and baseline optimizations by up to 45% on time and 88% on power.
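The last step described in the abstract, building a Pareto front from predicted time and power and picking the fastest mode under a power limit, can be illustrated with a minimal sketch. This is not the authors' implementation: the `Prediction` record, the function names, and all numbers are hypothetical, and in practice the predictions would come from the transferred NN models rather than a hand-written list. The `mape` helper shows the standard mean absolute percentage error used to report prediction quality.

```python
from typing import NamedTuple, Optional


class Prediction(NamedTuple):
    """Hypothetical record: predicted cost of training under one power mode."""
    mode: str       # illustrative label for a power mode
    time_s: float   # predicted training time (seconds)
    power_w: float  # predicted power draw (watts)


def pareto_front(preds: list[Prediction]) -> list[Prediction]:
    """Return modes not dominated in (time, power), sorted by increasing time."""
    front = []
    best_power = float("inf")
    # Scan from fastest to slowest; a mode survives only if it draws
    # strictly less power than every faster (or equally fast) mode seen so far.
    for p in sorted(preds, key=lambda p: (p.time_s, p.power_w)):
        if p.power_w < best_power:
            front.append(p)
            best_power = p.power_w
    return front


def fastest_within_power(preds: list[Prediction],
                         power_limit_w: float) -> Optional[Prediction]:
    """Pick the fastest Pareto-optimal mode whose power is within the limit."""
    feasible = [p for p in pareto_front(preds) if p.power_w <= power_limit_w]
    return min(feasible, key=lambda p: p.time_s) if feasible else None


def mape(actual: list[float], predicted: list[float]) -> float:
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


# Made-up predictions for five hypothetical power modes.
preds = [
    Prediction("A", 10.0, 50.0),
    Prediction("B", 14.0, 30.0),
    Prediction("C", 15.0, 35.0),  # dominated by B (slower and more power)
    Prediction("D", 25.0, 15.0),
    Prediction("E", 12.0, 55.0),  # dominated by A (slower and more power)
]

front = pareto_front(preds)
choice = fastest_within_power(preds, power_limit_w=40.0)
print([p.mode for p in front], choice.mode if choice else None)
```

Restricting the search to the Pareto front loses nothing here: any feasible mode that is off the front is dominated by a front mode that is at least as fast and draws no more power, so that front mode is feasible too.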


Source journal

Future Generation Computer Systems-The International Journal of Escience (Engineering & Technology: Computer Science, Theory & Methods)

CiteScore: 19.90
Self-citation rate: 2.70%
Articles published: 376
Review time: 10.6 months

Journal description: Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications. Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration. Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.