Proteus: Simulating the Performance of Distributed DNN Training

IF 5.6 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | IEEE Transactions on Parallel and Distributed Systems | Pub Date: 2024-08-14 | DOI: 10.1109/TPDS.2024.3443255

Jiangfei Duan; Xiuhong Li; Ping Xu; Xingcheng Zhang; Shengen Yan; Yun Liang; Dahua Lin

Volume 35, Issue 10, pp. 1867-1878. Journal Article. Full text: https://ieeexplore.ieee.org/document/10636756/

Citations: 0

Abstract


Proteus: Simulating the Performance of Distributed DNN Training

DNN models are becoming increasingly larger to achieve unprecedented accuracy, and the accompanying increased computation and memory requirements necessitate the employment of massive clusters and elaborate parallelization strategies to accelerate DNN training. In order to better optimize the performance and analyze the cost, it is indispensable to model the training throughput of distributed DNN training. However, complex parallelization strategies and the resulting complex runtime behaviors make it challenging to construct an accurate performance model. In this article, we present Proteus, the first standalone simulator to model the performance of complex parallelization strategies through simulation execution. Proteus first models complex parallelization strategies with a unified representation named Strategy Tree. Then, it compiles the strategy tree into a distributed execution graph and simulates the complex runtime behaviors, comp-comm overlap and bandwidth sharing, with a Hierarchical Topo-Aware Executor (HTAE). We finally evaluate Proteus across a wide variety of DNNs on three hardware configurations. Experimental results show that Proteus achieves 3.0% average prediction error and preserves order for training throughput of various parallelization strategies. Compared to state-of-the-art approaches, Proteus reduces prediction error by up to 133.8%.
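To make the terms in the abstract concrete, the following Python sketch illustrates comp-comm overlap, bandwidth sharing, average prediction error, and order preservation on a toy scale. It is not Proteus's Strategy Tree or HTAE. Every function name, the even-split bandwidth model, and the error definition (mean absolute relative error) are assumptions made for illustration, and the numbers in the example are invented.

```python
from itertools import combinations


def iteration_time(compute_s, comm_bytes, link_bw_bytes_per_s,
                   concurrent_flows=1, overlap=True):
    """Toy per-iteration time estimate (hypothetical model, not HTAE)."""
    # Bandwidth sharing (assumed): the link is split evenly among
    # the flows that use it at the same time.
    effective_bw = link_bw_bytes_per_s / concurrent_flows
    comm_s = comm_bytes / effective_bw
    # Comp-comm overlap (assumed ideal): fully overlapped work costs the
    # longer of the two phases; without overlap they run back to back.
    return max(compute_s, comm_s) if overlap else compute_s + comm_s


def mean_prediction_error(predicted, measured):
    """Mean absolute relative error in percent (assumed definition)."""
    errors = [abs(p - m) / m for p, m in zip(predicted, measured)]
    return 100.0 * sum(errors) / len(errors)


def preserves_order(predicted, measured):
    """True if no pair of strategies is ranked in opposite order."""
    for i, j in combinations(range(len(measured)), 2):
        if (predicted[i] - predicted[j]) * (measured[i] - measured[j]) < 0:
            return False
    return True


# Invented throughputs (samples/s) for three parallelization strategies.
measured = [100.0, 80.0, 120.0]
predicted = [103.0, 78.0, 117.0]
print(mean_prediction_error(predicted, measured))
print(preserves_order(predicted, measured))
```

Under these assumptions, a simulator "preserves order" when it ranks candidate strategies the same way real measurements do, which is what matters when the simulator is used to choose among strategies rather than to predict absolute throughput.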


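Source journal

IEEE Transactions on Parallel and Distributed Systems (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 11.00
Self-citation rate: 9.40%
Articles published: 281
Review time: 5.6 months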

About the journal: IEEE Transactions on Parallel and Distributed Systems (TPDS) is published monthly. It publishes a range of papers, comments on previously published papers, and survey articles that deal with the parallel and distributed systems research areas of current importance to our readers. Particular areas of interest include, but are not limited to:

a) Parallel and distributed algorithms, focusing on topics such as: models of computation; numerical, combinatorial, and data-intensive parallel algorithms, scalability of algorithms and data structures for parallel and distributed systems, communication and synchronization protocols, network algorithms, scheduling, and load balancing.

b) Applications of parallel and distributed computing, including computational and data-enabled science and engineering, big data applications, parallel crowd sourcing, large-scale social network analysis, management of big data, cloud and grid computing, scientific and biomedical applications, mobile computing, and cyber-physical systems.

c) Parallel and distributed architectures, including architectures for instruction-level and thread-level parallelism; design, analysis, implementation, fault resilience and performance measurements of multiple-processor systems; multicore processors, heterogeneous many-core systems; petascale and exascale systems designs; novel big data architectures; special purpose architectures, including graphics processors, signal processors, network processors, media accelerators, and other special purpose processors and accelerators; impact of technology on architecture; network and interconnect architectures; parallel I/O and storage systems; architecture of the memory hierarchy; power-efficient and green computing architectures; dependable architectures; and performance modeling and evaluation.

d) Parallel and distributed software, including parallel and multicore programming languages and compilers, runtime systems, operating systems, Internet computing and web services, resource management including green computing, middleware for grids, clouds, and data centers, libraries, performance modeling and evaluation, parallel programming paradigms, and programming environments and tools.
