Revisiting the Time Cost Model of AllReduce

Dian Xiong, Li Chen, Youhe Jiang, Dan Li, Shuai Wang, Songtao Wang

arXiv:2409.04202 · arXiv - CS - Distributed, Parallel, and Cluster Computing · 2024-09-06
AllReduce is an important and widely used collective communication primitive in areas such as distributed machine learning and high-performance computing. A time cost model plays a crucial role in designing, analyzing, and choosing among the many AllReduce algorithms and implementations, and the predominant one is the $(\alpha,\beta,\gamma)$ model.
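For context, the $(\alpha,\beta,\gamma)$ model charges a fixed per-message latency $\alpha$, a per-byte transfer time $\beta$, and a per-byte reduction (computation) time $\gamma$. As a standard textbook instantiation (not specific to this paper), the ring AllReduce of an $n$-byte buffer across $p$ nodes costs

$$T_{\text{ring}}(n, p) \;=\; 2(p-1)\,\alpha \;+\; 2\,\frac{p-1}{p}\,n\beta \;+\; \frac{p-1}{p}\,n\gamma,$$

which is bandwidth-optimal under this model but accumulates latency linearly in $p$.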
In this paper, we revisit this model and show that it cannot accurately characterize the time cost of AllReduce on modern clusters, and thus must be updated. We perform extensive measurements and identify two additional terms that contribute to the time cost: the incast term and the memory access term. Augmenting the $(\alpha,\beta,\gamma)$ model with these two terms yields GenModel.
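As a rough sketch of the augmented model's shape (illustrative notation only; the paper's exact formulation may differ), GenModel adds a term that grows with the incast degree $k$ (the number of flows converging on one receiver) and a term charging memory traffic at a per-byte cost $\mu$:

$$T(n) \;\approx\; \alpha \;+\; n\beta \;+\; n\gamma \;+\; \underbrace{f(k)\,n}_{\text{incast}} \;+\; \underbrace{\mu\, m(n)}_{\text{memory access}},$$

where $f(k)$ and the memory traffic $m(n)$ are placeholder functions; the paper derives the concrete forms from its measurements.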
Using GenModel, we discover two new optimalities for AllReduce algorithms, and prove that they cannot be achieved simultaneously.
Finally, striking a balance between the two new optimalities, we design GenTree, an AllReduce plan generation algorithm specialized for tree-like topologies.
Experiments on a real testbed with 64 GPUs show that GenTree achieves a 1.22$\times$ to 1.65$\times$ speedup over NCCL. Large-scale simulations further confirm that GenTree improves on the state-of-the-art AllReduce algorithm by a factor of 1.2 to 7.4 in scenarios where the two new terms dominate.