Roar: A Router Microarchitecture for In-network Allreduce
Authors: Ruiqi Wang, Dezun Dong, Fei Lei, Junchao Ma, Ketong Wu, KaiCheng Lu
Published in: Proceedings of the 37th International Conference on Supercomputing, 2023-06-21
DOI: https://doi.org/10.1145/3577193.3593711
The allreduce operation is the most commonly used collective operation in distributed or parallel applications. It aggregates data collected from distributed hosts and broadcasts the aggregated result back to them. In-network computing can accelerate allreduce by offloading this operation into network devices. However, existing in-network solutions struggle to achieve high throughput, to aggregate large messages efficiently, and to produce repeatable results. In this work, we propose a simple and effective router microarchitecture for in-network allreduce, which uses an RDMA protocol to improve its throughput. We further discuss strategies to tackle the aforementioned challenges. Experiments demonstrate that our approach not only shows advantages over state-of-the-art in-network solutions, but also accelerates allreduce to a near-optimal level relative to host-based algorithms.
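For readers unfamiliar with the operation, the sketch below illustrates only the semantics of allreduce described above: an element-wise aggregation across hosts followed by a broadcast of the result. It is not the router microarchitecture proposed in the paper, and the function name `allreduce` and the list-of-lists representation of host buffers are assumptions made for illustration. The second half shows one common reason repeatability is hard for sum reductions, namely that floating-point addition is not associative; whether this is the exact repeatability issue the authors address is an assumption.

```python
from functools import reduce
import operator


def allreduce(host_buffers, op=operator.add):
    """Reference semantics only (hypothetical helper, not the paper's design).

    host_buffers: one equal-length list of values per host.
    Returns one result list per host; every host gets the same aggregate.
    """
    if not host_buffers:
        return []
    length = len(host_buffers[0])
    if any(len(buf) != length for buf in host_buffers):
        raise ValueError("all hosts must contribute buffers of equal length")
    # Aggregation step: combine the i-th element from every host, in host order.
    aggregated = [reduce(op, column) for column in zip(*host_buffers)]
    # Broadcast step: every host receives its own copy of the aggregate.
    return [list(aggregated) for _ in host_buffers]


# Three hosts, each contributing a three-element buffer.
hosts = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
result = allreduce(hosts)

# Same floating-point contributions, aggregated in two different orders.
order_a = allreduce([[1e16], [1.0], [-1e16]])[0][0]
order_b = allreduce([[1e16], [-1e16], [1.0]])[0][0]
```

Here every host ends up with `[111, 222, 333]`. The two floating-point runs combine identical values yet differ (`order_a` is `0.0`, `order_b` is `1.0`), which is why an aggregator that sums contributions in arrival order may not return bit-identical results from run to run.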