{"title":"深度学习工作负载的分层分布式内存多leader MPI-Allreduce","authors":"Truong Thao Nguyen, M. Wahib, Ryousei Takano","doi":"10.1109/CANDARW.2018.00048","DOIUrl":null,"url":null,"abstract":"Driven by the increase in complexity and size in Deep Learning models, training models on large-scale GPUs-accelerated clusters is becoming a commonplace. One of the main challenges for distributed training is the collective communication overhead for the very large message size: from several to hundreds of MB. In this paper, we exploit two hierarchical distributed-memory multi-leader allreduce algorithms optimized for GPU-accelerated clusters (named lr_lr and lr_rab). In which, one node performs the inter-node data transfer in parallel using other GPUs that are designated as node leaders. Each leader keeps and exchanges a partial result of local reduced values rather than the whole one. Hence we are capable of significantly reducing the time for injecting data into the internode network. We evaluate these algorithms on the discreteevent simulation Simgrid. We show that our algorithms, lr_lr and lr_rab, can cut down the execution time of an Allreduce microbenchmark that uses logical ring algorithm (lr) by up to 45% and 51%, respectively. In addition, saving the power consumption of network devices of up to 23% and 32% are projected.","PeriodicalId":329439,"journal":{"name":"2018 Sixth International Symposium on Computing and Networking Workshops (CANDARW)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":"{\"title\":\"Hierarchical Distributed-Memory Multi-Leader MPI-Allreduce for Deep Learning Workloads\",\"authors\":\"Truong Thao Nguyen, M. Wahib, Ryousei Takano\",\"doi\":\"10.1109/CANDARW.2018.00048\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Driven by the increase in complexity and size in Deep Learning models, training models on large-scale GPUs-accelerated clusters is becoming a commonplace. One of the main challenges for distributed training is the collective communication overhead for the very large message size: from several to hundreds of MB. In this paper, we exploit two hierarchical distributed-memory multi-leader allreduce algorithms optimized for GPU-accelerated clusters (named lr_lr and lr_rab). In which, one node performs the inter-node data transfer in parallel using other GPUs that are designated as node leaders. Each leader keeps and exchanges a partial result of local reduced values rather than the whole one. Hence we are capable of significantly reducing the time for injecting data into the internode network. We evaluate these algorithms on the discreteevent simulation Simgrid. We show that our algorithms, lr_lr and lr_rab, can cut down the execution time of an Allreduce microbenchmark that uses logical ring algorithm (lr) by up to 45% and 51%, respectively. 
In addition, saving the power consumption of network devices of up to 23% and 32% are projected.\",\"PeriodicalId\":329439,\"journal\":{\"name\":\"2018 Sixth International Symposium on Computing and Networking Workshops (CANDARW)\",\"volume\":\"47 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"11\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 Sixth International Symposium on Computing and Networking Workshops (CANDARW)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CANDARW.2018.00048\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 Sixth International Symposium on Computing and Networking Workshops (CANDARW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CANDARW.2018.00048","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Hierarchical Distributed-Memory Multi-Leader MPI-Allreduce for Deep Learning Workloads
Driven by the increase in complexity and size of Deep Learning models, training models on large-scale GPU-accelerated clusters is becoming commonplace. One of the main challenges for distributed training is the collective communication overhead for very large message sizes, ranging from several MB to hundreds of MB. In this paper, we exploit two hierarchical distributed-memory multi-leader allreduce algorithms optimized for GPU-accelerated clusters, named lr_lr and lr_rab. In these algorithms, each node performs inter-node data transfers in parallel through multiple GPUs designated as node leaders. Each leader keeps and exchanges only a partial result of the locally reduced values rather than the entire result. Hence, we are able to significantly reduce the time for injecting data into the inter-node network. We evaluate these algorithms with the discrete-event simulator SimGrid. We show that our algorithms, lr_lr and lr_rab, can cut the execution time of an Allreduce microbenchmark that uses the logical ring algorithm (lr) by up to 45% and 51%, respectively. In addition, savings in the power consumption of network devices of up to 23% and 32%, respectively, are projected.
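To make the hierarchical multi-leader idea concrete, the sketch below illustrates the general scheme the abstract describes (intra-node reduce-scatter, per-leader inter-node allreduce on a partial result, intra-node allgather) using mpi4py and NumPy. It is a minimal illustration, not the authors' GPU implementation: the function name multi_leader_allreduce, the gpus_per_node parameter, the one-rank-per-GPU mapping, and the assumption that the message length and the world size are divisible by gpus_per_node are all choices made here for the example.

```python
# Minimal sketch of a hierarchical multi-leader allreduce, assuming one MPI
# rank per GPU and gpus_per_node ranks per node. Not the paper's lr_lr/lr_rab
# implementation; the inter-node step here is a plain Allreduce, whereas the
# paper plugs in the logical-ring (lr) or Rabenseifner (rab) algorithm.
import numpy as np
from mpi4py import MPI

def multi_leader_allreduce(sendbuf, gpus_per_node):
    world = MPI.COMM_WORLD
    rank = world.Get_rank()

    node_id = rank // gpus_per_node    # which node this rank lives on
    local_id = rank % gpus_per_node    # leader index inside the node

    # Communicator of all ranks (GPUs) within one node.
    intra = world.Split(color=node_id, key=local_id)
    # Communicator of the ranks holding the same leader index on every node.
    inter = world.Split(color=local_id, key=node_id)

    chunk = sendbuf.size // gpus_per_node

    # Step 1: intra-node reduce-scatter -- each leader ends up with the
    # node-local reduction of its own 1/gpus_per_node slice of the message.
    my_part = np.empty(chunk, dtype=sendbuf.dtype)
    intra.Reduce_scatter_block(sendbuf, my_part, op=MPI.SUM)

    # Step 2: every leader runs an inter-node allreduce on its partial result
    # only, so each leader injects just 1/gpus_per_node of the total message
    # into the inter-node network, and the leaders do so in parallel.
    inter.Allreduce(MPI.IN_PLACE, my_part, op=MPI.SUM)

    # Step 3: intra-node allgather so every GPU holds the full reduced vector.
    recvbuf = np.empty_like(sendbuf)
    intra.Allgather(my_part, recvbuf)

    intra.Free()
    inter.Free()
    return recvbuf

if __name__ == "__main__":
    # Toy 4 MB gradient; run with a rank count that is a multiple of 4,
    # e.g. mpirun -n 8 python this_file.py
    data = np.ones(1 << 20, dtype=np.float32)
    result = multi_leader_allreduce(data, gpus_per_node=4)
```

The point of the structure is visible in step 2: instead of a single leader per node sending the whole reduced buffer, every GPU acts as a leader for one shard, so inter-node injection time shrinks roughly in proportion to the number of leaders per node.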