{"title":"分布式深度学习大规模训练中混合电/光开关架构的可行性研究","authors":"Truong Thao Nguyen, Ryousei Takano","doi":"10.1109/PHOTONICS49561.2019.00007","DOIUrl":null,"url":null,"abstract":"Data parallelism is the dominant method used to train deep learning (DL) model on High-Performance Computing systems such as large-scale GPU clusters. When training a DL model on a large number of nodes, inter-node communication becomes bottle-neck due to its relatively higher latency and lower link bandwidth (than intra-node communication). To cope with this problem, some techniques have been proposed to (a) optimize the collective communication algorithms that take into account the network topology, (b) reduce the message size, and (c) overlap the communication and computation. All of these approaches target to deal with the large message size issue while diminishing the effect of the limitation of the inter-node network. In this study, we investigate the benefit of increasing inter-node link bandwidth by using the hybrid switching systems, i.e., Electrical Packet Switching and Optical Circuit Switching. We found that the typical data-transfer of synchronous data-parallelism training are long-live and rarely changed that can be speed-up with optical switching. Simulation results on Simgrid simulator show that our approach speed-up the training time of deep learning application around 10%.","PeriodicalId":64491,"journal":{"name":"光学与光子学期刊(英文)","volume":"26 1","pages":"7-14"},"PeriodicalIF":0.0000,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"On the Feasibility of Hybrid Electrical/Optical Switch Architecture for Large-Scale Training of Distributed Deep Learning\",\"authors\":\"Truong Thao Nguyen, Ryousei Takano\",\"doi\":\"10.1109/PHOTONICS49561.2019.00007\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Data parallelism is the dominant method used to train deep learning (DL) model on High-Performance Computing systems such as large-scale GPU clusters. When training a DL model on a large number of nodes, inter-node communication becomes bottle-neck due to its relatively higher latency and lower link bandwidth (than intra-node communication). To cope with this problem, some techniques have been proposed to (a) optimize the collective communication algorithms that take into account the network topology, (b) reduce the message size, and (c) overlap the communication and computation. All of these approaches target to deal with the large message size issue while diminishing the effect of the limitation of the inter-node network. In this study, we investigate the benefit of increasing inter-node link bandwidth by using the hybrid switching systems, i.e., Electrical Packet Switching and Optical Circuit Switching. We found that the typical data-transfer of synchronous data-parallelism training are long-live and rarely changed that can be speed-up with optical switching. 
Simulation results on Simgrid simulator show that our approach speed-up the training time of deep learning application around 10%.\",\"PeriodicalId\":64491,\"journal\":{\"name\":\"光学与光子学期刊(英文)\",\"volume\":\"26 1\",\"pages\":\"7-14\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"光学与光子学期刊(英文)\",\"FirstCategoryId\":\"1089\",\"ListUrlMain\":\"https://doi.org/10.1109/PHOTONICS49561.2019.00007\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"光学与光子学期刊(英文)","FirstCategoryId":"1089","ListUrlMain":"https://doi.org/10.1109/PHOTONICS49561.2019.00007","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
On the Feasibility of Hybrid Electrical/Optical Switch Architecture for Large-Scale Training of Distributed Deep Learning
Data parallelism is the dominant method used to train deep learning (DL) models on high-performance computing systems such as large-scale GPU clusters. When training a DL model on a large number of nodes, inter-node communication becomes a bottleneck because of its higher latency and lower link bandwidth compared with intra-node communication. To cope with this problem, several techniques have been proposed to (a) optimize collective communication algorithms to take the network topology into account, (b) reduce the message size, and (c) overlap communication with computation. All of these approaches aim to deal with large message sizes while diminishing the effect of the limited inter-node network. In this study, we investigate the benefit of increasing inter-node link bandwidth by using a hybrid switching system, i.e., Electrical Packet Switching (EPS) combined with Optical Circuit Switching (OCS). We find that the typical data transfers of synchronous data-parallel training are long-lived and rarely change, so they can be sped up with optical switching. Simulation results on the SimGrid simulator show that our approach shortens the training time of deep learning applications by around 10%.
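The observation behind the hybrid approach is that the gradient-exchange traffic of synchronous data-parallel training repeats the same node-to-node pattern every iteration, so its heavy flows can be pre-provisioned on optical circuits. The sketch below is a minimal, hypothetical illustration of that idea, not the authors' implementation: `ring_allreduce_traffic`, `split_flows`, and the byte threshold are assumed names and values chosen only to show how large, stable flows could be separated from small, bursty ones.

```python
def ring_allreduce_traffic(num_nodes: int, gradient_bytes: int) -> dict:
    """Per-iteration traffic matrix of a ring all-reduce.

    Each node sends roughly 2*(N-1)/N of the gradient volume to its ring
    successor, and this pattern is identical in every training iteration:
    the flows are long-lived and rarely change, which is what makes them
    candidates for pre-provisioned optical circuits.
    """
    per_link = 2 * (num_nodes - 1) * gradient_bytes // num_nodes
    return {(src, (src + 1) % num_nodes): per_link for src in range(num_nodes)}


def split_flows(traffic: dict, ocs_threshold_bytes: int):
    """Toy hybrid scheduler: flows above the threshold go to OCS, the rest to EPS."""
    ocs, eps = {}, {}
    for flow, volume in traffic.items():
        (ocs if volume >= ocs_threshold_bytes else eps)[flow] = volume
    return ocs, eps


if __name__ == "__main__":
    # Example: 8 nodes exchanging 100 MB of gradients per iteration.
    traffic = ring_allreduce_traffic(num_nodes=8, gradient_bytes=100 * 2**20)
    ocs, eps = split_flows(traffic, ocs_threshold_bytes=10 * 2**20)
    print(f"{len(ocs)} flows on optical circuits, {len(eps)} on the packet switch")
```

Because the traffic matrix does not change between iterations, a circuit assignment computed once can be reused for the rest of training, amortizing the reconfiguration delay of the optical switch.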