On the Feasibility of Hybrid Electrical/Optical Switch Architecture for Large-Scale Training of Distributed Deep Learning

Truong Thao Nguyen, Ryousei Takano
DOI: 10.1109/PHOTONICS49561.2019.00007 (https://doi.org/10.1109/PHOTONICS49561.2019.00007)
Publication date: 2019-11-01
Publication type: Journal Article
Journal (as listed): Optics and Photonics Journal
Citations: 5

Abstract

Data parallelism is the dominant method for training deep learning (DL) models on High-Performance Computing systems such as large-scale GPU clusters. When a DL model is trained on a large number of nodes, inter-node communication becomes a bottleneck because its latency is higher and its link bandwidth lower than those of intra-node communication. To cope with this problem, techniques have been proposed to (a) optimize the collective communication algorithms so that they take the network topology into account, (b) reduce the message size, and (c) overlap communication with computation. All of these approaches aim to handle the large-message-size issue while diminishing the effect of the limitations of the inter-node network. In this study, we investigate the benefit of increasing inter-node link bandwidth by using a hybrid switching system, i.e., one that combines Electrical Packet Switching and Optical Circuit Switching. We found that the typical data transfers of synchronous data-parallel training are long-lived and rarely change, so they can be sped up with optical switching. Simulation results on the SimGrid simulator show that our approach speeds up the training of a deep learning application by around 10%.
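To make the bandwidth argument in the abstract concrete, the following is a minimal back-of-the-envelope sketch in Python. It is not the authors' SimGrid simulation. It assumes a standard latency-bandwidth cost model for a ring allreduce, and the function names, the overlap parameter, and every number in the usage part (node count, gradient size, latency, link speeds, compute time) are hypothetical values chosen only for illustration.

```python
def ring_allreduce_time(num_nodes, message_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimate ring allreduce time with a latency-bandwidth cost model.

    A ring allreduce takes 2 * (p - 1) steps, and each step moves a chunk of
    message_bytes / p over one link. This is a textbook approximation, not
    the model used in the paper.
    """
    if num_nodes < 2:
        return 0.0  # a single node has nothing to exchange
    steps = 2 * (num_nodes - 1)
    chunk_bytes = message_bytes / num_nodes
    return steps * (latency_s + chunk_bytes / bandwidth_bytes_per_s)


def iteration_time(compute_s, comm_s, overlap_fraction=0.0):
    """Time of one training iteration when part of the communication
    is hidden behind computation (overlap_fraction in [0, 1])."""
    hidden = min(comm_s * overlap_fraction, compute_s)
    return compute_s + comm_s - hidden


if __name__ == "__main__":
    # All values below are hypothetical and only illustrate the trend.
    nodes = 64
    gradient_bytes = 100e6          # assumed gradient size per iteration
    latency = 5e-6                  # assumed per-step latency in seconds
    compute = 0.2                   # assumed compute time per iteration
    for label, bandwidth in [("slower link", 1.25e9), ("faster link", 12.5e9)]:
        comm = ring_allreduce_time(nodes, gradient_bytes, latency, bandwidth)
        total = iteration_time(compute, comm, overlap_fraction=0.5)
        print(f"{label}: comm={comm:.4f}s iteration={total:.4f}s")
```

Under this model the bandwidth term dominates for large messages, which is why raising the inter-node link bandwidth shortens each iteration, while the latency term and the compute time are left unchanged.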