Alessandro Ottino, Joshua Benjamin, Georgios Zervas
RAMP: A flat nanosecond optical network and MPI operations for distributed deep learning systems
Optical Switching and Networking, vol. 51, Article 100761. Published 2023-08-17. DOI: 10.1016/j.osn.2023.100761
https://www.sciencedirect.com/science/article/pii/S1573427723000322
Citations: 0
Abstract
Distributed deep learning (DDL) systems strongly depend on network performance. Current electronic packet switched (EPS) network architectures and technologies suffer from variable-diameter topologies, low bisection bandwidth and over-subscription, all of which affect the completion time of communication and collective operations. We introduce RAMP, a near-exascale, full-bisection bandwidth, all-to-all, single-hop, all-optical network architecture with nanosecond reconfiguration, which supports large-scale distributed and parallel computing systems (12.8 Tbps per node for up to 65,536 nodes). For the first time, a custom RAMP-x MPI strategy and a network transcoder are proposed to run MPI collective operations across the optical circuit switched (OCS) network in a schedule-less and contention-less manner. RAMP achieves a 7.6-171× speed-up in completion time across all MPI operations compared to realistic EPS and OCS counterparts. It can also deliver a 1.3-16× and 7.8-58× reduction in Megatron and DLRM training time respectively, while offering a 38-47× and 6.4-26.5× improvement in energy consumption and cost respectively.
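As a rough check on the scale quoted in the abstract, the per-node bandwidth and maximum node count can be multiplied to estimate aggregate network capacity. The sketch below uses only those two figures; the function and variable names are illustrative, and reading "near-exascale" as aggregate bandwidth close to 1 Ebps is an assumption rather than something the abstract states.

```python
# Figures quoted in the abstract.
PER_NODE_TBPS = 12.8   # bandwidth per node, in Tbps
MAX_NODES = 65_536     # maximum number of nodes


def aggregate_bandwidth_tbps(per_node_tbps: float, nodes: int) -> float:
    """Total injection bandwidth if every node transmits at its full rate."""
    return per_node_tbps * nodes


total_tbps = aggregate_bandwidth_tbps(PER_NODE_TBPS, MAX_NODES)
total_pbps = total_tbps / 1_000       # 1 Pbps = 1,000 Tbps
total_ebps = total_tbps / 1_000_000   # 1 Ebps = 1,000,000 Tbps

print(f"{total_tbps:,.1f} Tbps = {total_pbps:,.1f} Pbps = {total_ebps:.3f} Ebps")
```

Under these assumptions the aggregate comes to roughly 0.84 Ebps, which would be consistent with the "near-exascale" description.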
About the journal:
Optical Switching and Networking (OSN) is an archival journal aiming to provide complete coverage of all topics of interest to those involved in the optical and high-speed opto-electronic networking areas. The editorial board is committed to providing detailed, constructive feedback to submitted papers, as well as a fast turn-around time.
Optical Switching and Networking considers high-quality, original, and unpublished contributions addressing all aspects of optical and opto-electronic networks. Specific areas of interest include, but are not limited to:
• Optical and Opto-Electronic Backbone, Metropolitan and Local Area Networks
• Optical Data Center Networks
• Elastic optical networks
• Green Optical Networks
• Software Defined Optical Networks
• Novel Multi-layer Architectures and Protocols (Ethernet, Internet, Physical Layer)
• Optical Networks for Internet of Things (IoT)
• Home Networks, In-Vehicle Networks, and Other Short-Reach Networks
• Optical Access Networks
• Optical Data Center Interconnection Systems
• Optical OFDM and coherent optical network systems
• Free Space Optics (FSO) networks
• Hybrid Fiber-Wireless Networks
• Optical Satellite Networks
• Visible Light Communication Networks
• Optical Storage Networks
• Optical Network Security
• Optical Network Resilience and Reliability
• Control Plane Issues and Signaling Protocols
• Optical Quality of Service (OQoS) and Impairment Monitoring
• Optical Layer Anycast, Broadcast and Multicast
• Optical Network Applications, Testbeds and Experimental Networks
• Optical Network for Science and High Performance Computing Networks