Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction

2016 First International Workshop on Communication Optimizations in HPC (COMHPC), Pub Date: 2016-11-13, DOI: 10.1109/COM-HPC.2016.6

R. Graham, Devendar Bureddy, Pak Lui, H. Rosenstock, G. Shainer, Gil Bloch, Dror Goldenberg, Mike Dubman, Sasha Kotchubievsky, Vladimir Koushnir, Lion Levi, Alexander Margolin, Tamir Ronen, Alexander Shpiner, O. Wertheim, E. Zahavi


Citations: 82

Abstract

Increasing system size and a greater reliance on system parallelism to meet computational needs require innovative system architectures to meet the simulation challenges. As a step towards a new class of network co-processors (intelligent network devices that manipulate data traversing the data-center network), this paper describes the SHArP technology, designed to offload collective operation processing to the network. It is implemented in Mellanox's SwitchIB-2 ASIC, using in-network trees to reduce data from a group of sources and to distribute the result. Multiple parallel jobs with several partially overlapping groups are supported, each with several reduction operations in flight. Large performance enhancements are obtained: the latency of an eight-byte MPI_Allreduce() operation on 128 hosts improves by a factor of 2.1, from 6.01 to 2.83 microseconds. With pipelining, the latency of a 4096-byte MPI_Allreduce() operation improves by a factor of 3.24, from 46.93 to 14.48 microseconds.
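The reduce-then-distribute pattern over a tree that the abstract describes can be illustrated with a small software simulation. This is only a sketch under stated assumptions, not the SwitchIB-2 implementation: the function name `tree_allreduce`, the `radix` parameter (how many children each aggregation node combines), and the grouping of hosts in index order are all hypothetical choices made for illustration.

```python
from functools import reduce
from operator import add
from typing import Callable, List, Sequence, Tuple


def tree_allreduce(
    host_values: Sequence[int],
    radix: int = 4,
    op: Callable[[int, int], int] = add,
) -> Tuple[List[int], int]:
    """Simulate an allreduce over a hypothetical aggregation tree.

    Returns the value each host ends up with and the number of tree
    levels the reduction passed through. `radix` is an assumed fan-in,
    not a figure taken from the paper.
    """
    if radix < 2:
        raise ValueError("radix must be at least 2")
    if not host_values:
        return [], 0

    # Reduction phase: at each level, every aggregation node combines
    # the values of up to `radix` children into one partial result.
    level = list(host_values)
    levels = 0
    while len(level) > 1:
        level = [
            reduce(op, level[i:i + radix])
            for i in range(0, len(level), radix)
        ]
        levels += 1

    # Distribution phase: the root's result is sent back to every host.
    result = level[0]
    return [result] * len(host_values), levels


if __name__ == "__main__":
    values, depth = tree_allreduce(list(range(128)), radix=4)
    print(values[0], depth)
```

In this model the number of combining steps grows with the depth of the tree rather than with the number of hosts, which is the intuition behind offloading the reduction to the network; the simulation says nothing about actual latencies.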



