Efficient Reliability Support for Hardware Multicast-Based Broadcast in GPU-enabled Streaming Applications

2016 First International Workshop on Communication Optimizations in HPC (COMHPC). Pub Date: 2016-11-13. DOI: 10.1109/COM-HPC.2016.9

Ching-Hsiang Chu, Khaled Hamidouche, H. Subramoni, Akshay Venkatesh, B. Elton, D. Panda


Citations: 2

Abstract

Streaming applications, which are data-intensive, are run extensively on High-Performance Computing (HPC) systems in pursuit of higher performance and scalability. These applications typically use broadcast operations to disseminate data in real time from a single source to multiple workers, each of which is a multi-GPU computing site. State-of-the-art broadcast operations take advantage of InfiniBand (IB) hardware multicast (MCAST) and NVIDIA GPUDirect features to boost inter-node communication performance and scalability. The IB MCAST feature works only with the IB Unreliable Datagram (UD) mechanism and consequently provides unreliable communication for applications. Higher-level libraries and/or runtime environments must therefore provide reliability explicitly. However, handling reliability at that level can be a performance bottleneck for streaming applications. In this paper, we analyze the specific requirements of streaming applications and the performance bottlenecks involved in handling reliability. We show that the traditional Negative-Acknowledgement (NACK) based approach requires the broadcast sender to retransmit lost packets, which degrades streaming throughput. To alleviate this issue, we propose a novel Remote Memory Access (RMA) based scheme that provides high-performance reliability support at the MPI level. In the proposed scheme, the receivers themselves (as opposed to the sender) retrieve lost packets through RMA operations. Furthermore, we provide an analytical model to illustrate the memory requirements of the proposed RMA-based scheme. Our experimental results show that the proposed scheme introduces almost no overhead compared with existing solutions. In a micro-benchmark with injected failures (to simulate unreliable network environments), the proposed scheme shows up to a 45% reduction in latency compared with the existing NACK-based scheme. Moreover, with a synthetic streaming benchmark, our design shows up to a 56% higher broadcast rate than the traditional NACK-based scheme on a GPU-dense Cray CS-Storm system with up to 88 NVIDIA K80 GPU cards.
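The contrast between sender-driven and receiver-driven recovery described in the abstract can be illustrated with a small toy simulation. This is only a sketch under stated assumptions, not the paper's implementation: it is pure Python with no MPI or InfiniBand calls, the names `Sender`, `Receiver`, `window`, `handle_nack`, `recover_rma`, and `run` are hypothetical, the loss pattern is fixed by hand, and a dictionary lookup stands in for a one-sided remote read. It only counts how much recovery work lands on the broadcast source under each scheme.

```python
from dataclasses import dataclass, field


@dataclass
class Sender:
    """Toy broadcast source; `window` stands in for a remotely readable buffer."""
    window: dict = field(default_factory=dict)  # seq -> payload
    retransmissions: int = 0                    # recovery work done by the sender

    def broadcast(self, seq, payload, receivers, lost):
        self.window[seq] = payload              # keep a copy for later recovery
        for r in receivers:
            if (r.rank, seq) not in lost:       # unreliable delivery: drop if listed
                r.deliver(seq, payload)

    def handle_nack(self, seq):
        self.retransmissions += 1               # NACK path: the sender must resend
        return self.window[seq]


@dataclass
class Receiver:
    rank: int
    received: dict = field(default_factory=dict)

    def deliver(self, seq, payload):
        self.received[seq] = payload

    def missing(self, total):
        # Gaps in the sequence numbers reveal lost packets.
        return [s for s in range(total) if s not in self.received]

    def recover_nack(self, sender, total):
        for seq in self.missing(total):
            self.received[seq] = sender.handle_nack(seq)

    def recover_rma(self, sender, total):
        for seq in self.missing(total):
            # Simulated one-sided read: no sender-side handler is invoked.
            self.received[seq] = sender.window[seq]


def run(scheme, n_receivers=4, n_packets=8,
        lost=frozenset({(1, 2), (3, 5), (3, 6)})):
    """Broadcast n_packets, then recover losses with the chosen scheme."""
    sender = Sender()
    receivers = [Receiver(rank) for rank in range(n_receivers)]
    for seq in range(n_packets):
        sender.broadcast(seq, f"pkt-{seq}", receivers, lost)
    for r in receivers:
        getattr(r, f"recover_{scheme}")(sender, n_packets)
    complete = all(len(r.received) == n_packets for r in receivers)
    return complete, sender.retransmissions


if __name__ == "__main__":
    print("nack:", run("nack"))  # sender performs one resend per lost packet
    print("rma: ", run("rma"))   # receivers fetch losses; sender does no resends
```

With the three injected losses above, both schemes deliver every packet to every receiver, but the NACK path records three sender-side retransmissions while the simulated RMA path records none. The sketch also hints at why a memory model is needed: the sender's `window` must retain each packet until no receiver can still need it, and here it simply grows without bound.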

