
Latest Publications: 2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)

Message from the HiPC 2022 General Co-Chairs
{"title":"Message from the HiPC 2022 General Co-Chairs","authors":"","doi":"10.1109/hipc56025.2022.00005","DOIUrl":"https://doi.org/10.1109/hipc56025.2022.00005","url":null,"abstract":"","PeriodicalId":119363,"journal":{"name":"2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115286608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Accelerating Prefix Scan with in-network computing on Intel PIUMA
Kartik Lakhotia, F. Petrini, R. Kannan, V. Prasanna
Prefix Scan is a versatile collective used in several classes of algorithms, including sorting, lexical analysis, graph analytics, and regex matching. It is also a powerful tool for tree operations and load balancing. However, host-based Prefix Scan implementations incur high latency, heavy network traffic, and poor scalability on large distributed systems. We explore in-network computation to accelerate Prefix Scan, using switches with data aggregation capabilities. We discuss the fundamental challenges associated with offloading Prefix Scan onto a network, and resolve them with innovations in dataflow topology and embedding methodology. We implement the proposed approach on the Intel PIUMA system. To the best of our knowledge, this is the first realization of Prefix Scan offloading onto network switches. Our in-network Prefix Scan is highly scalable, with less than 5μs latency on 16K PIUMA nodes and 6× lower latency than the host-based Prefix Scan. The performance benefits directly translate to improved workload scalability, as we demonstrate using a key bioinformatics application, Sequence Alignment.
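For context, the host-based baseline this work accelerates can be expressed with MPI's scan collective. Below is a minimal mpi4py sketch (one contributed value per rank; the SUM operator and all names are illustrative, not taken from the paper):

```python
# Minimal host-based inclusive prefix scan with mpi4py: each rank
# contributes one value and receives the sum over ranks 0..r.
# Illustrative sketch of the baseline, not the paper's code.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local = np.array([rank + 1], dtype=np.int64)  # this rank's contribution
prefix = np.empty_like(local)

comm.Scan(local, prefix, op=MPI.SUM)  # MPI_Scan: inclusive prefix sum
print(f"rank {rank}: inclusive prefix sum = {prefix[0]}")
```

Run with, e.g., `mpirun -n 4 python scan.py`; the offload described in the paper replaces the software reduction tree behind this call with aggregation in the PIUMA switches.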
{"title":"Accelerating Prefix Scan with in-network computing on Intel PIUMA","authors":"Kartik Lakhotia, F. Petrini, R. Kannan, V. Prasanna","doi":"10.1109/HiPC56025.2022.00020","DOIUrl":"https://doi.org/10.1109/HiPC56025.2022.00020","url":null,"abstract":"Prefix Scan is a versatile collective used in several classes of algorithms including sorting, lexical analysis, graph analytics, and regex matching. It is also a powerful tool to perform tree operations and load balancing. However, host-based Prefix Scan implementations incur high latency, large network traffic and poor scalability on large distributed systems.We explore in-network computation to accelerate Prefix Scan, using switches with data aggregation capabilities. We discuss the fundamental challenges associated with offloading Prefix Scan onto a network, and resolve them with innovations in dataflow topology and embedding methodology. We implement the proposed approach on the Intel PIUMA system. To the best of our knowledge, this is the first realization of a Prefix Scan offloading onto network switches.Our in-network Prefix Scan is highly scalable with less than 5μs latency on 16K PIUMA nodes and 6× lower latency than the host-based Prefix Scan. The performance benefits directly translate to improved workload scalability, as we demonstrate using a key bioinformatics application called Sequence Alignment.","PeriodicalId":119363,"journal":{"name":"2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116890030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Efficient Personalized and Non-Personalized Alltoall Communication for Modern Multi-HCA GPU-Based Clusters
K. Suresh, Akshay Paniraja Guptha, Benjamin Michalowicz, B. Ramesh, M. Abduljabbar, A. Shafi, H. Subramoni, D. Panda
Graphics Processing Units (GPUs) have become ubiquitous in today’s supercomputing clusters, primarily because of their high compute capability and power efficiency. Message Passing Interface (MPI) is a widely adopted programming model for large-scale GPU-based applications in such clusters. Modern GPU-based systems have multiple HCAs. Previously, scientists have leveraged multi-HCA systems to accelerate inter-node transfers between CPUs using point-to-point primitives. In this work, we show the need for collective-level, multi-rail-aware algorithms using MPI_Allgather as an example. We then propose an efficient multi-rail MPI_Allgather algorithm and extend it to MPI_Alltoall. We analyze the performance of this algorithm using the OMB benchmark suite. We demonstrate approximately 30% and 43% improvements in non-personalized and personalized communication benchmarks, respectively, compared with state-of-the-art MPI libraries on 128 GPUs.
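To make the two collectives concrete, here is a minimal mpi4py sketch of their application-level usage (buffer sizes and values are illustrative); the multi-rail awareness the paper proposes lives inside the MPI library, so code like this is unchanged:

```python
# Allgather (non-personalized): every rank receives every rank's block.
# Alltoall (personalized): rank i sends a distinct block to each rank j.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

send = np.full(4, rank, dtype=np.float32)        # same block to everyone
gathered = np.empty(4 * size, dtype=np.float32)
comm.Allgather(send, gathered)

send_all = rank * size + np.arange(size, dtype=np.float32)  # one block per peer
recv_all = np.empty(size, dtype=np.float32)
comm.Alltoall(send_all, recv_all)
```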
{"title":"Efficient Personalized and Non-Personalized Alltoall Communication for Modern Multi-HCA GPU-Based Clusters","authors":"K. Suresh, Akshay Paniraja Guptha, Benjamin Michalowicz, B. Ramesh, M. Abduljabbar, A. Shafi, H. Subramoni, D. Panda","doi":"10.1109/HiPC56025.2022.00025","DOIUrl":"https://doi.org/10.1109/HiPC56025.2022.00025","url":null,"abstract":"Graphics Processing Units (GPUs) have become ubiquitous in today’s supercomputing clusters primarily because of their high compute capability and power efficiency. Message Passing Interface (MPI) is a widely adopted programming model for large-scale GPU-based applications used in such clusters. Modern GPU-based systems have multiple HCAs. Previously, scientists have leveraged multi-HCA systems to accelerate inter-node transfers between CPUs using point-to-point primitives. In this work, we show the need for collective-level, multi-rail aware algorithms using MPI_Allgather as an example. We then propose an efficient multi-rail MPI_Allgather algorithm and extend it to MPI_Alltoall. We analyze the performance of this algorithm using OMB benchmark suite. We demonstrate approximately 30% and 43% improvement in non-personalized and personalized communication benchmarks respectively when compared with the state-of-the-art MPI libraries on 128 GPUs","PeriodicalId":119363,"journal":{"name":"2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116024451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Keynote 2: P Sadayappan
{"title":"Keynote 2: P Sadayappan","authors":"","doi":"10.1109/hipc56025.2022.00011","DOIUrl":"https://doi.org/10.1109/hipc56025.2022.00011","url":null,"abstract":"","PeriodicalId":119363,"journal":{"name":"2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128236302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Keynote 1: Paolo Ienne
{"title":"Keynote 1: Paolo Lenne","authors":"","doi":"10.1109/hipc56025.2022.00010","DOIUrl":"https://doi.org/10.1109/hipc56025.2022.00010","url":null,"abstract":"","PeriodicalId":119363,"journal":{"name":"2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122369722","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
HiPC 2022 Organization
{"title":"HiPC 2022 Organization","authors":"","doi":"10.1109/hipc56025.2022.00007","DOIUrl":"https://doi.org/10.1109/hipc56025.2022.00007","url":null,"abstract":"","PeriodicalId":119363,"journal":{"name":"2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"124 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129039570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Input Feature Pruning for Accelerating GNN Inference on Heterogeneous Platforms
Jason Yik, S. Kuppannagari, Hanqing Zeng, V. Prasanna
Graph Neural Networks (GNNs) are an emerging class of machine learning models that use structured graph information and node features to reduce high-dimensional input data to low-dimensional embeddings, from which predictions can be made. Due to the compounding effect of aggregating neighbor information, GNN inference requires raw data from many times more nodes than are targeted for prediction. Thus, on heterogeneous compute platforms, inference latency can be largely determined by the inter-device communication cost of transferring input feature data to the GPU/accelerator before computation has even begun. In this paper, we analyze the trade-off of pruning input features from GNN models: reducing the volume of raw data that the model works with lowers communication latency at the expense of an expected decrease in overall model accuracy. We develop greedy and regression-based algorithms to determine which features to retain for optimal prediction accuracy. We evaluate pruned model variants and find that they can reduce inference latency by up to 80% with an accuracy loss of less than 5% compared to non-pruned models. Furthermore, we show that the latency reductions from input feature pruning extend across different system variables such as batch size and floating-point precision.
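A hedged sketch of the greedy selection idea described above: repeatedly add the feature whose inclusion most improves a validation score until a budget is reached. The `score_fn` callback (e.g., train a small proxy model and return validation accuracy) and all names here are placeholders, not the authors' implementation:

```python
# Hypothetical greedy forward selection: grow the kept-feature set by the
# single feature that most improves a validation score, up to a budget.
import numpy as np

def greedy_select(X: np.ndarray, y: np.ndarray, budget: int, score_fn) -> list[int]:
    kept: list[int] = []
    remaining = set(range(X.shape[1]))
    while len(kept) < budget and remaining:
        best_f, best_s = None, -np.inf
        for f in sorted(remaining):
            s = score_fn(X[:, kept + [f]], y)  # score with candidate feature added
            if best_f is None or s > best_s:
                best_f, best_s = f, s
        kept.append(best_f)
        remaining.remove(best_f)
    return kept  # column indices to keep; prune the rest before transfer
```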
{"title":"Input Feature Pruning for Accelerating GNN Inference on Heterogeneous Platforms","authors":"Jason Yik, S. Kuppannagari, Hanqing Zeng, V. Prasanna","doi":"10.1109/HiPC56025.2022.00045","DOIUrl":"https://doi.org/10.1109/HiPC56025.2022.00045","url":null,"abstract":"Graph Neural Networks (GNNs) are an emerging class of machine learning models which utilize structured graph information and node features to reduce high-dimensional input data to low-dimensional embeddings, from which predictions can be made. Due to the compounding effect of aggregating neighbor information, GNN inferences require raw data from many times more nodes than are targeted for prediction. Thus, on heterogeneous compute platforms, inference latency can be largely subject to the inter-device communication cost of transferring input feature data to the GPU/accelerator before computation has even begun. In this paper, we analyze the trade-off effect of pruning input features from GNN models, reducing the volume of raw data that the model works with to lower communication latency at the expense of an expected decrease in the overall model accuracy. We develop greedy and regression-based algorithms to determine which features to retain for optimal prediction accuracy. We evaluate pruned model variants and find that they can reduce inference latency by up to 80% with an accuracy loss of less than 5% compared to non-pruned models. Furthermore, we show that the latency reductions from input feature pruning can be extended under different system variables such as batch size and floating point precision.","PeriodicalId":119363,"journal":{"name":"2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133407326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
COMPROF and COMPLACE: Shared-Memory Communication Profiling and Automated Thread Placement via Dynamic Binary Instrumentation
Ryan Kirkpatrick, Christopher Brown, Vladimir Janjic
This paper presents COMPROF and COMPLACE, a novel profiling tool and thread placement technique for shared-memory architectures that require no recompilation or user intervention. We use dynamic binary instrumentation to intercept memory operations and estimate inter-thread communication overhead, deriving (and optionally visualising) a communication graph of data sharing between threads. We then use this graph to map threads to cores in order to optimise memory traffic through the memory system. Different paths through a system’s memory hierarchy have different latency, throughput, and energy properties; COMPLACE exploits this heterogeneity to provide automatic performance and energy improvements for multithreaded programs. We demonstrate COMPLACE on the NAS Parallel Benchmark (NPB) suite, where our technique achieves improvements of up to 12% in execution time and up to 10% in energy consumption (compared to default Linux scheduling) while not requiring any modification or recompilation of the application code.
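A toy illustration of the placement step, under the assumption that a communication graph has already been measured: greedily co-locate the heaviest-communicating thread pairs on cores that share a cache level. The graph, core groups, and thread IDs are made up for illustration; the real tool derives everything via dynamic binary instrumentation:

```python
# Toy placement pass: given a measured inter-thread communication graph,
# co-locate the heaviest-communicating pairs on cores sharing a cache.
comm = {(0, 1): 10_000_000, (2, 3): 8_000_000, (0, 2): 1_000}  # bytes exchanged
cache_groups = [[0, 1], [2, 3]]  # cores sharing, say, an L2 cache

placement: dict[int, int] = {}
free = {c for grp in cache_groups for c in grp}
for (a, b), _ in sorted(comm.items(), key=lambda kv: -kv[1]):
    if a in placement or b in placement:
        continue  # already-placed threads are skipped for brevity
    for grp in cache_groups:
        avail = [c for c in grp if c in free]
        if len(avail) >= 2:
            placement[a], placement[b] = avail[0], avail[1]
            free -= {avail[0], avail[1]}
            break

# On Linux, each thread would then be pinned to its chosen core, e.g.:
# os.sched_setaffinity(tid, {placement[thread_id]})
print(placement)
```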
{"title":"COMPROF and COMPLACE: Shared-Memory Communication Profiling and Automated Thread Placement via Dynamic Binary Instrumentation","authors":"Ryan Kirkpatrick, Christopher Brown, Vladimir Janjic","doi":"10.1109/HiPC56025.2022.00040","DOIUrl":"https://doi.org/10.1109/HiPC56025.2022.00040","url":null,"abstract":"This paper presents COMPROF and COMPLACE, a novel profiling tool and thread placement technique for shared-memory architectures that requires no recompilation or user intervention. We use dynamic binary instrumentation to intercept memory operations and estimate inter-thread communication overhead, deriving (and possibly visualising) a communication graph of data-sharing between threads. We then use this graph to map threads to cores in order to optimise memory traffic through the memory system. Different paths through a system’s memory hierarchy have different latency, throughput and energy properties, COMPLACE exploits this heterogeneity to provide automatic performance and energy improvements for multithreaded programs. We demonstrate COMPLACE on the NAS Parallel Benchmark (NPB) suite where, using our technique, we are able to achieve improvements of up to 12% in the execution time and up to 10% in the energy consumption (compared to default Linux scheduling) while not requiring any modification or recompilation of the application code.","PeriodicalId":119363,"journal":{"name":"2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115121152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Accelerating Broadcast Communication with GPU Compression for Deep Learning Workloads
Qinghua Zhou, Quentin G. Anthony, A. Shafi, H. Subramoni, D. Panda
With rapidly increasing model sizes, state-of-the-art Deep Learning (DL) models rely on multiple GPU nodes for distributed training, and large-message communication of GPU data between GPUs is becoming a bottleneck in overall training performance. GPU-Aware MPI libraries are widely adopted by state-of-the-art DL frameworks to improve communication performance. In existing optimization solutions for Distributed Data-Parallel (DDP) training, the broadcast operation is often used to synchronize updated model parameters across all GPUs. However, with state-of-the-art GPU-Aware MPI libraries, broadcasting large GPU data tends to overburden training performance due to the limited bandwidth of the interconnect between GPU nodes. On the other hand, recent research on using GPU-based compression libraries to lower the pressure on the nearly saturated interconnect, and on co-designing online compression with the communication pattern, provides a new perspective for optimizing broadcast performance on modern GPU clusters. In this paper, we redesign the GPU-Aware MPI library to enable efficient collective-level online compression with an optimized chunked-chain scheme for large-message broadcast communication. The proposed design is evaluated at both the microbenchmark and application levels. At the microbenchmark level, it reduces broadcast communication latency by up to 80.9% compared to the baseline using a state-of-the-art MPI library, and by 55.1% compared to existing point-to-point-based compression on modern GPU clusters. For DDP training with PyTorch, it reduces training time by up to 15.0% and 6.4% compared to the existing chunked-chain scheme and point-to-point-based compression, respectively, while maintaining similar training accuracy. To the best of our knowledge, this is the first work that leverages online GPU-based compression techniques to significantly accelerate broadcast communication for DL workloads.
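A rough sketch of the chunked-chain idea: the root compresses fixed-size chunks and pushes them down a rank chain, so each rank forwards chunk k while chunk k+1 is still in flight. Host-side zlib stands in here for the paper's GPU compression library, and the chunk size is arbitrary:

```python
# Chunked-chain broadcast with per-chunk compression (illustrative only).
import zlib
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
CHUNK = 1 << 20  # 1 MiB logical chunks

def chain_bcast(data):
    """Root (rank 0) passes `data` as bytes; other ranks pass None."""
    if size == 1:
        return data
    nxt, prev = rank + 1, rank - 1
    if rank == 0:
        chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
        comm.send(len(chunks), dest=nxt)
        for c in chunks:
            comm.send(zlib.compress(c), dest=nxt)  # compress once at the root
        return data
    n = comm.recv(source=prev)
    if rank < size - 1:
        comm.send(n, dest=nxt)
    parts = []
    for _ in range(n):
        packed = comm.recv(source=prev)
        if rank < size - 1:
            comm.send(packed, dest=nxt)        # forward the compressed chunk
        parts.append(zlib.decompress(packed))  # decompress only locally
    return b"".join(parts)
```

Every rank returns the full payload; the design in the paper additionally pipelines GPU-side (de)compression with the chain transfers.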
{"title":"Accelerating Broadcast Communication with GPU Compression for Deep Learning Workloads","authors":"Qinghua Zhou, Quentin G. Anthony, A. Shafi, H. Subramoni, D. Panda","doi":"10.1109/HiPC56025.2022.00016","DOIUrl":"https://doi.org/10.1109/HiPC56025.2022.00016","url":null,"abstract":"With the rapidly increasing model sizes, state-of-the-art Deep Learning (DL) models rely on multiple GPU nodes to run distributed training. Large message communication of GPU data between the GPUs is becoming a performance bottleneck in the overall training performance. GPU-Aware MPI libraries are widely adopted for state-of-the-art DL frameworks to improve communication performance. In the existing optimization solutions for Distributed Data-Parallel (DDP) training, the broadcast operation is often utilized to sync up the updated model parameters among all the GPUs. However, for state-of-the-art GPU-Aware MPI libraries, broadcasting large GPU data turns to overburden the training performance due to the limited bandwidth of interconnect between the GPU nodes. On the other hand, the recent research on using GPU-based compression libraries to lower the pressure on the nearly saturated interconnection and co-designing online compression with the communication pattern provides a new perspective to optimize the performance of broadcast on modern GPU clusters.In this paper, we redesign the GPU-Aware MPI library to enable efficient collective-level online compression with an optimized chunked-chain scheme for large message broadcast communication. The proposed design is evaluated to show benefits at both microbenchmark and application levels. At the microbenchmark level, the proposed design can reduce the broadcast communication latency by up to 80.9% compared to the baseline using a state-of-the-art MPI library and 55.1% compared to the existing point-to-point-based compression on modern GPU clusters. For DDP training with PyTorch, the proposed design reduces the training time by up to 15.0% and 6.4% compared to the existing chunked-chain scheme and point-to-point-based compression, respectively, while keeping similar training accuracy. To the best of our knowledge, this is the first work that leverages online GPU-based compression techniques to significantly accelerate broadcast communication for DL workloads.","PeriodicalId":119363,"journal":{"name":"2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"11 8","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120972327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Keynote 4: Per Stenström
{"title":"Keynote 4: Per Stenstr̈m","authors":"","doi":"10.1109/hipc56025.2022.00013","DOIUrl":"https://doi.org/10.1109/hipc56025.2022.00013","url":null,"abstract":"","PeriodicalId":119363,"journal":{"name":"2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116341100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0