A Novel Framework for Efficient Offloading of Communication Operations to Bluefield SmartNICs
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00022
K. Suresh, Benjamin Michalowicz, B. Ramesh, Nicholas Contini, Jinghan Yao, Shulei Xu, A. Shafi, H. Subramoni, D. Panda
Smart Network Interface Cards (SmartNICs) such as NVIDIA’s BlueField Data Processing Units (DPUs) provide advanced networking capabilities and processor cores, enabling the offload of complex operations away from the host. In the context of MPI, prior work has explored the use of DPUs to offload non-blocking collective operations. The limitations of current state-of-the-art approaches are twofold: They only work for a pre-defined set of algorithms/communication patterns and have degraded communication latency due to staging data between the DPU and the host. In this paper, we propose a framework that supports the offload of any communication pattern to the DPU while achieving low communication latency with perfect overlap. To achieve this, we first study the limitations of higher-level programming models such as MPI in expressing the offload of complex communication patterns to the DPU. We present a new set of APIs to alleviate these shortcomings and support any generic communication pattern. Then, we analyze the bottlenecks involved in offloading communication operations to the DPU and propose efficient designs for a few candidate communication patterns. To the best of our knowledge, this is the first framework providing both efficient and generic communication offload to the DPU. Our proposed framework outperforms state-of-the-art staging-based offload solutions by 47% in Alltoall micro-benchmarks, and at the application level, we see improvements up to 60% in P3DFFT and 15% in HPL on 512 processes.
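As a rough illustration of the descriptor-style offload the abstract describes, the sketch below hands an entire communication pattern to a background worker that stands in for the DPU-resident progress engine, letting the host compute while the pattern completes. All names here (CommOp, OffloadEngine) are hypothetical illustrations, not the authors' API; the real framework drives MPI/DPU primitives rather than Python threads.

```python
# Hypothetical sketch of a descriptor-based offload API; a worker thread
# stands in for the DPU. The host describes the whole pattern once, hands it
# off, and overlaps its own computation with the "offloaded" progress loop.
import threading
import time
from dataclasses import dataclass

@dataclass
class CommOp:
    kind: str      # "send" or "recv"
    peer: int      # rank of the communication partner
    nbytes: int    # message size

class OffloadEngine:
    """Stands in for the DPU-resident progress engine."""
    def __init__(self):
        self.done = threading.Event()

    def offload(self, pattern):
        # Non-blocking: the "DPU" progresses every op without host involvement.
        def progress():
            for _op in pattern:
                time.sleep(0.001)   # placeholder for posting/completing the op
            self.done.set()
        threading.Thread(target=progress, daemon=True).start()

    def wait(self):
        self.done.wait()

if __name__ == "__main__":
    # An Alltoall-like pattern among 4 ranks, expressed as explicit send/recv ops.
    pattern = ([CommOp("send", peer=p, nbytes=1 << 20) for p in range(4)]
               + [CommOp("recv", peer=p, nbytes=1 << 20) for p in range(4)])
    engine = OffloadEngine()
    engine.offload(pattern)                            # returns immediately
    host_result = sum(i * i for i in range(100_000))   # overlapped host compute
    engine.wait()                                      # complete the offloaded pattern
    print("compute result:", host_result)
```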
{"title":"A Novel Framework for Efficient Offloading of Communication Operations to Bluefield SmartNICs","authors":"K. Suresh, Benjamin Michalowicz, B. Ramesh, Nicholas Contini, Jinghan Yao, Shulei Xu, A. Shafi, H. Subramoni, D. Panda","doi":"10.1109/IPDPS54959.2023.00022","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00022","url":null,"abstract":"Smart Network Interface Cards (SmartNICs) such as NVIDIA’s BlueField Data Processing Units (DPUs) provide advanced networking capabilities and processor cores, enabling the offload of complex operations away from the host. In the context of MPI, prior work has explored the use of DPUs to offload non-blocking collective operations. The limitations of current state-of-the-art approaches are twofold: They only work for a pre-defined set of algorithms/communication patterns and have degraded communication latency due to staging data between the DPU and the host. In this paper, we propose a framework that supports the offload of any communication pattern to the DPU while achieving low communication latency with perfect overlap. To achieve this, we first study the limitations of higher-level programming models such as MPI in expressing the offload of complex communication patterns to the DPU. We present a new set of APIs to alleviate these shortcomings and support any generic communication pattern. Then, we analyze the bottlenecks involved in offloading communication operations to the DPU and propose efficient designs for a few candidate communication patterns. To the best of our knowledge, this is the first framework providing both efficient and generic communication offload to the DPU. Our proposed framework outperforms state-of-the-art staging-based offload solutions by 47% in Alltoall micro-benchmarks, and at the application level, we see improvements up to 60% in P3DFFT and 15% in HPL on 512 processes.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132980088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PFedSA: Personalized Federated Multi-Task Learning via Similarity Awareness
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00055
Chuyao Ye, Hao Zheng, Zhi-gang Hu, Meiguang Zheng
Federated Learning (FL) constructs a distributed machine learning framework in which multiple remote clients collaboratively train models. However, in real-world situations, non-Independent and Identically Distributed (non-IID) data means that the global model produced by traditional FL algorithms no longer meets the needs of all clients, and its accuracy is greatly reduced. In this paper, we propose a personalized federated multi-task learning method via similarity awareness (PFedSA), which captures the similarity between client data through the model parameters uploaded by clients, thus facilitating collaborative training of similar clients and providing personalized models based on each client’s data distribution. Specifically, it generates the intrinsic cluster structure among clients and introduces personalized patch layers into each cluster to personalize the cluster model. PFedSA also maintains the generalization ability of the models: during training, each client benefits from nodes with similar data distributions, and the greater the similarity, the larger the benefit. We evaluate the performance of PFedSA using the MNIST, EMNIST and CIFAR10 datasets, and investigate the impact of different data setting schemes on its performance. The results show that, in all data setting scenarios, PFedSA achieves the best personalization performance, with more clients reaching higher accuracy, and it is especially effective when client data is non-IID.
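A minimal sketch of the similarity-awareness step, assuming clients are grouped by the cosine similarity of their uploaded (flattened) model parameters with a greedy threshold rule; the paper's actual clustering procedure and similarity measure may differ.

```python
# Cluster clients whose uploaded parameters are close, so similar clients
# train together; threshold and greedy grouping are illustrative choices.
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def cluster_clients(client_params, threshold=0.9):
    """client_params: list of 1-D arrays (flattened model parameters)."""
    clusters = []  # each cluster is a list of client indices
    for i, p in enumerate(client_params):
        for cluster in clusters:
            rep = client_params[cluster[0]]            # cluster representative
            if cosine_similarity(p, rep) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])                       # start a new cluster
    return clusters

# Example: clients 0 and 1 share one data distribution, client 2 another.
rng = np.random.default_rng(0)
base_a, base_b = rng.normal(size=100), rng.normal(size=100)
params = [base_a + 0.01 * rng.normal(size=100),
          base_a + 0.01 * rng.normal(size=100),
          base_b + 0.01 * rng.normal(size=100)]
print(cluster_clients(params))   # -> [[0, 1], [2]]
```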
{"title":"PFedSA: Personalized Federated Multi-Task Learning via Similarity Awareness","authors":"Chuyao Ye, Hao Zheng, Zhi-gang Hu, Meiguang Zheng","doi":"10.1109/IPDPS54959.2023.00055","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00055","url":null,"abstract":"Federated Learning (FL) constructs a distributed machine learning framework that involves multiple remote clients collaboratively training models. However in real-world situations, the emergence of non-Independent and Identically Distributed (non-IID) data makes the global model generated by traditional FL algorithms no longer meet the needs of all clients, and the accuracy is greatly reduced. In this paper, we propose a personalized federated multi-task learning method via similarity awareness (PFedSA), which captures the similarity between client data through model parameters uploaded by clients, thus facilitating collaborative training of similar clients and providing personalized models based on each client’s data distribution. Specifically, it generates the intrinsic cluster structure among clients and introduces personalized patch layers into the cluster to personalize the cluster model. PFedSA also maintains the generalization ability of models, which allows each client to benefit from nodes with similar data distributions when training data, and the greater the similarity, the more benefit. We evaluate the performance of the PFedSA method using MNIST, EMNIST and CIFAR10 datasets, and investigate the impact of different data setting schemes on the performance of PFedSA. The results show that in all data setting scenarios, the PFedSA method proposed in this paper can achieve the best personalization performance, having more clients with higher accuracy, and it is especially effective when the client’s data is non-IID.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134249511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GraphMetaP: Efficient MetaPath Generation for Dynamic Heterogeneous Graph Models
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00012
Haiheng He, Dan Chen, Long Zheng, Yu Huang, Haifeng Liu, Chao Liu, Xiaofei Liao, Hai Jin
Metapath-based heterogeneous graph models (MHGM) show excellent performance in learning semantic and structural information in heterogeneous graphs. Metapath matching, which finds all metapath instances, is an essential processing step in MHGM and brings significant overhead relative to the total model execution time. Even worse, in dynamic heterogeneous graphs, metapath instances must be rematched whenever the graph is updated. In this paper, we observe that only a small fraction of metapath instances change and propose GraphMetaP, an efficient incremental metapath maintenance method that eliminates the matching overhead in dynamic heterogeneous graphs. GraphMetaP introduces a novel format for metapath instances that captures the dependencies among them. The format incrementally maintains metapath instances based on the graph updates, avoiding the overhead of rematching metapaths on the updated graph. Furthermore, GraphMetaP uses a folding scheme to simplify the format so that all metapath instances can be recovered faster. Experiments show that GraphMetaP enables efficient maintenance of metapath instances on dynamic heterogeneous graphs and outperforms the metapath matching method by 172.4X on average.
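To make the incremental idea concrete, here is a toy sketch for the Author-Paper-Author metapath: when an edge is inserted, only instances that pass through the new edge are generated, instead of re-matching the whole graph. GraphMetaP's dependency-capturing format is far more general than this hard-coded case, which is our own illustration.

```python
# Incremental maintenance of A-P-A metapath instances: an inserted
# (author, paper) edge can only create instances through that edge.
from collections import defaultdict

authors_of = defaultdict(set)   # paper -> set of its authors
apa_instances = set()           # maintained set of (author1, paper, author2)

def insert_author_paper_edge(author, paper):
    # Only the neighborhood of the new edge yields new A-P-A instances.
    for other in authors_of[paper]:
        apa_instances.add((author, paper, other))
        apa_instances.add((other, paper, author))
    authors_of[paper].add(author)

insert_author_paper_edge("a1", "p1")
insert_author_paper_edge("a2", "p1")   # creates (a1,p1,a2) and (a2,p1,a1)
insert_author_paper_edge("a3", "p1")   # adds 4 more instances incrementally
print(sorted(apa_instances))
```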
{"title":"GraphMetaP: Efficient MetaPath Generation for Dynamic Heterogeneous Graph Models","authors":"Haiheng He, Dan Chen, Long Zheng, Yu Huang, Haifeng Liu, Chao Liu, Xiaofei Liao, Hai Jin","doi":"10.1109/IPDPS54959.2023.00012","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00012","url":null,"abstract":"Metapath-based heterogeneous graph models (MHGM) show excellent performance in learning semantic and structural information in heterogeneous graphs. Metapath matching is an essential processing step in MHGM to find all metapath instances, bringing significant overhead compared to the total model execution time. Even worse, in dynamic heterogeneous graphs, metapath instances require to be rematched while graph updated. In this paper, we observe that only a small fraction of metapath instances change and propose GraphMetaP, an efficient incremental metapath maintenance method in order to eliminate the matching overhead in dynamic heterogeneous graphs. GraphMetaP introduces a novel format for metapath instances to capture the dependencies among the metapath instances. The format incrementally maintains metapath instances based on the graph updates to avoide the rematching metapath overhead for the updated graph. Furthermore, GraphMetaP uses the fold way to simplify the format in order to recover all metapath instances faster. Experiments show that GraphMetaP enables efficient maintenance of metapath instances on dynamic heterogeneous graphs and outperforms 172.4X on average compared to the matching metapath method.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124755594","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On Doorway Egress by Autonomous Robots
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00039
Rory Hector, R. Vaidyanathan, Gokarna Sharma, J. Trahan
We consider the distributed setting of n autonomous mobile robots operating in Look-Compute-Move (LCM) cycles on the real plane. Robots may be without lights (the classic oblivious robots model) or equipped with lights (the robots with lights model). Under obstructed visibility, a robot cannot see another robot if a third robot is positioned between them on the straight line connecting them, but it is not the case under unobstructed visibility. Robots are said to collide if they share positions or their paths intersect within concurrent LCM cycles. In this paper, we introduce and study Doorway Egress, the problem of robots exiting through a doorway from one side of a wall to the other; initially, the robots are positioned at distinct positions on one side of a wall. We study time-efficient solutions where time is measured using a standard notion of epochs – an epoch is a duration in which each robot completes at least one LCM cycle. For solutions to Doorway Egress with only 1 epoch, we: design an asynchronous algorithm if collisions are allowed; prove that an asynchronous algorithm is impossible if collisions are not allowed; and design a semi-synchronous algorithm without collisions. To further investigate asynchronous algorithms without collisions, we present algorithms with different combinations of robot abilities: O(1) epochs with lights under obstructed visibility; O(1) epochs without lights under unobstructed visibility; and O(n) epochs without lights under obstructed visibility. Our results reveal dependencies and trade-offs among obstructed/unobstructed visibility, lights/no lights, and semi-synchronous/asynchronous settings.
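The obstructed-visibility rule quoted above has a simple geometric reading; the sketch below (our own illustration, not from the paper) tests whether a third robot lies strictly on the open segment between two others, which is exactly when it blocks their mutual view.

```python
# Obstructed-visibility predicate on the real plane: w blocks the view
# between u and v iff w is collinear with them and strictly between them.
def blocks(u, v, w, eps=1e-9):
    (ux, uy), (vx, vy), (wx, wy) = u, v, w
    cross = (vx - ux) * (wy - uy) - (vy - uy) * (wx - ux)
    if abs(cross) > eps:                       # not collinear -> cannot block
        return False
    dot = (wx - ux) * (vx - ux) + (wy - uy) * (vy - uy)
    seg_len2 = (vx - ux) ** 2 + (vy - uy) ** 2
    return eps < dot < seg_len2 - eps          # strictly between u and v

def can_see(u, v, others):
    return all(not blocks(u, v, w) for w in others)

robots = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
u, v = robots[0], robots[1]
print(can_see(u, v, [robots[2]]))   # False: (1,0) sits on the segment
print(can_see(u, v, [robots[3]]))   # True: (1,1) is off the line
```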
{"title":"On Doorway Egress by Autonomous Robots","authors":"Rory Hector, R. Vaidyanathan, Gokarna Sharma, J. Trahan","doi":"10.1109/IPDPS54959.2023.00039","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00039","url":null,"abstract":"We consider the distributed setting of n autonomous mobile robots operating in Look-Compute-Move (LCM) cycles on the real plane. Robots may be without lights (the classic oblivious robots model) or equipped with lights (the robots with lights model). Under obstructed visibility, a robot cannot see another robot if a third robot is positioned between them on the straight line connecting them, but it is not the case under unobstructed visibility. Robots are said to collide if they share positions or their paths intersect within concurrent LCM cycles. In this paper, we introduce and study Doorway Egress, the problem of robots exiting through a doorway from one side of a wall to the other; initially, the robots are positioned at distinct positions on one side of a wall.We study time-efficient solutions where time is measured using a standard notion of epochs – an epoch is a duration in which each robot completes at least one LCM cycle. For solutions to Doorway Egress with only 1 epoch, we: design an asynchronous algorithm if collisions are allowed; prove that an asynchronous algorithm is impossible if collisions are not allowed; and design a semi-synchronous algorithm without collisions. To further investigate asynchronous algorithms without collisions, we present algorithms with different combinations of robot abilities:•O(1) epochs with lights under obstructed visibility;•O(1) epochs without lights under unobstructed visibility; and•O(n) epochs without lights under obstructed visibility.Our results reveal dependencies and trade-offs among obstructed/unobstructed visibility, lights/no lights, and semi-synchronous/asynchronous settings.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115860835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SBGT: Scaling Bayesian-based Group Testing for Disease Surveillance
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00099
Weicong Chen, Hao Qi, Xiaoyi Lu, C. Tatsuoka
The COVID-19 pandemic underscored the necessity of disease surveillance using group testing. Novel Bayesian methods using lattice models were proposed, which offer substantial improvements in group testing efficiency by precisely quantifying uncertainty in diagnoses, acknowledging varying individual risk and dilution effects, and guiding optimally convergent sequential pooled test selections using a Bayesian Halving Algorithm. Computationally, however, Bayesian group testing poses considerable challenges, as its computational complexity grows exponentially with sample size. This can prevent such methods from reaching a desirable scale without running into practical limitations. We propose SBGT, a new framework for scaling Bayesian group testing based on Spark. We show that SBGT is lightning fast and highly scalable. In particular, SBGT is up to 376x, 1733x, and 1523x faster than the state-of-the-art framework in manipulating lattice models, performing test selections, and conducting statistical analyses, respectively, while achieving up to 97.9% scaling efficiency at up to 4096 CPU cores. More importantly, SBGT fulfills our mission of reaching an applicable scale for guiding pooling decisions in wide-scale disease surveillance and other large-scale group testing applications.
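For intuition, the sketch below performs the basic Bayesian update for a single pooled test over a small pool, enumerating all 2^k infection patterns; this brute-force enumeration is exactly the exponential cost that lattice-model machinery and a Spark backend are built to scale past. The sensitivity and specificity values are placeholders, not figures from the paper.

```python
# Posterior infection probabilities for pool members after one pooled test,
# assuming independent priors and an imperfect test on the pooled sample.
from itertools import product

def posterior_infection_probs(priors, test_positive,
                              sensitivity=0.95, specificity=0.99):
    k = len(priors)
    post = [0.0] * k
    total = 0.0
    for pattern in product([0, 1], repeat=k):          # every infection pattern
        prior = 1.0
        for p, s in zip(priors, pattern):
            prior *= p if s else (1.0 - p)
        pool_infected = any(pattern)
        p_pos = sensitivity if pool_infected else (1.0 - specificity)
        likelihood = p_pos if test_positive else (1.0 - p_pos)
        weight = prior * likelihood
        total += weight
        for i, s in enumerate(pattern):
            if s:
                post[i] += weight
    return [x / total for x in post]

# Three individuals with different prior risks, pooled into one positive test.
print(posterior_infection_probs([0.01, 0.05, 0.20], test_positive=True))
```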
{"title":"SBGT: Scaling Bayesian-based Group Testing for Disease Surveillance","authors":"Weicong Chen, Hao Qi, Xiaoyi Lu, C. Tatsuoka","doi":"10.1109/IPDPS54959.2023.00099","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00099","url":null,"abstract":"The COVID-19 pandemic underscored the necessity for disease surveillance using group testing. Novel Bayesian methods using lattice models were proposed, which offer substantial improvements in group testing efficiency by precisely quantifying uncertainty in diagnoses, acknowledging varying individual risk and dilution effects, and guiding optimally convergent sequential pooled test selections using a Bayesian Halving Algorithm. Computationally, however, Bayesian group testing poses considerable challenges as computational complexity grows exponentially with sample size. This can lead to shortcomings in reaching a desirable scale without practical limitations. We propose a new framework for scaling Bayesian group testing based on Spark: SBGT. We show that SBGT is lightning fast and highly scalable. In particular, SBGT is up to 376x, 1733x, and 1523x faster than the state-of-the-art framework in manipulating lattice models, performing test selections, and conducting statistical analyses, respectively, while achieving up to 97.9% scaling efficiency up to 4096 CPU cores. More importantly, SBGT fulfills our mission towards reaching applicable scale for guiding pooling decisions in wide-scale disease surveillance, and other large scale group testing applications.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"23 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132193061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluating Asynchronous Parallel I/O on HPC Systems
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00030
J. Ravi, S. Byna, Q. Koziol, Houjun Tang, M. Becchi
Parallel I/O is an effective method to optimize data movement between memory and storage for many scientific applications. Poor performance of traditional disk-based file systems has led to the design of I/O libraries which take advantage of faster memory layers, such as on-node memory, present in high-performance computing (HPC) systems. By allowing caching and prefetching of data for applications alternating computation and I/O phases, a faster memory layer also provides opportunities for hiding the latency of I/O phases by overlapping them with computation phases, a technique called asynchronous I/O. Since asynchronous parallel I/O in HPC systems is still in the initial stages of development, there hasn't been a systematic study of the factors affecting its performance. In this paper, we perform a systematic study of various factors affecting the performance and efficacy of asynchronous I/O, we develop a performance model to estimate the aggregate I/O bandwidth achievable by iterative applications using synchronous and asynchronous I/O based on past observations, and we evaluate the performance of the recently developed asynchronous I/O feature of a parallel I/O library (HDF5) using benchmarks and real-world science applications. Our study covers parallel file systems on two large-scale HPC systems: Summit and Cori, the former with a GPFS storage and the latter with a Lustre parallel file system.
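A back-of-the-envelope version of such a performance model, under the simple assumption that an asynchronous write can hide behind the next compute phase, looks like the following; the model form and constants are our own illustration, not the paper's fitted model.

```python
# Sync vs. async I/O time for an iterative application: n iterations,
# each with compute time t_c and I/O time t_io (seconds).
def sync_time(n, t_c, t_io):
    return n * (t_c + t_io)                      # I/O serialized after compute

def async_time(n, t_c, t_io):
    # First compute runs alone; afterwards each step is limited by the slower
    # of (next compute, pending write); the final write cannot be hidden.
    return t_c + (n - 1) * max(t_c, t_io) + t_io

def aggregate_bw(n, bytes_per_iter, total_time):
    return n * bytes_per_iter / total_time       # bytes per second

n, t_c, t_io, nbytes = 100, 2.0, 1.5, 4 * 1024**3
for name, t in (("sync", sync_time(n, t_c, t_io)),
                ("async", async_time(n, t_c, t_io))):
    print(f"{name:5s} total {t:7.1f} s, aggregate I/O bw "
          f"{aggregate_bw(n, nbytes, t) / 1024**3:.2f} GiB/s")
```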
{"title":"Evaluating Asynchronous Parallel I/O on HPC Systems","authors":"J. Ravi, S. Byna, Q. Koziol, Houjun Tang, M. Becchi","doi":"10.1109/IPDPS54959.2023.00030","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00030","url":null,"abstract":"Parallel I/O is an effective method to optimize data movement between memory and storage for many scientific applications. Poor performance of traditional disk-based file systems has led to the design of I/O libraries which take advantage of faster memory layers, such as on-node memory, present in high-performance computing (HPC) systems. By allowing caching and prefetching of data for applications alternating computation and I/O phases, a faster memory layer also provides opportunities for hiding the latency of I/O phases by overlapping them with computation phases, a technique called asynchronous I/O. Since asynchronous parallel I/O in HPC systems is still in the initial stages of development, there hasn't been a systematic study of the factors affecting its performance.In this paper, we perform a systematic study of various factors affecting the performance and efficacy of asynchronous I/O, we develop a performance model to estimate the aggregate I/O bandwidth achievable by iterative applications using synchronous and asynchronous I/O based on past observations, and we evaluate the performance of the recently developed asynchronous I/O feature of a parallel I/O library (HDF5) using benchmarks and real-world science applications. Our study covers parallel file systems on two large-scale HPC systems: Summit and Cori, the former with a GPFS storage and the latter with a Lustre parallel file system.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124095299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
AnyQ: An Evaluation Framework for Massively-Parallel Queue Algorithms
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00079
Michael Kenzel, Stefan Lemme, Richard Membarth, Matthias Kurtenacker, Hugo Devillers, M. Steinberger, P. Slusallek
Concurrent queue algorithms have been subject to extensive research. However, the target hardware and evaluation methodology on which the published results for any two given concurrent queue algorithms are based often share only minimal overlap. A meaningful comparison is, thus, exceedingly difficult. With the continuing trend towards increasingly heterogeneous systems, it is becoming ever more important not only to evaluate and compare novel and existing queue algorithms across a wider range of target architectures, but also to continuously re-evaluate queue algorithms in light of novel architectures and capabilities. To address this need, we present AnyQ, an evaluation framework for concurrent queue algorithms. We design a set of programming abstractions that enable the mapping of concurrent queue algorithms and benchmarks to a wide variety of target architectures. We demonstrate the effectiveness of these abstractions by showing that a queue algorithm expressed in a portable, high-level manner can achieve performance comparable to hand-crafted implementations. We design a system for testing and benchmarking queue algorithms. Using the developed framework, we investigate concurrent queue algorithm performance across a range of both CPU and GPU architectures. In the hope that it may serve the community as a starting point for building a common repository of concurrent queue algorithms as well as a base for future research, all code and data is made available as open source software at https://anydsl.github.io/anyq.
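The kind of producer/consumer throughput benchmark such a framework runs can be sketched in a few lines; here Python's queue.Queue stands in for the queue algorithm under test, whereas AnyQ itself maps benchmarks onto native CPU and GPU backends through its own abstractions.

```python
# Tiny producer/consumer throughput harness: n_threads producers and
# n_threads consumers hammer one queue; report combined enqueue+dequeue rate.
import queue
import threading
import time

def bench(q, n_threads=4, ops_per_thread=50_000):
    def producer():
        for i in range(ops_per_thread):
            q.put(i)
    def consumer():
        for _ in range(ops_per_thread):
            q.get()
    threads = [threading.Thread(target=f)
               for f in [producer] * n_threads + [consumer] * n_threads]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    return 2 * n_threads * ops_per_thread / elapsed   # ops per second

print(f"{bench(queue.Queue()):,.0f} queue ops/s")
```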
{"title":"AnyQ: An Evaluation Framework for Massively-Parallel Queue Algorithms","authors":"Michael Kenzel, Stefan Lemme, Richard Membarth, Matthias Kurtenacker, Hugo Devillers, M. Steinberger, P. Slusallek","doi":"10.1109/IPDPS54959.2023.00079","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00079","url":null,"abstract":"Concurrent queue algorithms have been subject to extensive research. However, the target hardware and evaluation methodology on which the published results for any two given concurrent queue algorithms are based often share only minimal overlap. A meaningful comparison is, thus, exceedingly difficult. With the continuing trend towards more and more heterogeneous systems, it is becoming more and more important to not only evaluate and compare novel and existing queue algorithms across a wider range of target architectures, but to also be able to continuously re-evaluate queue algorithms in light of novel architectures and capabilities.To address this need, we present AnyQ, an evaluation framework for concurrent queue algorithms. We design a set of programming abstractions that enable the mapping of concurrent queue algorithms and benchmarks to a wide variety of target architectures. We demonstrate the effectiveness of these abstractions by showing that a queue algorithm expressed in a portable, high-level manner can achieve performance comparable to hand-crafted implementations. We design a system for testing and benchmarking queue algorithms. Using the developed framework, we investigate concurrent queue algorithm performance across a range of both CPU as well as GPU architectures. In hopes that it may serve the community as a starting point for building a common repository of concurrent queue algorithms as well as a base for future research, all code and data is made available as open source software at https://anydsl.github.io/anyq.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128197337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast And Automatic Floating Point Error Analysis With CHEF-FP
Pub Date: 2023-04-13 | DOI: 10.1109/IPDPS54959.2023.00105
Garima Singh, B. Kundu, Harshitha Menon, A. Penev, David J. Lange, V. Vassilev
As we reach the limit of Moore’s Law, researchers are exploring different paradigms to achieve unprecedented performance. Approximate Computing (AC), which relies on the ability of applications to tolerate some error in the results to trade off accuracy for performance, has shown significant promise. Despite the success of AC in domains such as Machine Learning, its acceptance in High-Performance Computing (HPC) is limited due to the stringent accuracy requirements of HPC applications. We need tools and techniques to identify regions of the code that are amenable to approximation and to assess the impact of approximation on the application output quality, so as to guide developers in employing selective approximation. To this end, we propose CHEF-FP, a flexible, scalable, and easy-to-use source-code transformation tool based on Automatic Differentiation (AD) for analysing approximation errors in HPC applications. CHEF-FP uses Clad, an efficient AD tool built as a plugin to the Clang compiler and based on the LLVM compiler infrastructure, as a backend and utilizes its AD abilities to evaluate approximation errors in C++ code. CHEF-FP works at the source level by injecting error-estimation code into the generated adjoints. This enables the error-estimation code to undergo compiler optimizations, resulting in improved analysis time and reduced memory usage. We also provide theoretical and architectural augmentations to source-code-transformation-based AD tools to perform FP error analysis. In this paper, we primarily focus on analyzing errors introduced by mixed-precision AC techniques, the most popular approximation technique in HPC. We also show the applicability of our tool in estimating other kinds of errors by evaluating it on codes that use approximate functions. Moreover, we demonstrate the speedups CHEF-FP achieves at analysis time over the existing state-of-the-art tool, a result of its ability to generate and insert approximation-error-estimation code directly into the derivative source. The generated code also becomes a candidate for better compiler optimizations, contributing to lower runtime performance overhead.
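The underlying idea, bounding the output error induced by lower-precision inputs using first derivatives, can be sketched by hand for a toy kernel; CHEF-FP automates this with Clad-generated adjoints and source-level instrumentation. The kernel, its hand-written gradient, and the constants below are our own example, not code from the tool.

```python
# First-order, derivative-based estimate of the error introduced by demoting
# double-precision inputs to single precision: |error| <~ sum |df/dx_i| * |dx_i|.
import numpy as np

def f(x, y):
    return x * x * y + y          # toy kernel

def grad_f(x, y):
    return np.array([2 * x * y, x * x + 1.0])   # hand-written adjoint of f

def fp32_rounding_error(v):
    # Magnitude of the perturbation caused by rounding a double to float.
    return abs(float(np.float32(v)) - v)

x, y = 1.0 / 3.0, 7.0 / 11.0
estimated = grad_f(x, y) @ np.array([fp32_rounding_error(x),
                                     fp32_rounding_error(y)])
measured = abs(f(x, y) - f(float(np.float32(x)), float(np.float32(y))))
print(f"estimated error {estimated:.3e}, measured error {measured:.3e}")
```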
{"title":"Fast And Automatic Floating Point Error Analysis With CHEF-FP","authors":"Garima Singh, B. Kundu, Harshitha Menon, A. Penev, David J. Lange, V. Vassilev","doi":"10.1109/IPDPS54959.2023.00105","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00105","url":null,"abstract":"As we reach the limit of Moore’s Law, researchers are exploring different paradigms to achieve unprecedented performance. Approximate Computing (AC), which relies on the ability of applications to tolerate some error in the results to trade-off accuracy for performance, has shown significant promise. Despite the success of AC in domains such as Machine Learning, its acceptance in High-Performance Computing (HPC) is limited due to its stringent requirement of accuracy. We need tools and techniques to identify regions of the code that are amenable to approximations and their impact on the application output quality so as to guide developers to employ selective approximation. To this end, we propose CHEF-FP, a flexible, scalable, and easy-to-use source-code transformation tool based on Automatic Differentiation (AD) for analysing approximation errors in HPC applications.CHEF-FP uses Clad, an efficient AD tool built as a plugin to the Clang compiler and based on the LLVM compiler infrastructure, as a backend and utilizes its AD abilities to evaluate approximation errors in C++ code. CHEF-FP works at the source level by injecting error estimation code into the generated adjoints. This enables the error-estimation code to undergo compiler optimizations resulting in improved analysis time and reduced memory usage. We also provide theoretical and architectural augmentations to source code transformation-based AD tools to perform FP error analysis. In this paper, we primarily focus on analyzing errors introduced by mixed-precision AC techniques, the most popular approximate technique in HPC. We also show the applicability of our tool in estimating other kinds of errors by evaluating our tool on codes that use approximate functions. Moreover, we demonstrate the speedups achieved by CHEF-FP during analysis time as compared to the existing state-of-the-art tool as a result of its ability to generate and insert approximation error estimate code directly into the derivative source. The generated code also becomes a candidate for better compiler optimizations contributing to lesser runtime performance overhead.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130861414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FedTrip: A Resource-Efficient Federated Learning Method with Triplet Regularization
Pub Date: 2023-04-12 | DOI: 10.1109/IPDPS54959.2023.00086
Xujing Li, Min Liu, Sheng Sun, Yuwei Wang, Hui Jiang, Xue Jiang
In the federated learning scenario, geographically distributed clients collaboratively train a global model. Data heterogeneity among clients results in inconsistent model updates, which significantly slow down model convergence. To alleviate this issue, many methods employ regularization terms to narrow the discrepancy between client-side local models and the server-side global model. However, these methods limit the ability to explore superior local models and ignore the valuable information in historical models. Besides, although the up-to-date representation method considers the global and historical local models simultaneously, it suffers from unbearable computation cost. To accelerate convergence with low resource consumption, we propose a model regularization method named FedTrip, which is designed to restrict global-local divergence and decrease current-historical correlation, alleviating the negative effects of data heterogeneity. FedTrip keeps the current local model close to the global model while pushing it away from historical local models, which helps guarantee the consistency of local updates among clients and efficiently explore superior local models with negligible additional computation cost for the attached operations. Empirically, we demonstrate the superiority of FedTrip via extensive evaluations. To achieve the target accuracy, FedTrip outperforms state-of-the-art baselines by significantly reducing the total overhead of client-server communication and local computation.
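A sketch of what a triplet-style regularizer of this kind could look like is given below; the specific quadratic form and weights are illustrative assumptions, not FedTrip's published formula.

```python
# Triplet-style regularizer in the spirit of FedTrip: pull the current local
# model toward the global model, push it away from a historical local model.
import numpy as np

def triplet_regularizer(w_local, w_global, w_hist, mu=1.0, beta=0.5):
    pull = np.sum((w_local - w_global) ** 2)   # global-local divergence term
    push = np.sum((w_local - w_hist) ** 2)     # current-historical proximity term
    return mu * pull - beta * push             # small when near global, far from history

def local_objective(task_loss, w_local, w_global, w_hist):
    return task_loss + triplet_regularizer(w_local, w_global, w_hist)

rng = np.random.default_rng(0)
w_g = rng.normal(size=10)                        # global model
w_h = w_g + rng.normal(scale=0.5, size=10)       # stale historical local model
w_near_global = w_g + 0.01 * rng.normal(size=10)
w_near_history = w_h + 0.01 * rng.normal(size=10)
# A model near the global one incurs a smaller penalty than one near history.
print(triplet_regularizer(w_near_global, w_g, w_h) <
      triplet_regularizer(w_near_history, w_g, w_h))   # -> True
```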
{"title":"FedTrip: A Resource-Efficient Federated Learning Method with Triplet Regularization","authors":"Xujing Li, Min Liu, Sheng Sun, Yuwei Wang, Hui Jiang, Xue Jiang","doi":"10.1109/IPDPS54959.2023.00086","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00086","url":null,"abstract":"In the federated learning scenario, geographically distributed clients collaboratively train a global model. Data heterogeneity among clients significantly results in inconsistent model updates, which evidently slow down model convergence. To alleviate this issue, many methods employ regularization terms to narrow the discrepancy between client-side local models and the server-side global model. However, these methods impose limitations on the ability to explore superior local models and ignore the valuable information in historical models. Besides, although the up-to-date representation method simultaneously concerns the global and historical local models, it suffers from unbearable computation cost. To accelerate convergence with low resource consumption, we innovatively propose a model regularization method named FedTrip, which is designed to restrict global-local divergence and decrease current-historical correlation for alleviating the negative effects derived from data heterogeneity. FedTrip helps the current local model to be close to the global model while keeping away from historical local models, which contributes to guaranteeing the consistency of local updates among clients and efficiently exploring superior local models with negligible additional computation cost on attaching operations. Empirically, we demonstrate the superiority of FedTrip via extensive evaluations. To achieve the target accuracy, FedTrip outperforms the state-of-the-art baselines in terms of significantly reducing the total overhead of client-server communication and local computation.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"71 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114314000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}