A Novel Framework for Efficient Offloading of Communication Operations to Bluefield SmartNICs
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00022
K. Suresh, Benjamin Michalowicz, B. Ramesh, Nicholas Contini, Jinghan Yao, Shulei Xu, A. Shafi, H. Subramoni, D. Panda
Smart Network Interface Cards (SmartNICs) such as NVIDIA’s BlueField Data Processing Units (DPUs) provide advanced networking capabilities and processor cores, enabling the offload of complex operations away from the host. In the context of MPI, prior work has explored the use of DPUs to offload non-blocking collective operations. The limitations of current state-of-the-art approaches are twofold: They only work for a pre-defined set of algorithms/communication patterns and have degraded communication latency due to staging data between the DPU and the host. In this paper, we propose a framework that supports the offload of any communication pattern to the DPU while achieving low communication latency with perfect overlap. To achieve this, we first study the limitations of higher-level programming models such as MPI in expressing the offload of complex communication patterns to the DPU. We present a new set of APIs to alleviate these shortcomings and support any generic communication pattern. Then, we analyze the bottlenecks involved in offloading communication operations to the DPU and propose efficient designs for a few candidate communication patterns. To the best of our knowledge, this is the first framework providing both efficient and generic communication offload to the DPU. Our proposed framework outperforms state-of-the-art staging-based offload solutions by 47% in Alltoall micro-benchmarks, and at the application level, we see improvements up to 60% in P3DFFT and 15% in HPL on 512 processes.
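As a rough illustration of the descriptor-style offload the abstract describes, the sketch below hands an entire communication pattern to a background worker that stands in for the DPU-resident progress engine, letting the host compute while the pattern completes. All names here (CommOp, OffloadEngine) are hypothetical illustrations, not the authors' API; the real framework drives MPI/DPU primitives rather than Python threads.

```python
# Hypothetical sketch of a descriptor-based offload API; a worker thread
# stands in for the DPU. The host describes the whole pattern once, hands it
# off, and overlaps its own computation with the "offloaded" progress loop.
import threading
import time
from dataclasses import dataclass

@dataclass
class CommOp:
    kind: str      # "send" or "recv"
    peer: int      # rank of the communication partner
    nbytes: int    # message size

class OffloadEngine:
    """Stands in for the DPU-resident progress engine."""
    def __init__(self):
        self.done = threading.Event()

    def offload(self, pattern):
        # Non-blocking: the "DPU" progresses every op without host involvement.
        def progress():
            for _op in pattern:
                time.sleep(0.001)   # placeholder for posting/completing the op
            self.done.set()
        threading.Thread(target=progress, daemon=True).start()

    def wait(self):
        self.done.wait()

if __name__ == "__main__":
    # An Alltoall-like pattern among 4 ranks, expressed as explicit send/recv ops.
    pattern = ([CommOp("send", peer=p, nbytes=1 << 20) for p in range(4)]
               + [CommOp("recv", peer=p, nbytes=1 << 20) for p in range(4)])
    engine = OffloadEngine()
    engine.offload(pattern)                            # returns immediately
    host_result = sum(i * i for i in range(100_000))   # overlapped host compute
    engine.wait()                                      # complete the offloaded pattern
    print("compute result:", host_result)
```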
{"title":"A Novel Framework for Efficient Offloading of Communication Operations to Bluefield SmartNICs","authors":"K. Suresh, Benjamin Michalowicz, B. Ramesh, Nicholas Contini, Jinghan Yao, Shulei Xu, A. Shafi, H. Subramoni, D. Panda","doi":"10.1109/IPDPS54959.2023.00022","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00022","url":null,"abstract":"Smart Network Interface Cards (SmartNICs) such as NVIDIA’s BlueField Data Processing Units (DPUs) provide advanced networking capabilities and processor cores, enabling the offload of complex operations away from the host. In the context of MPI, prior work has explored the use of DPUs to offload non-blocking collective operations. The limitations of current state-of-the-art approaches are twofold: They only work for a pre-defined set of algorithms/communication patterns and have degraded communication latency due to staging data between the DPU and the host. In this paper, we propose a framework that supports the offload of any communication pattern to the DPU while achieving low communication latency with perfect overlap. To achieve this, we first study the limitations of higher-level programming models such as MPI in expressing the offload of complex communication patterns to the DPU. We present a new set of APIs to alleviate these shortcomings and support any generic communication pattern. Then, we analyze the bottlenecks involved in offloading communication operations to the DPU and propose efficient designs for a few candidate communication patterns. To the best of our knowledge, this is the first framework providing both efficient and generic communication offload to the DPU. Our proposed framework outperforms state-of-the-art staging-based offload solutions by 47% in Alltoall micro-benchmarks, and at the application level, we see improvements up to 60% in P3DFFT and 15% in HPL on 512 processes.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132980088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PFedSA: Personalized Federated Multi-Task Learning via Similarity Awareness
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00055
Chuyao Ye, Hao Zheng, Zhi-gang Hu, Meiguang Zheng
Federated Learning (FL) constructs a distributed machine learning framework in which multiple remote clients collaboratively train models. However, in real-world situations, non-Independent and Identically Distributed (non-IID) data means that the global model produced by traditional FL algorithms no longer meets the needs of all clients, and its accuracy is greatly reduced. In this paper, we propose a personalized federated multi-task learning method via similarity awareness (PFedSA), which captures the similarity between client data through the model parameters uploaded by clients, thus facilitating collaborative training of similar clients and providing personalized models based on each client’s data distribution. Specifically, it generates the intrinsic cluster structure among clients and introduces personalized patch layers into each cluster to personalize the cluster model. PFedSA also maintains the generalization ability of the models: during training, each client benefits from nodes with similar data distributions, and the greater the similarity, the larger the benefit. We evaluate the performance of PFedSA using the MNIST, EMNIST and CIFAR10 datasets, and investigate the impact of different data setting schemes on its performance. The results show that, in all data setting scenarios, PFedSA achieves the best personalization performance, with more clients reaching higher accuracy, and it is especially effective when client data is non-IID.
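A minimal sketch of the similarity-awareness step, assuming clients are grouped by the cosine similarity of their uploaded (flattened) model parameters with a greedy threshold rule; the paper's actual clustering procedure and similarity measure may differ.

```python
# Cluster clients whose uploaded parameters are close, so similar clients
# train together; threshold and greedy grouping are illustrative choices.
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def cluster_clients(client_params, threshold=0.9):
    """client_params: list of 1-D arrays (flattened model parameters)."""
    clusters = []  # each cluster is a list of client indices
    for i, p in enumerate(client_params):
        for cluster in clusters:
            rep = client_params[cluster[0]]            # cluster representative
            if cosine_similarity(p, rep) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])                       # start a new cluster
    return clusters

# Example: clients 0 and 1 share one data distribution, client 2 another.
rng = np.random.default_rng(0)
base_a, base_b = rng.normal(size=100), rng.normal(size=100)
params = [base_a + 0.01 * rng.normal(size=100),
          base_a + 0.01 * rng.normal(size=100),
          base_b + 0.01 * rng.normal(size=100)]
print(cluster_clients(params))   # -> [[0, 1], [2]]
```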
{"title":"PFedSA: Personalized Federated Multi-Task Learning via Similarity Awareness","authors":"Chuyao Ye, Hao Zheng, Zhi-gang Hu, Meiguang Zheng","doi":"10.1109/IPDPS54959.2023.00055","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00055","url":null,"abstract":"Federated Learning (FL) constructs a distributed machine learning framework that involves multiple remote clients collaboratively training models. However in real-world situations, the emergence of non-Independent and Identically Distributed (non-IID) data makes the global model generated by traditional FL algorithms no longer meet the needs of all clients, and the accuracy is greatly reduced. In this paper, we propose a personalized federated multi-task learning method via similarity awareness (PFedSA), which captures the similarity between client data through model parameters uploaded by clients, thus facilitating collaborative training of similar clients and providing personalized models based on each client’s data distribution. Specifically, it generates the intrinsic cluster structure among clients and introduces personalized patch layers into the cluster to personalize the cluster model. PFedSA also maintains the generalization ability of models, which allows each client to benefit from nodes with similar data distributions when training data, and the greater the similarity, the more benefit. We evaluate the performance of the PFedSA method using MNIST, EMNIST and CIFAR10 datasets, and investigate the impact of different data setting schemes on the performance of PFedSA. The results show that in all data setting scenarios, the PFedSA method proposed in this paper can achieve the best personalization performance, having more clients with higher accuracy, and it is especially effective when the client’s data is non-IID.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134249511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GraphMetaP: Efficient MetaPath Generation for Dynamic Heterogeneous Graph Models
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00012
Haiheng He, Dan Chen, Long Zheng, Yu Huang, Haifeng Liu, Chao Liu, Xiaofei Liao, Hai Jin
Metapath-based heterogeneous graph models (MHGM) show excellent performance in learning semantic and structural information in heterogeneous graphs. Metapath matching, which finds all metapath instances, is an essential processing step in MHGM and brings significant overhead relative to the total model execution time. Even worse, in dynamic heterogeneous graphs, metapath instances must be rematched whenever the graph is updated. In this paper, we observe that only a small fraction of metapath instances change and propose GraphMetaP, an efficient incremental metapath maintenance method that eliminates the matching overhead in dynamic heterogeneous graphs. GraphMetaP introduces a novel format for metapath instances that captures the dependencies among them. The format incrementally maintains metapath instances based on the graph updates, avoiding the overhead of rematching metapaths on the updated graph. Furthermore, GraphMetaP uses a folding scheme to simplify the format so that all metapath instances can be recovered faster. Experiments show that GraphMetaP enables efficient maintenance of metapath instances on dynamic heterogeneous graphs and outperforms the metapath matching method by 172.4X on average.
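To make the incremental idea concrete, here is a toy sketch for the Author-Paper-Author metapath: when an edge is inserted, only instances that pass through the new edge are generated, instead of re-matching the whole graph. GraphMetaP's dependency-capturing format is far more general than this hard-coded case, which is our own illustration.

```python
# Incremental maintenance of A-P-A metapath instances: an inserted
# (author, paper) edge can only create instances through that edge.
from collections import defaultdict

authors_of = defaultdict(set)   # paper -> set of its authors
apa_instances = set()           # maintained set of (author1, paper, author2)

def insert_author_paper_edge(author, paper):
    # Only the neighborhood of the new edge yields new A-P-A instances.
    for other in authors_of[paper]:
        apa_instances.add((author, paper, other))
        apa_instances.add((other, paper, author))
    authors_of[paper].add(author)

insert_author_paper_edge("a1", "p1")
insert_author_paper_edge("a2", "p1")   # creates (a1,p1,a2) and (a2,p1,a1)
insert_author_paper_edge("a3", "p1")   # adds 4 more instances incrementally
print(sorted(apa_instances))
```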
{"title":"GraphMetaP: Efficient MetaPath Generation for Dynamic Heterogeneous Graph Models","authors":"Haiheng He, Dan Chen, Long Zheng, Yu Huang, Haifeng Liu, Chao Liu, Xiaofei Liao, Hai Jin","doi":"10.1109/IPDPS54959.2023.00012","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00012","url":null,"abstract":"Metapath-based heterogeneous graph models (MHGM) show excellent performance in learning semantic and structural information in heterogeneous graphs. Metapath matching is an essential processing step in MHGM to find all metapath instances, bringing significant overhead compared to the total model execution time. Even worse, in dynamic heterogeneous graphs, metapath instances require to be rematched while graph updated. In this paper, we observe that only a small fraction of metapath instances change and propose GraphMetaP, an efficient incremental metapath maintenance method in order to eliminate the matching overhead in dynamic heterogeneous graphs. GraphMetaP introduces a novel format for metapath instances to capture the dependencies among the metapath instances. The format incrementally maintains metapath instances based on the graph updates to avoide the rematching metapath overhead for the updated graph. Furthermore, GraphMetaP uses the fold way to simplify the format in order to recover all metapath instances faster. Experiments show that GraphMetaP enables efficient maintenance of metapath instances on dynamic heterogeneous graphs and outperforms 172.4X on average compared to the matching metapath method.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124755594","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On Doorway Egress by Autonomous Robots
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00039
Rory Hector, R. Vaidyanathan, Gokarna Sharma, J. Trahan
We consider the distributed setting of n autonomous mobile robots operating in Look-Compute-Move (LCM) cycles on the real plane. Robots may be without lights (the classic oblivious robots model) or equipped with lights (the robots with lights model). Under obstructed visibility, a robot cannot see another robot if a third robot is positioned between them on the straight line connecting them, but it is not the case under unobstructed visibility. Robots are said to collide if they share positions or their paths intersect within concurrent LCM cycles. In this paper, we introduce and study Doorway Egress, the problem of robots exiting through a doorway from one side of a wall to the other; initially, the robots are positioned at distinct positions on one side of a wall. We study time-efficient solutions where time is measured using a standard notion of epochs – an epoch is a duration in which each robot completes at least one LCM cycle. For solutions to Doorway Egress with only 1 epoch, we: design an asynchronous algorithm if collisions are allowed; prove that an asynchronous algorithm is impossible if collisions are not allowed; and design a semi-synchronous algorithm without collisions. To further investigate asynchronous algorithms without collisions, we present algorithms with different combinations of robot abilities: O(1) epochs with lights under obstructed visibility; O(1) epochs without lights under unobstructed visibility; and O(n) epochs without lights under obstructed visibility. Our results reveal dependencies and trade-offs among obstructed/unobstructed visibility, lights/no lights, and semi-synchronous/asynchronous settings.
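The obstructed-visibility rule quoted above has a simple geometric reading; the sketch below (our own illustration, not from the paper) tests whether a third robot lies strictly on the open segment between two others, which is exactly when it blocks their mutual view.

```python
# Obstructed-visibility predicate on the real plane: w blocks the view
# between u and v iff w is collinear with them and strictly between them.
def blocks(u, v, w, eps=1e-9):
    (ux, uy), (vx, vy), (wx, wy) = u, v, w
    cross = (vx - ux) * (wy - uy) - (vy - uy) * (wx - ux)
    if abs(cross) > eps:                       # not collinear -> cannot block
        return False
    dot = (wx - ux) * (vx - ux) + (wy - uy) * (vy - uy)
    seg_len2 = (vx - ux) ** 2 + (vy - uy) ** 2
    return eps < dot < seg_len2 - eps          # strictly between u and v

def can_see(u, v, others):
    return all(not blocks(u, v, w) for w in others)

robots = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
u, v = robots[0], robots[1]
print(can_see(u, v, [robots[2]]))   # False: (1,0) sits on the segment
print(can_see(u, v, [robots[3]]))   # True: (1,1) is off the line
```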
{"title":"On Doorway Egress by Autonomous Robots","authors":"Rory Hector, R. Vaidyanathan, Gokarna Sharma, J. Trahan","doi":"10.1109/IPDPS54959.2023.00039","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00039","url":null,"abstract":"We consider the distributed setting of n autonomous mobile robots operating in Look-Compute-Move (LCM) cycles on the real plane. Robots may be without lights (the classic oblivious robots model) or equipped with lights (the robots with lights model). Under obstructed visibility, a robot cannot see another robot if a third robot is positioned between them on the straight line connecting them, but it is not the case under unobstructed visibility. Robots are said to collide if they share positions or their paths intersect within concurrent LCM cycles. In this paper, we introduce and study Doorway Egress, the problem of robots exiting through a doorway from one side of a wall to the other; initially, the robots are positioned at distinct positions on one side of a wall.We study time-efficient solutions where time is measured using a standard notion of epochs – an epoch is a duration in which each robot completes at least one LCM cycle. For solutions to Doorway Egress with only 1 epoch, we: design an asynchronous algorithm if collisions are allowed; prove that an asynchronous algorithm is impossible if collisions are not allowed; and design a semi-synchronous algorithm without collisions. To further investigate asynchronous algorithms without collisions, we present algorithms with different combinations of robot abilities:•O(1) epochs with lights under obstructed visibility;•O(1) epochs without lights under unobstructed visibility; and•O(n) epochs without lights under obstructed visibility.Our results reveal dependencies and trade-offs among obstructed/unobstructed visibility, lights/no lights, and semi-synchronous/asynchronous settings.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115860835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SBGT: Scaling Bayesian-based Group Testing for Disease Surveillance
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00099
Weicong Chen, Hao Qi, Xiaoyi Lu, C. Tatsuoka
The COVID-19 pandemic underscored the necessity of disease surveillance using group testing. Novel Bayesian methods using lattice models were proposed, which offer substantial improvements in group testing efficiency by precisely quantifying uncertainty in diagnoses, acknowledging varying individual risk and dilution effects, and guiding optimally convergent sequential pooled test selections using a Bayesian Halving Algorithm. Computationally, however, Bayesian group testing poses considerable challenges, as its computational complexity grows exponentially with sample size. This can prevent such methods from reaching a desirable scale without running into practical limitations. We propose SBGT, a new framework for scaling Bayesian group testing based on Spark. We show that SBGT is lightning fast and highly scalable. In particular, SBGT is up to 376x, 1733x, and 1523x faster than the state-of-the-art framework in manipulating lattice models, performing test selections, and conducting statistical analyses, respectively, while achieving up to 97.9% scaling efficiency at up to 4096 CPU cores. More importantly, SBGT fulfills our mission of reaching an applicable scale for guiding pooling decisions in wide-scale disease surveillance and other large-scale group testing applications.
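For intuition, the sketch below performs the basic Bayesian update for a single pooled test over a small pool, enumerating all 2^k infection patterns; this brute-force enumeration is exactly the exponential cost that lattice-model machinery and a Spark backend are built to scale past. The sensitivity and specificity values are placeholders, not figures from the paper.

```python
# Posterior infection probabilities for pool members after one pooled test,
# assuming independent priors and an imperfect test on the pooled sample.
from itertools import product

def posterior_infection_probs(priors, test_positive,
                              sensitivity=0.95, specificity=0.99):
    k = len(priors)
    post = [0.0] * k
    total = 0.0
    for pattern in product([0, 1], repeat=k):          # every infection pattern
        prior = 1.0
        for p, s in zip(priors, pattern):
            prior *= p if s else (1.0 - p)
        pool_infected = any(pattern)
        p_pos = sensitivity if pool_infected else (1.0 - specificity)
        likelihood = p_pos if test_positive else (1.0 - p_pos)
        weight = prior * likelihood
        total += weight
        for i, s in enumerate(pattern):
            if s:
                post[i] += weight
    return [x / total for x in post]

# Three individuals with different prior risks, pooled into one positive test.
print(posterior_infection_probs([0.01, 0.05, 0.20], test_positive=True))
```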
{"title":"SBGT: Scaling Bayesian-based Group Testing for Disease Surveillance","authors":"Weicong Chen, Hao Qi, Xiaoyi Lu, C. Tatsuoka","doi":"10.1109/IPDPS54959.2023.00099","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00099","url":null,"abstract":"The COVID-19 pandemic underscored the necessity for disease surveillance using group testing. Novel Bayesian methods using lattice models were proposed, which offer substantial improvements in group testing efficiency by precisely quantifying uncertainty in diagnoses, acknowledging varying individual risk and dilution effects, and guiding optimally convergent sequential pooled test selections using a Bayesian Halving Algorithm. Computationally, however, Bayesian group testing poses considerable challenges as computational complexity grows exponentially with sample size. This can lead to shortcomings in reaching a desirable scale without practical limitations. We propose a new framework for scaling Bayesian group testing based on Spark: SBGT. We show that SBGT is lightning fast and highly scalable. In particular, SBGT is up to 376x, 1733x, and 1523x faster than the state-of-the-art framework in manipulating lattice models, performing test selections, and conducting statistical analyses, respectively, while achieving up to 97.9% scaling efficiency up to 4096 CPU cores. More importantly, SBGT fulfills our mission towards reaching applicable scale for guiding pooling decisions in wide-scale disease surveillance, and other large scale group testing applications.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"23 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132193061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluating Asynchronous Parallel I/O on HPC Systems
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00030
J. Ravi, S. Byna, Q. Koziol, Houjun Tang, M. Becchi
Parallel I/O is an effective method to optimize data movement between memory and storage for many scientific applications. Poor performance of traditional disk-based file systems has led to the design of I/O libraries which take advantage of faster memory layers, such as on-node memory, present in high-performance computing (HPC) systems. By allowing caching and prefetching of data for applications alternating computation and I/O phases, a faster memory layer also provides opportunities for hiding the latency of I/O phases by overlapping them with computation phases, a technique called asynchronous I/O. Since asynchronous parallel I/O in HPC systems is still in the initial stages of development, there hasn't been a systematic study of the factors affecting its performance. In this paper, we perform a systematic study of various factors affecting the performance and efficacy of asynchronous I/O, we develop a performance model to estimate the aggregate I/O bandwidth achievable by iterative applications using synchronous and asynchronous I/O based on past observations, and we evaluate the performance of the recently developed asynchronous I/O feature of a parallel I/O library (HDF5) using benchmarks and real-world science applications. Our study covers parallel file systems on two large-scale HPC systems: Summit and Cori, the former with a GPFS storage and the latter with a Lustre parallel file system.
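A back-of-the-envelope version of such a performance model, under the simple assumption that an asynchronous write can hide behind the next compute phase, looks like the following; the model form and constants are our own illustration, not the paper's fitted model.

```python
# Sync vs. async I/O time for an iterative application: n iterations,
# each with compute time t_c and I/O time t_io (seconds).
def sync_time(n, t_c, t_io):
    return n * (t_c + t_io)                      # I/O serialized after compute

def async_time(n, t_c, t_io):
    # First compute runs alone; afterwards each step is limited by the slower
    # of (next compute, pending write); the final write cannot be hidden.
    return t_c + (n - 1) * max(t_c, t_io) + t_io

def aggregate_bw(n, bytes_per_iter, total_time):
    return n * bytes_per_iter / total_time       # bytes per second

n, t_c, t_io, nbytes = 100, 2.0, 1.5, 4 * 1024**3
for name, t in (("sync", sync_time(n, t_c, t_io)),
                ("async", async_time(n, t_c, t_io))):
    print(f"{name:5s} total {t:7.1f} s, aggregate I/O bw "
          f"{aggregate_bw(n, nbytes, t) / 1024**3:.2f} GiB/s")
```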
{"title":"Evaluating Asynchronous Parallel I/O on HPC Systems","authors":"J. Ravi, S. Byna, Q. Koziol, Houjun Tang, M. Becchi","doi":"10.1109/IPDPS54959.2023.00030","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00030","url":null,"abstract":"Parallel I/O is an effective method to optimize data movement between memory and storage for many scientific applications. Poor performance of traditional disk-based file systems has led to the design of I/O libraries which take advantage of faster memory layers, such as on-node memory, present in high-performance computing (HPC) systems. By allowing caching and prefetching of data for applications alternating computation and I/O phases, a faster memory layer also provides opportunities for hiding the latency of I/O phases by overlapping them with computation phases, a technique called asynchronous I/O. Since asynchronous parallel I/O in HPC systems is still in the initial stages of development, there hasn't been a systematic study of the factors affecting its performance.In this paper, we perform a systematic study of various factors affecting the performance and efficacy of asynchronous I/O, we develop a performance model to estimate the aggregate I/O bandwidth achievable by iterative applications using synchronous and asynchronous I/O based on past observations, and we evaluate the performance of the recently developed asynchronous I/O feature of a parallel I/O library (HDF5) using benchmarks and real-world science applications. Our study covers parallel file systems on two large-scale HPC systems: Summit and Cori, the former with a GPFS storage and the latter with a Lustre parallel file system.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124095299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
AnyQ: An Evaluation Framework for Massively-Parallel Queue Algorithms
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00079
Michael Kenzel, Stefan Lemme, Richard Membarth, Matthias Kurtenacker, Hugo Devillers, M. Steinberger, P. Slusallek
Concurrent queue algorithms have been subject to extensive research. However, the target hardware and evaluation methodology on which the published results for any two given concurrent queue algorithms are based often share only minimal overlap. A meaningful comparison is, thus, exceedingly difficult. With the continuing trend towards increasingly heterogeneous systems, it is becoming ever more important not only to evaluate and compare novel and existing queue algorithms across a wider range of target architectures, but also to continuously re-evaluate queue algorithms in light of novel architectures and capabilities. To address this need, we present AnyQ, an evaluation framework for concurrent queue algorithms. We design a set of programming abstractions that enable the mapping of concurrent queue algorithms and benchmarks to a wide variety of target architectures. We demonstrate the effectiveness of these abstractions by showing that a queue algorithm expressed in a portable, high-level manner can achieve performance comparable to hand-crafted implementations. We design a system for testing and benchmarking queue algorithms. Using the developed framework, we investigate concurrent queue algorithm performance across a range of both CPU and GPU architectures. In the hope that it may serve the community as a starting point for building a common repository of concurrent queue algorithms as well as a base for future research, all code and data is made available as open source software at https://anydsl.github.io/anyq.
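The kind of producer/consumer throughput benchmark such a framework runs can be sketched in a few lines; here Python's queue.Queue stands in for the queue algorithm under test, whereas AnyQ itself maps benchmarks onto native CPU and GPU backends through its own abstractions.

```python
# Tiny producer/consumer throughput harness: n_threads producers and
# n_threads consumers hammer one queue; report combined enqueue+dequeue rate.
import queue
import threading
import time

def bench(q, n_threads=4, ops_per_thread=50_000):
    def producer():
        for i in range(ops_per_thread):
            q.put(i)
    def consumer():
        for _ in range(ops_per_thread):
            q.get()
    threads = [threading.Thread(target=f)
               for f in [producer] * n_threads + [consumer] * n_threads]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    return 2 * n_threads * ops_per_thread / elapsed   # ops per second

print(f"{bench(queue.Queue()):,.0f} queue ops/s")
```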
{"title":"AnyQ: An Evaluation Framework for Massively-Parallel Queue Algorithms","authors":"Michael Kenzel, Stefan Lemme, Richard Membarth, Matthias Kurtenacker, Hugo Devillers, M. Steinberger, P. Slusallek","doi":"10.1109/IPDPS54959.2023.00079","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00079","url":null,"abstract":"Concurrent queue algorithms have been subject to extensive research. However, the target hardware and evaluation methodology on which the published results for any two given concurrent queue algorithms are based often share only minimal overlap. A meaningful comparison is, thus, exceedingly difficult. With the continuing trend towards more and more heterogeneous systems, it is becoming more and more important to not only evaluate and compare novel and existing queue algorithms across a wider range of target architectures, but to also be able to continuously re-evaluate queue algorithms in light of novel architectures and capabilities.To address this need, we present AnyQ, an evaluation framework for concurrent queue algorithms. We design a set of programming abstractions that enable the mapping of concurrent queue algorithms and benchmarks to a wide variety of target architectures. We demonstrate the effectiveness of these abstractions by showing that a queue algorithm expressed in a portable, high-level manner can achieve performance comparable to hand-crafted implementations. We design a system for testing and benchmarking queue algorithms. Using the developed framework, we investigate concurrent queue algorithm performance across a range of both CPU as well as GPU architectures. In hopes that it may serve the community as a starting point for building a common repository of concurrent queue algorithms as well as a base for future research, all code and data is made available as open source software at https://anydsl.github.io/anyq.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128197337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast And Automatic Floating Point Error Analysis With CHEF-FP
Pub Date: 2023-04-13 | DOI: 10.1109/IPDPS54959.2023.00105
Garima Singh, B. Kundu, Harshitha Menon, A. Penev, David J. Lange, V. Vassilev
As we reach the limit of Moore’s Law, researchers are exploring different paradigms to achieve unprecedented performance. Approximate Computing (AC), which relies on the ability of applications to tolerate some error in the results to trade off accuracy for performance, has shown significant promise. Despite the success of AC in domains such as Machine Learning, its acceptance in High-Performance Computing (HPC) is limited due to the stringent accuracy requirements of HPC applications. We need tools and techniques to identify regions of the code that are amenable to approximation and to assess the impact of approximation on the application output quality, so as to guide developers in employing selective approximation. To this end, we propose CHEF-FP, a flexible, scalable, and easy-to-use source-code transformation tool based on Automatic Differentiation (AD) for analysing approximation errors in HPC applications. CHEF-FP uses Clad, an efficient AD tool built as a plugin to the Clang compiler and based on the LLVM compiler infrastructure, as a backend and utilizes its AD abilities to evaluate approximation errors in C++ code. CHEF-FP works at the source level by injecting error-estimation code into the generated adjoints. This enables the error-estimation code to undergo compiler optimizations, resulting in improved analysis time and reduced memory usage. We also provide theoretical and architectural augmentations to source-code-transformation-based AD tools to perform FP error analysis. In this paper, we primarily focus on analyzing errors introduced by mixed-precision AC techniques, the most popular approximation technique in HPC. We also show the applicability of our tool in estimating other kinds of errors by evaluating it on codes that use approximate functions. Moreover, we demonstrate the speedups CHEF-FP achieves at analysis time over the existing state-of-the-art tool, a result of its ability to generate and insert approximation-error-estimation code directly into the derivative source. The generated code also becomes a candidate for better compiler optimizations, contributing to lower runtime performance overhead.
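The underlying idea, bounding the output error induced by lower-precision inputs using first derivatives, can be sketched by hand for a toy kernel; CHEF-FP automates this with Clad-generated adjoints and source-level instrumentation. The kernel, its hand-written gradient, and the constants below are our own example, not code from the tool.

```python
# First-order, derivative-based estimate of the error introduced by demoting
# double-precision inputs to single precision: |error| <~ sum |df/dx_i| * |dx_i|.
import numpy as np

def f(x, y):
    return x * x * y + y          # toy kernel

def grad_f(x, y):
    return np.array([2 * x * y, x * x + 1.0])   # hand-written adjoint of f

def fp32_rounding_error(v):
    # Magnitude of the perturbation caused by rounding a double to float.
    return abs(float(np.float32(v)) - v)

x, y = 1.0 / 3.0, 7.0 / 11.0
estimated = grad_f(x, y) @ np.array([fp32_rounding_error(x),
                                     fp32_rounding_error(y)])
measured = abs(f(x, y) - f(float(np.float32(x)), float(np.float32(y))))
print(f"estimated error {estimated:.3e}, measured error {measured:.3e}")
```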
{"title":"Fast And Automatic Floating Point Error Analysis With CHEF-FP","authors":"Garima Singh, B. Kundu, Harshitha Menon, A. Penev, David J. Lange, V. Vassilev","doi":"10.1109/IPDPS54959.2023.00105","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00105","url":null,"abstract":"As we reach the limit of Moore’s Law, researchers are exploring different paradigms to achieve unprecedented performance. Approximate Computing (AC), which relies on the ability of applications to tolerate some error in the results to trade-off accuracy for performance, has shown significant promise. Despite the success of AC in domains such as Machine Learning, its acceptance in High-Performance Computing (HPC) is limited due to its stringent requirement of accuracy. We need tools and techniques to identify regions of the code that are amenable to approximations and their impact on the application output quality so as to guide developers to employ selective approximation. To this end, we propose CHEF-FP, a flexible, scalable, and easy-to-use source-code transformation tool based on Automatic Differentiation (AD) for analysing approximation errors in HPC applications.CHEF-FP uses Clad, an efficient AD tool built as a plugin to the Clang compiler and based on the LLVM compiler infrastructure, as a backend and utilizes its AD abilities to evaluate approximation errors in C++ code. CHEF-FP works at the source level by injecting error estimation code into the generated adjoints. This enables the error-estimation code to undergo compiler optimizations resulting in improved analysis time and reduced memory usage. We also provide theoretical and architectural augmentations to source code transformation-based AD tools to perform FP error analysis. In this paper, we primarily focus on analyzing errors introduced by mixed-precision AC techniques, the most popular approximate technique in HPC. We also show the applicability of our tool in estimating other kinds of errors by evaluating our tool on codes that use approximate functions. Moreover, we demonstrate the speedups achieved by CHEF-FP during analysis time as compared to the existing state-of-the-art tool as a result of its ability to generate and insert approximation error estimate code directly into the derivative source. The generated code also becomes a candidate for better compiler optimizations contributing to lesser runtime performance overhead.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130861414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FedTrip: A Resource-Efficient Federated Learning Method with Triplet Regularization
Pub Date: 2023-04-12 | DOI: 10.1109/IPDPS54959.2023.00086
Xujing Li, Min Liu, Sheng Sun, Yuwei Wang, Hui Jiang, Xue Jiang
In the federated learning scenario, geographically distributed clients collaboratively train a global model. Data heterogeneity among clients results in inconsistent model updates, which significantly slow down model convergence. To alleviate this issue, many methods employ regularization terms to narrow the discrepancy between client-side local models and the server-side global model. However, these methods limit the ability to explore superior local models and ignore the valuable information in historical models. Besides, although the up-to-date representation method considers the global and historical local models simultaneously, it suffers from unbearable computation cost. To accelerate convergence with low resource consumption, we propose a model regularization method named FedTrip, which is designed to restrict global-local divergence and decrease current-historical correlation, alleviating the negative effects of data heterogeneity. FedTrip keeps the current local model close to the global model while pushing it away from historical local models, which helps guarantee the consistency of local updates among clients and efficiently explore superior local models with negligible additional computation cost for the attached operations. Empirically, we demonstrate the superiority of FedTrip via extensive evaluations. To achieve the target accuracy, FedTrip outperforms state-of-the-art baselines by significantly reducing the total overhead of client-server communication and local computation.
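A sketch of what a triplet-style regularizer of this kind could look like is given below; the specific quadratic form and weights are illustrative assumptions, not FedTrip's published formula.

```python
# Triplet-style regularizer in the spirit of FedTrip: pull the current local
# model toward the global model, push it away from a historical local model.
import numpy as np

def triplet_regularizer(w_local, w_global, w_hist, mu=1.0, beta=0.5):
    pull = np.sum((w_local - w_global) ** 2)   # global-local divergence term
    push = np.sum((w_local - w_hist) ** 2)     # current-historical proximity term
    return mu * pull - beta * push             # small when near global, far from history

def local_objective(task_loss, w_local, w_global, w_hist):
    return task_loss + triplet_regularizer(w_local, w_global, w_hist)

rng = np.random.default_rng(0)
w_g = rng.normal(size=10)                        # global model
w_h = w_g + rng.normal(scale=0.5, size=10)       # stale historical local model
w_near_global = w_g + 0.01 * rng.normal(size=10)
w_near_history = w_h + 0.01 * rng.normal(size=10)
# A model near the global one incurs a smaller penalty than one near history.
print(triplet_regularizer(w_near_global, w_g, w_h) <
      triplet_regularizer(w_near_history, w_g, w_h))   # -> True
```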
{"title":"FedTrip: A Resource-Efficient Federated Learning Method with Triplet Regularization","authors":"Xujing Li, Min Liu, Sheng Sun, Yuwei Wang, Hui Jiang, Xue Jiang","doi":"10.1109/IPDPS54959.2023.00086","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00086","url":null,"abstract":"In the federated learning scenario, geographically distributed clients collaboratively train a global model. Data heterogeneity among clients significantly results in inconsistent model updates, which evidently slow down model convergence. To alleviate this issue, many methods employ regularization terms to narrow the discrepancy between client-side local models and the server-side global model. However, these methods impose limitations on the ability to explore superior local models and ignore the valuable information in historical models. Besides, although the up-to-date representation method simultaneously concerns the global and historical local models, it suffers from unbearable computation cost. To accelerate convergence with low resource consumption, we innovatively propose a model regularization method named FedTrip, which is designed to restrict global-local divergence and decrease current-historical correlation for alleviating the negative effects derived from data heterogeneity. FedTrip helps the current local model to be close to the global model while keeping away from historical local models, which contributes to guaranteeing the consistency of local updates among clients and efficiently exploring superior local models with negligible additional computation cost on attaching operations. Empirically, we demonstrate the superiority of FedTrip via extensive evaluations. To achieve the target accuracy, FedTrip outperforms the state-of-the-art baselines in terms of significantly reducing the total overhead of client-server communication and local computation.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"71 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114314000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}