Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00040
Wissam M. Sid-Lakhdar, S. Cayrols, Daniel Bielich, A. Abdelfattah, P. Luszczek, M. Gates, S. Tomov, H. Johansen, David B. Williams-Young, T. Davis, J. Dongarra, H. Anzt
The solution of linear least-squares problems is at the heart of many scientific and engineering applications. While any method able to minimize the backward error of such problems is considered numerically stable, the theory states that the forward error depends on the condition number of the matrix in the system of equations. On the one hand, the QR factorization is an efficient method to solve such problems, but the solutions it produces may have large forward errors when the matrix is rank deficient. On the other hand, rank-revealing QR (RRQR) is able to produce smaller forward errors on rank-deficient matrices, but its cost is prohibitive compared to QR due to memory-inefficient operations. The aim of this paper is to propose PAQR as an alternative method for solving rank-deficient linear least-squares problems. It has the same (or smaller) cost as QR and is as accurate as QR with column pivoting in many practical cases. In addition to presenting the algorithm and its implementations on different hardware architectures, we compare its accuracy and performance results on a variety of application-derived problems.
{"title":"PAQR: Pivoting Avoiding QR factorization","authors":"Wissam M. Sid-Lakhdar, S. Cayrols, Daniel Bielich, A. Abdelfattah, P. Luszczek, M. Gates, S. Tomov, H. Johansen, David B. Williams-Young, T. Davis, J. Dongarra, H. Anzt","doi":"10.1109/IPDPS54959.2023.00040","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00040","journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)"},"publicationDate":"2023-05-01"}
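The PAQR abstract describes a QR factorization that copes with rank deficiency without column pivoting, but it does not spell out the algorithm. The sketch below is only an illustration of the general idea of skipping, rather than swapping, columns that are numerically dependent on earlier ones. It uses modified Gram-Schmidt for brevity, and the function name `qr_skip_dependent`, the relative tolerance `tol`, and the rejection test are all assumptions, not the criterion used in the paper.

```python
import math


def qr_skip_dependent(columns, tol=1e-10):
    """Modified Gram-Schmidt that skips (instead of pivoting) dependent columns.

    columns: list of column vectors, each a list of floats.
    Returns (q, kept): the orthonormal vectors built so far and the
    indices of the input columns that were accepted.
    """
    q, kept = [], []
    for j, col in enumerate(columns):
        v = list(col)
        norm_before = math.sqrt(sum(x * x for x in v))
        # Remove the components along every orthonormal vector accepted so far.
        for u in q:
            dot = sum(a * b for a, b in zip(u, v))
            v = [a - dot * b for a, b in zip(v, u)]
        norm_after = math.sqrt(sum(x * x for x in v))
        # Hypothetical rejection test: almost nothing of the column is left,
        # so it is treated as a linear combination of earlier columns.
        if norm_before == 0.0 or norm_after <= tol * norm_before:
            continue
        q.append([x / norm_after for x in v])
        kept.append(j)
    return q, kept


# Column 1 is twice column 0, so it should be skipped without any column swap.
cols = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
q, kept = qr_skip_dependent(cols)
```

Because rejected columns are simply left out, the remaining columns are processed in their original order, which is what avoids the data movement that column pivoting requires.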
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00037
Michael J. Brim, A. Moody, Seung-Hwan Lim, Ross G. Miller, Swen Boehm, Cameron Stanavige, K. Mohror, S. Oral
We introduce UnifyFS, a user-level file system that aggregates node-local storage tiers available on high-performance computing (HPC) systems and makes them available to HPC applications under a unified namespace. UnifyFS employs transparent I/O interception, so it does not require changes to application code and is compatible with commonly used HPC I/O libraries. The design of UnifyFS supports the predominant HPC I/O workloads and is optimized for bulk-synchronous I/O patterns. Furthermore, UnifyFS provides customizable file system semantics to flexibly adapt its behavior for diverse I/O workloads and storage devices. In this paper, we discuss the unique design goals and architecture of UnifyFS and evaluate its performance on a leadership-class HPC system. In our experimental results, we demonstrate that UnifyFS exhibits excellent scaling performance for write operations and can improve the performance of application checkpoint operations by as much as 3× versus a tuned configuration.
{"title":"UnifyFS: A User-level Shared File System for Unified Access to Distributed Local Storage","authors":"Michael J. Brim, A. Moody, Seung-Hwan Lim, Ross G. Miller, Swen Boehm, Cameron Stanavige, K. Mohror, S. Oral","doi":"10.1109/IPDPS54959.2023.00037","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00037","journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)"},"publicationDate":"2023-05-01"}
The accelerating growth of modern distributed applications with low delivery deadlines leads to a paradigm shift towards the multi-tier computing continuum. However, the geographical dispersion, heterogeneity, and availability of the continuum resources may result in failures and quality-of-service degradation, significantly negating its advantages and lowering users’ satisfaction. In this paper, we propose a proactive application placement (PROS) method that relies on distributed coordination to prevent quality-of-service violations through service-level agreements on the computing continuum. PROS employs a sigmoid function with adaptive weights for the different parameters to predict the service-level agreement assurance of devices based on their past credentials and current capabilities. We evaluate PROS using two application workloads with different traffic stress levels of up to 90 million services on a real testbed with 600 heterogeneous instances deployed over eight geographical locations. The results show that PROS increases the success rate by 7%–33%, reduces the response time by 16%–38%, and increases the deadline satisfaction rate by 19%–42% compared to two related methods. A comprehensive simulation study with 1000 devices and a workload of up to 670 million services confirms the scalability of the results.
{"title":"Proactive SLA-aware Application Placement in the Computing Continuum","authors":"Zahra Najafabadi Samani, Narges Mehran, Dragi Kimovski, R.-C. Prodan","doi":"10.1109/IPDPS54959.2023.00054","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00054","journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)"},"publicationDate":"2023-05-01"}
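The PROS abstract mentions a sigmoid function over weighted parameters that scores how likely a device is to honor a service-level agreement. A minimal sketch of such a score follows. The feature names (past success rate, free CPU fraction), the weight values, and the function name `sla_assurance` are hypothetical, and the adaptive weight update described in the paper is not reproduced here.

```python
import math


def sla_assurance(features, weights, bias=0.0):
    """Squash a weighted sum of device features into a score in (0, 1)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical features: [past success rate, free CPU fraction].
weights = [2.0, 1.0]
reliable_device = sla_assurance([0.9, 0.8], weights)
unreliable_device = sla_assurance([0.2, 0.1], weights)
neutral = sla_assurance([0.0, 0.0], weights)
```

With positive weights the score grows monotonically with each feature, so a placement policy could rank candidate devices by this value and pick the highest one.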
Multi-tenancy in public clouds may lead to co-location interference on shared resources, which possibly results in performance degradation of cloud applications. Cloud providers want to know when such events happen and how serious the degradation is, to perform interference-aware migrations and alleviate the problem. However, virtual machines (VMs) in Infrastructure-as-a-Service public clouds are black boxes to providers, where application-level performance information cannot be acquired. This makes performance monitoring intensely challenging as cloud providers can only rely on low-level metrics such as CPU usage and hardware counters. We propose a novel machine learning framework, Alioth, to monitor the performance degradation of cloud applications. To feed the data-hungry models, we first design interference generators and conduct comprehensive co-location experiments on a testbed to build the Alioth dataset, which reflects the complexity and dynamicity of real-world scenarios. Then we construct Alioth by (1) augmenting features via recovering low-level metrics under no interference using denoising auto-encoders, (2) devising a transfer learning model based on a domain-adaptation neural network to make models generalize to test cases unseen in offline training, and (3) developing a SHAP explainer to automate feature selection and enhance model interpretability. Experiments show that Alioth achieves an average mean absolute error of 5.29% offline and 10.8% when testing on applications unseen in the training stage, outperforming the baseline methods. Alioth is also robust in signaling quality-of-service violations under dynamicity. Finally, we demonstrate a possible application of Alioth’s interpretability, providing insights to benefit the decision-making of cloud operators. The dataset and code of Alioth have been released on GitHub.
{"title":"Alioth: A Machine Learning Based Interference-Aware Performance Monitor for Multi-Tenancy Applications in Public Cloud","authors":"Tianyao Shi, Yingxuan Yang, Yunlong Cheng, Xiaofeng Gao, Zhen Fang, Yongqiang Yang","doi":"10.1109/IPDPS54959.2023.00095","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00095","journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)"},"publicationDate":"2023-05-01"}
Recently, Mixture-of-Experts (MoE) has become one of the most popular techniques to scale pre-trained models to extraordinarily large sizes. Dynamic activation of experts allows for conditional computation, increasing the number of parameters of neural networks, which is critical for absorbing the vast amounts of knowledge available in many deep learning areas. However, despite the existing system and algorithm optimizations, there are significant challenges to be tackled when it comes to the inefficiencies of communication and memory consumption. In this paper, we present the design and implementation of MPipeMoE, a high-performance library that accelerates MoE training with adaptive and memory-efficient pipeline parallelism. Observing that the MoE training procedure can be divided into multiple independent sub-stages, we design adaptive pipeline parallelism with an online algorithm to configure the granularity of the pipelining. Further, we analyze the memory footprint breakdown of MoE training and identify that activations and temporary buffers are the primary contributors to the overall memory footprint. Toward memory efficiency, we propose memory-reuse strategies that reduce memory requirements by eliminating memory redundancies, and develop an adaptive selection component that determines the optimal strategy at runtime, considering both hardware capacities and model characteristics. We implement MPipeMoE on top of PyTorch and evaluate it with common MoE models in a physical cluster consisting of 8 NVIDIA DGX A100 servers. Compared with the state-of-the-art approach, MPipeMoE achieves up to 2.8× speedup and reduces memory footprint by up to 47% in training large models.
{"title":"MPipeMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism","authors":"Zhenghang Zhang, Donglin Yang, Yaqi Xia, Liang Ding, Dacheng Tao, Xiaobo Zhou, Dazhao Cheng","doi":"10.1109/IPDPS54959.2023.00026","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00026","journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)"},"publicationDate":"2023-05-01"}
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00052
Bo Wang, Anara Kozhokanova, C. Terboven, Matthias S. Müller
The ever-growing power draw in high-performance computing (HPC) clusters and the rising energy costs create a pressing need for energy-efficient computing. Consequently, advanced infrastructure orchestration is required to regulate power dissipation efficiently. In this work, we propose a novel approach for managing power consumption at runtime based on the well-known roofline model and call it Roofline Power (RLP) management. RLP employs rigorously selected but generally available hardware performance events to construct rooflines with minimal overhead. In particular, RLP extends the original roofline model to include the memory access latency metric for the first time. The extension identifies whether execution is bandwidth-, latency-, or compute-bound, and improves the modeling accuracy. We evaluated the RLP model on server-grade CPUs and a GPU with real-world HPC workloads in two scenarios: optimization with and without power capping. Without power capping, RLP reduces the energy-to-solution by up to 22% compared to system default settings, with negligible performance degradation. Under power capping, it accelerates execution by up to 14.7%. In addition, RLP outperforms other state-of-the-art techniques in generality and effectiveness.
{"title":"RLP: Power Management Based on a Latency-Aware Roofline Model","authors":"Bo Wang, Anara Kozhokanova, C. Terboven, Matthias S. Müller","doi":"10.1109/IPDPS54959.2023.00052","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00052","journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)"},"publicationDate":"2023-05-01"}
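The RLP abstract builds on the roofline model, in which attainable performance is the minimum of the compute peak and memory bandwidth times arithmetic intensity. The sketch below adds a third, latency-derived ceiling to classify a kernel as compute-, bandwidth-, or latency-bound. How RLP actually incorporates latency is not stated in the abstract. The Little's-law ceiling used here, and the names `roofline_bound`, `outstanding`, and `line_bytes`, are assumptions made purely for illustration.

```python
def roofline_bound(intensity, peak_flops, peak_bw, latency_s, outstanding, line_bytes=64):
    """Return (limiting factor, attainable FLOP/s) for one kernel.

    intensity:   arithmetic intensity in FLOP per byte moved from memory.
    peak_flops:  compute peak in FLOP/s.
    peak_bw:     peak memory bandwidth in bytes/s.
    latency_s:   average memory access latency in seconds.
    outstanding: memory requests kept in flight concurrently (assumed known).
    """
    # Little's law: bytes in flight divided by latency gives sustained bytes/s.
    latency_bw = outstanding * line_bytes / latency_s
    ceilings = {
        "compute": peak_flops,
        "bandwidth": peak_bw * intensity,
        "latency": latency_bw * intensity,
    }
    limit = min(ceilings, key=ceilings.get)
    return limit, ceilings[limit]


# Made-up machine: 1 TFLOP/s peak, 100 GB/s bandwidth, 100 ns latency.
few_requests = roofline_bound(0.5, 1e12, 100e9, 100e-9, outstanding=10)
many_requests = roofline_bound(0.5, 1e12, 100e9, 100e-9, outstanding=1000)
dense_kernel = roofline_bound(100.0, 1e12, 100e9, 100e-9, outstanding=1000)
```

In this toy setting the same low-intensity kernel moves from latency-bound to bandwidth-bound once enough requests are in flight, which is the kind of distinction a two-ceiling roofline cannot express.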
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00020
Mohamed W. Hassan, A. Dabah, H. Ltaief, Suhaib A. Fahmy
Wireless communication systems rely on aggressive spatial multiplexing Multiple-Input Multiple-Output (MIMO) access points to enhance network throughput. A significant computational hurdle for large MIMO systems is signal detection and decoding, which has exponentially increasing computational complexity as the number of antennas increases. Hence, the feasibility of large MIMO systems depends on suitable implementations of signal decoding schemes. This paper presents an FPGA-based Sphere Decoder (SD) architecture that provides high-performance signal decoding for large MIMO systems, supporting up to 16-QAM modulation. The SD algorithm is refactored to map well to the FPGA architecture using a GEMM-based approach to exploit the parallel computational power of FPGAs. We implement FPGA-specific optimization techniques to improve computational complexity. We show significant improvement in the time to decode the received signal at a bit error rate (BER) below 10⁻². The design is deployed on a Xilinx Alveo U280 FPGA and shows up to a 9× speedup compared to optimized multi-core CPU execution, achieving real-time requirements. Our proposed design reduces power consumption by a geo-mean of 38.1× compared to the CPU implementation, which is important in real-world deployments. We also evaluate our design against alternative approaches on GPU.
{"title":"Signal Detection for Large MIMO Systems Using Sphere Decoding on FPGAs","authors":"Mohamed W. Hassan, A. Dabah, H. Ltaief, Suhaib A. Fahmy","doi":"10.1109/IPDPS54959.2023.00020","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00020","journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)"},"publicationDate":"2023-05-01"}
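For readers unfamiliar with sphere decoding, the following is a textbook depth-first sketch, not the GEMM-based FPGA design from the paper. It assumes a real-valued system whose channel matrix has already been reduced to an upper-triangular `R` (for example by a QR factorization), and it searches for the symbol vector `s` that minimizes the squared distance between `y` and `R s`. The function name and the toy inputs are illustrative only.

```python
def sphere_decode(R, y, symbols):
    """Find s in symbols^n minimizing ||y - R s||^2, with R upper triangular.

    The search walks the tree from the last row to the first and prunes
    any branch whose partial distance already reaches the best one found.
    """
    n = len(y)
    best = {"dist": float("inf"), "s": None}
    s = [0.0] * n

    def search(level, dist):
        if dist >= best["dist"]:
            return  # prune: this branch cannot beat the current best
        if level < 0:
            best["dist"], best["s"] = dist, list(s)
            return
        for sym in symbols:
            s[level] = sym
            # Row `level` only involves s[level:], because R is upper triangular.
            r = y[level] - sum(R[level][k] * s[k] for k in range(level, n))
            search(level - 1, dist + r * r)

    search(n - 1, 0.0)
    return best["s"], best["dist"]


# Noise-free toy example: y = R @ [1, -1] with a two-level real constellation.
R = [[1.0, 0.5], [0.0, 1.0]]
y = [0.5, -1.0]
decoded, distance = sphere_decode(R, y, symbols=[-1.0, 1.0])
```

The pruning step is what keeps the search far below exhaustive enumeration in practice. A 16-QAM system is commonly handled the same way by splitting each complex symbol into two real dimensions with four levels each.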
Federated Learning (FL) emerges as a distributed machine learning paradigm without end-user data transmission, effectively avoiding privacy leakage. Participating devices in FL are usually bandwidth-constrained, and the uplink is much slower than the downlink in wireless networks, which causes a severe uplink communication bottleneck. A prominent direction to alleviate this problem is federated dropout, which drops a fraction of the weights of local models. However, existing federated dropout studies focus on random or ordered dropout and lack theoretical support, resulting in unguaranteed performance. In this paper, we propose Federated learning with Bayesian Inference-based Adaptive Dropout (FedBIAD), which regards weight rows of local models as probability distributions and adaptively drops partial weight rows based on importance indicators correlated with the trend of local training loss. By applying FedBIAD, each client adaptively selects a high-quality dropping pattern with accurate approximations and only transmits parameters of non-dropped weight rows to mitigate uplink costs while improving accuracy. Theoretical analysis demonstrates that the convergence rate of the average generalization error of FedBIAD is minimax optimal up to a squared logarithmic factor. Extensive experiments on image classification and next-word prediction show that compared with status quo approaches, FedBIAD provides 2× uplink reduction with an accuracy increase of up to 2.41% even on non-Independent and Identically Distributed (non-IID) data, which yields up to a 72% decrease in training time.
{"title":"FedBIAD: Communication-Efficient and Accuracy-Guaranteed Federated Learning with Bayesian Inference-Based Adaptive Dropout","authors":"Jingjing Xue, Min Liu, Sheng Sun, Yuwei Wang, Hui Jiang, Xue Jiang","doi":"10.1109/IPDPS54959.2023.00056","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00056","journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)"},"publicationDate":"2023-05-01"}
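The FedBIAD abstract says that each client uploads only the weight rows it does not drop, chosen by an importance indicator. The sketch below shows just the upload-side bookkeeping under simple assumptions: importance scores are given as plain numbers, and the client keeps the top fraction of rows. The Bayesian treatment of rows and the loss-trend-based indicator from the paper are not modeled, and `rows_to_upload` and `keep_ratio` are hypothetical names.

```python
def rows_to_upload(weight_rows, importance, keep_ratio):
    """Select the most important weight rows and return them keyed by row index."""
    keep = max(1, round(keep_ratio * len(weight_rows)))
    # Rank row indices from most to least important.
    ranked = sorted(range(len(weight_rows)), key=lambda i: importance[i], reverse=True)
    kept = sorted(ranked[:keep])
    # Only these (index, row) pairs would be sent on the uplink.
    return {i: weight_rows[i] for i in kept}


weights = [[0.1, 0.2], [0.0, 0.0], [0.3, 0.4], [0.05, 0.01]]
importance = [0.9, 0.1, 0.5, 0.3]  # hypothetical per-row scores
payload = rows_to_upload(weights, importance, keep_ratio=0.5)
```

Keeping half of the rows roughly halves the number of parameters sent, which matches the scale of the 2× uplink reduction the abstract reports, although the row indices add a small overhead.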
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00051
Lanshun Nie, Yuqi Qiu, Fei Meng, Mo Yu, Jing Li
Resource allocation for stream processing graphs on computing devices is critical to the performance of stream processing. Efficient allocations need to balance workload distribution and minimize communication simultaneously and globally. Since this problem is known to be NP-complete, recent machine learning solutions were proposed based on an encoder-decoder framework, which predicts the device assignment of computing nodes sequentially as an approximation. However, for large graphs, these solutions handle long-distance dependencies and global information poorly, resulting in suboptimal predictions. This work proposes a new paradigm to deal with this challenge, which first coarsens the graph and then conducts assignments on the smaller graph with existing graph partitioning methods. Unlike existing graph-coarsening work, we leverage theoretical insights into this resource allocation problem, formulate the coarsening of stream graphs as edge-collapsing predictions, and propose an edge-aware coarsening model. Extensive experiments on various datasets show that our framework significantly improves over existing learning-based and heuristic-based baselines, with up to 56% relative improvement on large graphs.
{"title":"Generalizable Reinforcement Learning-Based Coarsening Model for Resource Allocation over Large and Diverse Stream Processing Graphs","authors":"Lanshun Nie, Yuqi Qiu, Fei Meng, Mo Yu, Jing Li","doi":"10.1109/IPDPS54959.2023.00051","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00051","journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)"},"publicationDate":"2023-05-01"}
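The coarsening step in the stream-graph abstract can be pictured as collapsing a chosen set of edges so that their endpoints become one coarse node. The sketch below shows that mechanical step with a union-find structure. In the paper a learned model predicts which edges to collapse, whereas here the `collapse` list is supplied by hand, edges are treated as undirected, and the function name `coarsen` is hypothetical.

```python
def coarsen(num_nodes, edges, collapse):
    """Merge the endpoints of each edge in `collapse` and aggregate edge weights.

    edges:    dict mapping (u, v) to a communication weight.
    collapse: list of (u, v) pairs whose endpoints are merged.
    Returns (mapping, coarse_edges), where mapping[i] is the coarse node of i.
    """
    parent = list(range(num_nodes))

    def find(x):
        # Follow parent links to the representative, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in collapse:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[max(ru, rv)] = min(ru, rv)

    # Relabel the representatives as consecutive coarse node ids.
    roots = sorted({find(i) for i in range(num_nodes)})
    label = {r: k for k, r in enumerate(roots)}
    mapping = [label[find(i)] for i in range(num_nodes)]

    # Edges inside a coarse node vanish; parallel edges add up their weights.
    coarse_edges = {}
    for (u, v), w in edges.items():
        a, b = mapping[u], mapping[v]
        if a != b:
            key = (min(a, b), max(a, b))
            coarse_edges[key] = coarse_edges.get(key, 0) + w
    return mapping, coarse_edges


# Chain 0-1-2-3: collapsing the two heavy edges leaves one light edge to cut.
edges = {(0, 1): 5, (1, 2): 1, (2, 3): 4}
mapping, coarse_edges = coarsen(4, edges, collapse=[(0, 1), (2, 3)])
```

Collapsing the heavy edges first means that whatever partitioner runs on the coarse graph can only cut the remaining light edge, which is the intuition behind treating coarsening as a communication-aware preprocessing step.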