Andrew Prout, Albert Reuther, Michael Houle, Michael Jones, Peter Michaleas, LaToya Anderson, William Arcand, Bill Bergeron, David Bestor, Alex Bonn, Daniel Burrill, Chansup Byun, Vijay Gadepally, Matthew Hubbell, Hayden Jananthan, Piotr Luszczek, Lauren Milechin, Guillermo Morales, Julie Mullen, Antonio Rosa, Charles Yee, Jeremy Kepner
HPC systems used for research run a wide variety of software and workflows. This software is often written or modified by users to meet the needs of their research projects, and it is rarely built with security in mind. In this paper we explore several of the key techniques that the MIT Lincoln Laboratory Supercomputing Center has deployed on its systems to manage the security implications of these workflows by providing enforced separation of processes, filesystem access, network traffic, and accelerators, making every user feel as if they are running on a personal HPC.
"HPC with Enhanced User Separation." arXiv - CS - Distributed, Parallel, and Cluster Computing, 2024-09-16. https://doi.org/arxiv-2409.10770
John Augustine, Antonio Cruciani, Iqra Altaf Gillani
We study robust and efficient distributed algorithms for building and maintaining distributed data structures in dynamic Peer-to-Peer (P2P) networks. P2P networks are characterized by a high level of dynamicity with abrupt, heavy node churn (nodes that join and leave the network continuously over time). We present a novel algorithm that builds and maintains, with high probability, a skip list for $\mathrm{poly}(n)$ rounds despite $\mathcal{O}(n/\log n)$ churn per round ($n$ is the stable network size). We assume that the churn is controlled by an oblivious adversary that has complete knowledge and control of which nodes join and leave and at what time, and has unlimited computational power, but is oblivious to the random choices made by the algorithm. Moreover, the maintenance overhead is proportional to the churn rate. Furthermore, the algorithm is scalable in that the messages are small (i.e., at most $\mathrm{polylog}(n)$ bits) and every node sends and receives at most $\mathrm{polylog}(n)$ messages per round. Our algorithm crucially relies on novel distributed and parallel algorithms to merge two $n$-element skip lists and to delete a large subset of items, both in $\mathcal{O}(\log n)$ rounds with high probability. These procedures may be of independent interest due to their elegance and potential applicability in other contexts in distributed data structures. To the best of our knowledge, our work provides the first known fully distributed data structure that provably works under highly dynamic settings (i.e., high churn rates). Furthermore, it is localized (i.e., it does not require any global topological knowledge). Finally, we believe that our framework can be generalized to other distributed and dynamic data structures, including graphs, potentially leading to stable distributed computation despite heavy churn.
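For reference, the data structure being maintained can be illustrated with a minimal sequential skip list. This sketch shows only the probabilistic structure itself, not the paper's distributed construction or its merge/delete procedures:

```python
import random

class _Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level  # forward[i] = next node at level i

class SkipList:
    """Minimal sequential skip list: expected O(log n) search and insert."""
    MAX_LEVEL = 16

    def __init__(self):
        self.head = _Node(None, self.MAX_LEVEL)  # sentinel head node

    def _random_level(self):
        # Geometric level distribution with p = 1/2.
        lvl = 1
        while lvl < self.MAX_LEVEL and random.random() < 0.5:
            lvl += 1
        return lvl

    def insert(self, key):
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        for i in reversed(range(self.MAX_LEVEL)):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node  # last node before `key` at level i
        new = _Node(key, self._random_level())
        for i in range(len(new.forward)):
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def contains(self, key):
        node = self.head
        for i in reversed(range(self.MAX_LEVEL)):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.key == key
```

The multi-level forward pointers are what make the distributed variant attractive: searches and merges can descend levels in parallel, which is the intuition behind the paper's $\mathcal{O}(\log n)$-round merge procedure.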
"Maintaining Distributed Data Structures in Dynamic Peer-to-Peer Networks." arXiv - CS - Distributed, Parallel, and Cluster Computing, 2024-09-16. https://doi.org/arxiv-2409.10235
Hao Jian Huang, Bekzod Iskandarov, Mizanur Rahman, Hakan T. Otal, M. Abdullah Canbaz
This paper presents the design and implementation of a Federated Learning (FL) testbed, focusing on its application in cybersecurity and evaluating its resilience against poisoning attacks. Federated Learning allows multiple clients to collaboratively train a global model while keeping their data decentralized, addressing critical needs for data privacy and security, particularly in sensitive fields like cybersecurity. Our testbed, built using the Flower framework, facilitates experimentation with various FL frameworks, assessing their performance, scalability, and ease of integration. Through a case study on federated intrusion detection systems, we demonstrate the testbed's capabilities in detecting anomalies and securing critical infrastructure without exposing sensitive network data. Comprehensive poisoning tests, targeting both model and data integrity, evaluate the system's robustness under adversarial conditions. Our results show that while federated learning enhances data privacy and distributed learning, it remains vulnerable to poisoning attacks, which must be mitigated to ensure its reliability in real-world applications.
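The aggregation step such a testbed exercises can be illustrated with a plain FedAvg weighted average (a generic sketch, not the paper's actual Flower configuration):

```python
def fedavg(client_weights, client_sizes):
    """Size-weighted average of client parameter vectors (flat lists
    of floats) -- the standard FedAvg aggregation step."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]
```

Model-poisoning attacks work precisely by shifting this average with manipulated client updates, which is why robust aggregation rules are a common mitigation.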
"Federated Learning in Adversarial Environments: Testbed Design and Poisoning Resilience in Cybersecurity." arXiv - CS - Distributed, Parallel, and Cluster Computing, 2024-09-15. https://doi.org/arxiv-2409.09794
This paper presents an approach to authoring a textbook titled Interactive OpenMP Programming with the assistance of Large Language Models (LLMs). The writing process utilized state-of-the-art LLMs, including Gemini Pro 1.5, Claude 3, and ChatGPT-4, to generate the initial structure and outline of the book, as well as the initial content for specific chapters. This content included detailed descriptions of individual OpenMP constructs and practical programming examples. The outline and content then underwent extensive manual revision to meet our goals for the book. In this paper, we report our findings about the capabilities and limitations of these LLMs. We address critical questions concerning the necessity of textbook resources and the effectiveness of LLMs in creating fundamental and practical programming content. Our findings suggest that while LLMs offer significant advantages in generating textbook content, they require careful integration with traditional educational methodologies to ensure depth, accuracy, and pedagogical effectiveness. The Interactive OpenMP Programming book is developed with the Jupyter Book framework, enabling code in the book to be executed from the web browser, providing instant feedback and a dynamic learning experience that stands in contrast to traditional educational resources. The book represents a significant step towards modernizing programming education, offering insights into practical strategies for generating a textbook with advanced AI tools.
"Developing an Interactive OpenMP Programming Book with Large Language Models." Xinyao Yi, Anjia Wang, Yonghong Yan, Chunhua Liao. arXiv - CS - Distributed, Parallel, and Cluster Computing, 2024-09-14. https://doi.org/arxiv-2409.09296
The increasing complexity of deep learning models and the demand for processing vast amounts of data make the utilization of large-scale distributed systems for efficient training essential. These systems, however, face significant challenges such as communication overhead, hardware limitations, and node failure. This paper investigates various optimization techniques in distributed deep learning, including Elastic Averaging SGD (EASGD) and the second-order method AdaHessian. We propose a dynamic weighting strategy to mitigate the problem of straggler nodes due to failure, enhancing the performance and efficiency of the overall training process. We conduct experiments with different numbers of workers and communication periods to demonstrate improved convergence rates and test performance using our strategy.
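The EASGD update investigated here can be sketched on a toy scalar problem. Hyperparameter names `eta` and `alpha` follow the usual EASGD formulation (gradient step size and elastic coupling strength); the paper's dynamic weighting strategy itself is not shown:

```python
import random

def easgd_toy(num_workers=4, steps=500, eta=0.05, alpha=0.05, seed=0):
    """Elastic Averaging SGD on the toy objective f(x) = (x - 3)^2.
    Each worker takes a gradient step plus an elastic pull toward the
    shared center variable; the center drifts toward the workers."""
    rng = random.Random(seed)
    center = 0.0
    workers = [rng.uniform(-1.0, 1.0) for _ in range(num_workers)]
    for _ in range(steps):
        updated = []
        for x in workers:
            grad = 2.0 * (x - 3.0)  # f'(x) for f(x) = (x - 3)^2
            # Gradient step plus elastic pull toward the center.
            updated.append(x - eta * grad - alpha * (x - center))
        # Center moves toward the (unweighted) average of the workers.
        center += alpha * sum(x - center for x in workers)
        workers = updated
    return center
```

The elastic coupling is what makes stragglers tolerable: workers can drift apart temporarily, and a dynamic weighting of their contributions (as proposed in the paper) can further discount failed or slow nodes.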
"A Dynamic Weighting Strategy to Mitigate Worker Node Failure in Distributed Deep Learning." Yuesheng Xu, Arielle Carr. arXiv - CS - Distributed, Parallel, and Cluster Computing, 2024-09-14. https://doi.org/arxiv-2409.09242
As global climate change intensifies, accurate weather forecasting is increasingly crucial for sectors such as agriculture, energy management, and environmental protection. Traditional methods, which rely on physical and statistical models, often struggle with complex, nonlinear, and time-varying data, underscoring the need for more advanced techniques. This study explores a hybrid CNN-LSTM model to enhance temperature forecasting accuracy for the Delhi region, using historical meteorological data from 1996 to 2017. We employed both direct and indirect methods, including comprehensive data preprocessing and exploratory analysis, to construct and train our model. The CNN component effectively extracts spatial features, while the LSTM captures temporal dependencies, leading to improved prediction accuracy. Experimental results indicate that the CNN-LSTM model significantly outperforms traditional forecasting methods in terms of both accuracy and stability, with a mean square error (MSE) of 3.26217 and a root mean square error (RMSE) of 1.80615. The hybrid model demonstrates its potential as a robust tool for temperature prediction, offering valuable insights for meteorological forecasting and related fields. Future research should focus on optimizing model architecture, exploring additional feature extraction techniques, and addressing challenges such as overfitting and computational complexity. This approach not only advances temperature forecasting but also provides a foundation for applying deep learning to other time series forecasting tasks.
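For reference, the two reported error metrics are directly related: RMSE is the square root of MSE, and indeed sqrt(3.26217) ≈ 1.80615, so the reported figures are mutually consistent. A minimal sketch of both metrics:

```python
import math

def mse(y_true, y_pred):
    """Mean squared error over paired observations and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the MSE, in the
    same units as the target variable (degrees, for temperature)."""
    return math.sqrt(mse(y_true, y_pred))
```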
"Weather Prediction Using CNN-LSTM for Time Series Analysis: A Case Study on Delhi Temperature Data." Bangyu Li, Yang Qian. arXiv - CS - Distributed, Parallel, and Cluster Computing, 2024-09-14. https://doi.org/arxiv-2409.09414
S. Kawa Atapour, S. Jamal SeyedMohammadi, S. Mohammad Sheikholeslami, Jamshid Abouei, Konstantinos N. Plataniotis, Arash Mohammadi
Recently, pre-trained Foundation Models (FMs) have been combined with Federated Learning (FL) to improve the training of downstream tasks while preserving privacy. However, deploying FMs over edge networks with resource-constrained Internet of Things (IoT) devices is under-explored. This paper proposes a novel framework, Federated Distilling knowledge to Prompt (FedD2P), for leveraging the robust representation abilities of a vision-language FM without deploying it locally on edge devices. The framework distills the aggregated knowledge of IoT devices to a prompt generator to efficiently adapt the frozen FM for downstream tasks. To eliminate the dependency on a public dataset, our framework leverages per-class local knowledge from IoT devices and linguistic descriptions of classes to train the prompt generator. Our experiments on the diverse image classification datasets CIFAR, OxfordPets, SVHN, EuroSAT, and DTD show that FedD2P outperforms the baselines in terms of model performance.
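One plausible reading of "per-class local knowledge" is a per-class summary of each device's outputs, aggregated across devices. The sketch below simply averages per-class logit vectors and is purely illustrative; the abstract does not specify FedD2P's actual distillation objective at this level of detail:

```python
def aggregate_class_knowledge(device_knowledge):
    """Average per-class logit vectors across devices.

    device_knowledge: list of dicts, one per device, each mapping a
    class name to that device's logit vector for the class."""
    classes = device_knowledge[0].keys()
    n = len(device_knowledge)
    return {
        c: [
            sum(dev[c][i] for dev in device_knowledge) / n
            for i in range(len(device_knowledge[0][c]))
        ]
        for c in classes
    }
```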
"Leveraging Foundation Models for Efficient Federated Learning in Resource-restricted Edge Networks." arXiv - CS - Distributed, Parallel, and Cluster Computing, 2024-09-14. https://doi.org/arxiv-2409.09273
Federated learning is a distributed learning paradigm in which multiple mobile clients train a global model while keeping data local. These mobile clients can have varying amounts of available memory and network bandwidth. However, how to make maximal use of this available memory and bandwidth to achieve the best global model performance remains an open challenge. In this paper, we propose assigning each client a subset of the global model, with different layers and different channels on each layer. To realize this, we design a constrained model search process with early stopping to efficiently find models in such a very large space, and a data-free knowledge distillation mechanism to improve global model performance when aggregating models of such different structures. For a fair and reproducible comparison between different solutions, we develop a new system that can directly allocate different memory and bandwidth to each client according to memory and bandwidth logs collected on mobile devices. The evaluation shows that, compared to existing state-of-the-art system-heterogeneous federated learning methods, our solution increases accuracy by 2.43% to 15.81% and utilizes 5% to 40% more memory and bandwidth with negligible extra running time, under different amounts of available memory and bandwidth, non-i.i.d. datasets, and image and text tasks.
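The idea of assigning each client a capacity-dependent subset of the global model can be illustrated with a simple width-scaling heuristic. This is a hypothetical stand-in for the paper's constrained model search, which explores a far richer space of per-layer structures:

```python
import math

def assign_submodel(layer_channels, capacity_fraction):
    """Keep the first ceil(p * c) channels of each layer for a client
    whose device can hold roughly a fraction p of the full model.
    Always keep at least one channel so every layer stays functional."""
    return [max(1, math.ceil(capacity_fraction * c)) for c in layer_channels]
```

Aggregating submodels of different widths is exactly why the paper needs a data-free distillation step: a plain coordinate-wise average is ill-defined when clients hold different slices of the parameter tensor.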
"Exploring System-Heterogeneous Federated Learning with Dynamic Model Selection." Dixi Yao. arXiv - CS - Distributed, Parallel, and Cluster Computing, 2024-09-13. https://doi.org/arxiv-2409.08858
Andreas Plesner, Hans Henrik Brandenborg Sørensen, Søren Hauberg
Bessel functions are critical in scientific computing for applications such as machine learning, protein structure modeling, and robotics. However, currently available routines lack precision or fail for certain input ranges, such as when the order $v$ is large, and GPU-specific implementations are limited. We address the precision limitations of current numerical implementations while dramatically improving the runtime. We propose two novel algorithms for computing the logarithm of modified Bessel functions of the first and second kinds by computing intermediate values on a logarithmic scale. Our algorithms are robust and never suffer from underflow or overflow, while achieving relative errors on the order of machine precision, even for inputs where existing libraries fail. In C++/CUDA, our algorithms have median and maximum speedups of 45x and 6150x on GPU and 17x and 3403x on CPU, respectively, over the ranges of inputs and third-party libraries tested. Compared to SciPy, the algorithms have median and maximum speedups of 77x and 300x on GPU and 35x and 98x on CPU, respectively, over the tested inputs. The ability to robustly compute a solution and the low relative errors allow us to fit von Mises-Fisher (vMF) distributions to high-dimensional neural network features. This is relevant, e.g., for uncertainty quantification in metric learning. We obtain image feature data by processing CIFAR10 training images with the convolutional layers of a pre-trained ResNet50. We successfully fit vMF distributions to 2048-, 8192-, and 32768-dimensional image feature data using our algorithms. Our approach provides fast and accurate results, while existing implementations in SciPy and mpmath fail to fit successfully. Our approach is readily implementable on GPUs, and we provide a fast open-source implementation alongside this paper.
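The core trick, evaluating intermediate values on a logarithmic scale, can be sketched in a few lines for $\log I_v(x)$ using the power series of the modified Bessel function of the first kind and a log-sum-exp reduction. This is a scalar reference sketch adequate for moderate inputs, not the paper's GPU algorithms:

```python
import math

def log_bessel_i(v, x, terms=200):
    """log(I_v(x)) for order v >= 0 and x > 0, via the power series
        I_v(x) = sum_k (x/2)^(2k+v) / (k! * Gamma(k+v+1)).
    Every term is computed on a log scale with lgamma, and the terms
    are combined with a numerically stable log-sum-exp, so large v
    never underflows to zero before the log is taken. A fixed number
    of terms suffices for moderate x; very large x would need more
    terms or asymptotic expansions."""
    log_half_x = math.log(x / 2.0)
    logs = [
        (2 * k + v) * log_half_x - math.lgamma(k + 1) - math.lgamma(k + v + 1)
        for k in range(terms)
    ]
    m = max(logs)  # log-sum-exp pivot for stability
    return m + math.log(sum(math.exp(t - m) for t in logs))
```

For example, `I_100(1)` is far below the smallest double (its log is around -433), so computing it directly would underflow, while the log-scale version returns a finite, accurate value.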
{"title":"Accurate Computation of the Logarithm of Modified Bessel Functions on GPUs","authors":"Andreas Plesner, Hans Henrik Brandenborg Sørensen, Søren Hauberg","doi":"arxiv-2409.08729","DOIUrl":"https://doi.org/arxiv-2409.08729","url":null,"abstract":"Bessel functions are critical in scientific computing for applications such as machine learning, protein structure modeling, and robotics. However, currently available routines lack precision or fail for certain input ranges, such as when the order $v$ is large, and GPU-specific implementations are limited. We address the precision limitations of current numerical implementations while dramatically improving the runtime. We propose two novel algorithms for computing the logarithm of modified Bessel functions of the first and second kinds by computing intermediate values on a logarithmic scale. Our algorithms are robust and never have issues with underflows or overflows while having relative errors on the order of machine precision, even for inputs where existing libraries fail. In C++/CUDA, our algorithms have median and maximum speedups of 45x and 6150x for GPU and 17x and 3403x for CPU, respectively, over the ranges of inputs and third-party libraries tested. Compared to SciPy, the algorithms have median and maximum speedups of 77x and 300x for GPU and 35x and 98x for CPU, respectively, over the tested inputs. The ability to robustly compute a solution and the low relative errors allow us to fit von Mises-Fisher, vMF, distributions to high-dimensional neural network features. This is, e.g., relevant for uncertainty quantification in metric learning. We obtain image feature data by processing CIFAR10 training images with the convolutional layers of a pre-trained ResNet50. We successfully fit vMF distributions to 2048-, 8192-, and 32768-dimensional image feature data using our algorithms. Our approach provides fast and accurate results while existing implementations in SciPy and mpmath fail to fit successfully. Our approach is readily implementable on GPUs, and we provide a fast open-source implementation alongside this paper.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"32 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142263549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
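The log-scale technique the abstract describes can be sketched in a few lines: keep every term of the power series of $I_v(x)$ in log space and combine them with log-sum-exp, so that no intermediate value ever underflows or overflows. This is a simplified illustration of the general idea, not the paper's algorithm; the function name `log_iv_series` and the fixed term count are our own illustrative choices.

```python
import numpy as np
from scipy.special import gammaln, logsumexp

def log_iv_series(v, x, n_terms=200):
    """Log of the modified Bessel function I_v(x) via its power series,
    with every term kept on a log scale.  Illustrative sketch only;
    assumes v >= 0, x > 0, and enough terms for convergence."""
    k = np.arange(n_terms)
    # log of term k: (v + 2k) * log(x/2) - log(k!) - log(Gamma(v + k + 1))
    log_terms = (v + 2 * k) * np.log(x / 2.0) - gammaln(k + 1) - gammaln(v + k + 1)
    # log-sum-exp combines the terms without ever leaving log space
    return logsumexp(log_terms)
```

For moderate inputs this agrees with `np.log(scipy.special.iv(v, x))`, and for a large order such as `v = 500, x = 10`, where `scipy.special.iv` underflows to zero, the sketch still returns a finite (large negative) log value.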
This work presents WarmSwap, a novel provider-side cold-start optimization for serverless computing. The optimization reduces cold-start time when booting and loading dependencies at runtime inside a function container. Previous approaches to cold-start optimization tend to fall into two categories: optimizing the serverless computing infrastructure to benefit all serverless functions, or applying function-specific tuning to individual serverless functions. In contrast, WarmSwap offers a broad middle ground that optimizes entire categories of serverless functions. WarmSwap eliminates the need to initialize middleware or software dependencies when launching a new serverless container by migrating a pre-initialized live dependency image to the new function instance. WarmSwap respects the provider's cache constraints, as a single pre-warmed dependency image in the cache is shared among all serverless functions requiring that software dependency image. WarmSwap has been tested on seven representative functions from FunctionBench, chosen to enable comparison with previous work. In those tests, WarmSwap accelerates cold-start executions of serverless functions with large dependency requirements by factors ranging from 1.2 to 2.2.
{"title":"WarmSwap: Sharing Dependencies for Accelerating Cold Starts in Serverless Functions","authors":"Rui Li, Devesh Tiwari, Gene Cooperman","doi":"arxiv-2409.09202","DOIUrl":"https://doi.org/arxiv-2409.09202","url":null,"abstract":"This work presents WarmSwap, a novel provider-side cold-start optimization for serverless computing. This optimization reduces cold-start time when booting and loading dependencies at runtime inside a function container. Previous approaches to the optimization of cold starts tend to fall into two categories: optimizing the infrastructure of serverless computing to benefit all serverless functions; or function-specific tuning for individual serverless functions. In contrast, WarmSwap offers a broad middle ground, which optimizes entire categories of serverless functions. WarmSwap eliminates the need to initialize middleware or software dependencies when launching a new serverless container, by migrating a pre-initialized live dependency image to the new function instance. WarmSwap respects the provider's cache constraints, as a single pre-warmed dependency image in the cache is shared among all serverless functions requiring that software dependency image. WarmSwap has been tested on seven representative functions from FunctionBench. The functions are chosen to compare with previous work. In those tests, WarmSwap accelerates cold-start executions for those serverless functions with large dependency requirements by a factor ranging from 1.2 to 2.2.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142263544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
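The sharing scheme the WarmSwap abstract describes can be illustrated with a minimal provider-side cache sketch: one pre-initialized dependency image per dependency set, reused by every function that needs that set. This is a hypothetical Python illustration of the sharing idea only; `DependencyImageCache`, `launch_function`, and the string key scheme are our own names, and the real system migrates live container images rather than Python objects.

```python
class DependencyImageCache:
    """Provider-side cache holding one pre-initialized dependency image
    per dependency set, shared by all functions needing that set."""

    def __init__(self):
        self._images = {}  # dependency-set key -> pre-initialized image

    def get_or_build(self, dep_key, build_image):
        if dep_key not in self._images:
            # Cache miss: pay full dependency initialization once per set.
            self._images[dep_key] = build_image()
        # Cache hit: reuse the single shared pre-warmed image.
        return self._images[dep_key]


def launch_function(cache, dep_key, build_image, handler):
    """Cold start: obtain the pre-initialized dependency image from the
    shared cache (skipping dependency initialization) and run the handler."""
    image = cache.get_or_build(dep_key, build_image)
    return handler(image)
```

A second launch with the same dependency key reuses the cached image, so initialization cost is paid once per dependency set rather than once per function, which mirrors the cache-constraint argument in the abstract.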