Towards SLO-Optimized LLM Serving via Automatic Inference Engine Tuning
Ke Cheng, Zhi Wang, Wen Hu, Tiannuo Yang, Jianguo Li, Sheng Zhang. arXiv:2408.04323, 8 Aug 2024.

A service-level objective (SLO) is a target performance metric for a service that cloud vendors aim to meet. Delivering optimized SLOs can enhance user satisfaction and improve the competitiveness of cloud vendors. As large language models (LLMs) gain popularity across various fields, it is important to optimize SLOs for LLM inference services. In this paper, we observe that adjusting the parameters of LLM inference engines can improve service performance, and that the optimal parameter configuration differs from service to service. We therefore propose SCOOT, an automatic performance tuning system that optimizes SLOs for each LLM inference service by tuning the parameters of the inference engine. We first propose a generalized formulation of the tuning problem that handles various objectives and constraints between parameters, and SCOOT exploits Bayesian optimization (BO) to solve the problem via exploration and exploitation. Moreover, SCOOT adopts a random forest to learn hidden constraints during tuning and thereby avoid invalid exploration. To improve tuning efficiency, SCOOT suggests multiple configurations in parallel. Extensive experiments demonstrate that SCOOT significantly outperforms existing tuning techniques in SLO optimization while greatly improving tuning efficiency.
Addressing Model and Data Heterogeneity in Multimodal Large Language Model Training
Zili Zhang, Yinmin Zhong, Ranchen Ming, Hanpeng Hu, Jianjian Sun, Zheng Ge, Yibo Zhu, Xin Jin. arXiv:2408.04275, 8 Aug 2024.

Multimodal large language models (LLMs) have demonstrated significant potential in a wide range of AI applications. Yet training multimodal LLMs suffers from low efficiency and scalability due to the inherent model heterogeneity and data heterogeneity across modalities. We present MMScale, an efficient and adaptive framework for training multimodal LLMs on large-scale clusters. MMScale exploits the system characteristics of multimodal LLM training to achieve high efficiency and scalability. At its core are adaptive resource allocation and data-aware reordering techniques that address model heterogeneity and data heterogeneity, respectively. We also tailor system optimizations for multimodal LLM training to offload certain operations from GPU training. We evaluate MMScale across different sizes of multimodal LLMs on a large-scale production cluster with thousands of GPUs. The experimental results show that MMScale achieves 54.7% Model FLOPs Utilization (MFU) when training a 72B multimodal LLM on 1172 GPUs and outperforms Megatron-LM by up to 2.2x in throughput. The ablation study shows that the main techniques of MMScale are both effective and lightweight.
PeerSwap: A Peer-Sampler with Randomness Guarantees
Rachid Guerraoui, Anne-Marie Kermarrec, Anastasiia Kucherenko, Rafael Pinot, Marijn de Vos. arXiv:2408.03829, 7 Aug 2024.

The ability of a peer-to-peer (P2P) system to effectively host decentralized applications often relies on the availability of a peer-sampling service, which provides each participant with a random sample of other peers. Despite the practical effectiveness of existing peer samplers, their ability to produce random samples within a reasonable time frame remains poorly understood from a theoretical standpoint. This paper contributes to bridging this gap by introducing PeerSwap, a peer-sampling protocol with provable randomness guarantees. We establish execution time bounds for PeerSwap, demonstrating its ability to scale effectively with the network size. We prove that PeerSwap maintains the fixed structure of the communication graph while allowing sequential peer position swaps within this graph. We do so by showing that PeerSwap is a specific instance of an interchange process, a renowned model for particle movement analysis. Leveraging this mapping, we derive execution time bounds, expressed as a function of the network size N. Depending on the network structure, this time can be as low as a polylogarithmic function of N, highlighting the efficiency of PeerSwap. We implement PeerSwap and conduct numerical evaluations using regular graphs with varying connectivity and containing up to 32768 (2^15) peers. Our evaluation demonstrates that PeerSwap quickly provides peers with uniform random samples of other peers.
Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation
Weiqi Feng, Yangrui Chen, Shaoyu Wang, Yanghua Peng, Haibin Lin, Minlan Yu. arXiv:2408.03505, 7 Aug 2024.

Multimodal large language models (MLLMs) have extended the success of large language models (LLMs) to multiple data types, such as image, text, and audio, achieving significant performance in various domains, including multimodal translation, visual question answering, and content generation. Nonetheless, existing systems are inefficient at training MLLMs due to substantial GPU bubbles caused by the heterogeneous modality models and complex data dependencies in 3D parallelism. This paper proposes Optimus, a distributed MLLM training system that reduces end-to-end MLLM training time. Optimus is based on our principled analysis that scheduling the encoder computation within the LLM bubbles can reduce bubbles in MLLM training. To make scheduling encoder computation possible for all GPUs, Optimus searches for separate parallel plans for the encoder and the LLM, and adopts a bubble scheduling algorithm that exploits LLM bubbles without breaking the original data dependencies in the MLLM architecture. We further decompose encoder layer computation into a series of kernels and analyze the common bubble pattern of 3D parallelism to carefully optimize sub-millisecond bubble scheduling, minimizing overall training time. Our experiments in a production cluster show that Optimus accelerates MLLM training by 20.5%-21.3% with ViT-22B and GPT-175B models on 3072 GPUs compared to baselines.
A Blockchain-based Reliable Federated Meta-learning for Metaverse: A Dual Game Framework
Emna Baccour, Aiman Erbad, Amr Mohamed, Mounir Hamdi, Mohsen Guizani. arXiv:2408.03694, 7 Aug 2024.

The metaverse, envisioned as the next digital frontier for avatar-based virtual interaction, involves high-performance models. In this dynamic environment, users' tasks frequently shift, requiring fast model personalization despite limited data. This evolution consumes extensive resources and requires vast data volumes. To address this, meta-learning emerges as an invaluable tool for metaverse users, with federated meta-learning (FML) offering even more tailored solutions owing to its adaptive capabilities. However, the metaverse is characterized by user heterogeneity, with diverse data structures, varied tasks, and uneven sample sizes, potentially undermining global training outcomes due to statistical differences. Given this, there is an urgent need for smart coalition formation that accounts for these disparities. This paper introduces a dual game-theoretic framework for metaverse services involving meta-learners as workers to manage FML. A blockchain-based cooperative coalition formation game is crafted, grounded in a reputation metric, user similarity, and incentives. We also introduce a novel reputation system based on users' historical contributions and potential contributions to present tasks, leveraging correlations between past and new tasks. Finally, a Stackelberg game-based incentive mechanism is presented to attract reliable workers to participate in meta-learning, minimizing users' energy costs, increasing payoffs, boosting FML efficacy, and improving metaverse utility. Results show that our dual game framework outperforms best-effort, random, and non-uniform clustering schemes, improving training performance by up to 10%, cutting completion times by as much as 30%, enhancing metaverse utility by more than 25%, and offering up to a 5% boost in training efficiency over non-blockchain systems, effectively countering misbehaving users.
The State of FaaS: An analysis of public Functions-as-a-Service providers
Nnamdi Ekwe-Ekwe, Lucas Amos. arXiv:2408.03021, 6 Aug 2024.

Serverless computing is a growing and maturing field that is the focus of much research, industry interest, and adoption. Previous works exploring Functions-as-a-Service (FaaS) providers have focused primarily on the most well-known providers (AWS Lambda, Google Cloud Functions, and Microsoft Azure Functions) without exploring other providers in similar detail. In this work, we conduct the first detailed review of ten currently publicly available FaaS platforms, exploring everything from their history to their features, pricing, and place within the overall public FaaS landscape, before making a number of observations on the state of FaaS.
Reinforcement Learning based Workflow Scheduling in Cloud and Edge Computing Environments: A Taxonomy, Review and Future Directions
Amanda Jayanetti, Saman Halgamuge, Rajkumar Buyya. arXiv:2408.02938, 6 Aug 2024.

Deep Reinforcement Learning (DRL) techniques have been successfully applied to complex decision-making and control tasks in multiple fields, including robotics, autonomous driving, healthcare, and natural language processing. The ability of DRL agents to learn from experience and use real-time data for decision making makes DRL an ideal candidate for dealing with the complexities of workflow scheduling in highly dynamic cloud and edge computing environments. Despite its benefits, applying DRL poses multiple challenges, including multi-objectivity, the curse of dimensionality, partial observability, and multi-agent coordination. In this paper, we comprehensively analyze the challenges and opportunities associated with the design and implementation of DRL-oriented solutions for workflow scheduling in cloud and edge computing environments. Based on the identified characteristics, we propose a taxonomy of workflow scheduling with DRL. We map the reviewed works onto the taxonomy to identify their strengths and weaknesses. Based on this taxonomy-driven analysis, we propose future research directions for the field.
A Deep Reinforcement Learning Approach for Cost Optimized Workflow Scheduling in Cloud Computing Environments
Amanda Jayanetti, Saman Halgamuge, Rajkumar Buyya. arXiv:2408.02926, 6 Aug 2024.

Cost optimization is a common goal of workflow schedulers operating in cloud computing environments. The use of spot instances is one means of achieving this goal, as cloud providers offer them at discounted prices compared to their on-demand counterparts in exchange for reduced reliability: spot instances are interrupted when the spare computing capacity used to provision them is needed back owing to demand variations. Moreover, spot prices are not fixed, since pricing depends on long-term supply and demand. The possibility of interruptions and price variations adds a layer of uncertainty to the general problem of workflow scheduling across cloud computing environments. These challenges must be addressed efficiently to realize the cost savings achievable with spot instances without compromising the underlying business requirements. To this end, we use Deep Reinforcement Learning to develop an autonomous agent capable of scheduling workflows cost-efficiently by using an intelligent mix of spot and on-demand instances. The proposed solution is implemented in the open-source, container-native Argo workflow engine, which is widely used for executing industrial workflows. The experimental results demonstrate that the proposed scheduling method outperforms the current benchmarks.
Enabling Practical Transparent Checkpointing for MPI: A Topological Sort Approach
Yao Xu, Gene Cooperman. arXiv:2408.02218, 5 Aug 2024.

MPI is the de facto standard for parallel computing on a cluster of computers. Checkpointing is an important component of any strategy for software resilience and for long-running jobs that must be executed by chaining together time-bounded resource allocations. This work solves an old problem: a practical and general algorithm for transparent checkpointing of MPI that is both efficient and compatible with most of the latest network software. Transparent checkpointing is attractive due to its generality and ease of use for most MPI application developers. Earlier efforts at transparent checkpointing for MPI, one decade ago, suffered from two difficult problems: (i) reliance on a specific MPI implementation tied to a specific network technology; and (ii) failure to demonstrate sufficiently low runtime overhead. Problem (i) (network dependence) was solved in 2019 by MANA's introduction of split processes. Problem (ii) (efficient runtime overhead) is solved in this work. This paper introduces an approach that avoids these limitations, employing a novel topological sort to algorithmically determine a safe future synchronization point. The algorithm is valid for both blocking and non-blocking collective communication in MPI. We demonstrate the efficacy and scalability of our approach through micro-benchmarks and a set of five real-world MPI applications, notably including the widely used VASP (Vienna Ab initio Simulation Package), which is responsible for 11% of the workload on the Perlmutter supercomputer at Lawrence Berkeley National Laboratory. VASP was previously cited as a special challenge for checkpointing, in part due to its multi-algorithm codes.
Asynchronous Latency and Fast Atomic Snapshot
João Paulo Bezerra, Luciano Freitas, Petr Kuznetsov. arXiv:2408.02562, 5 Aug 2024.

The original goal of this paper was a novel, fast atomic-snapshot protocol for asynchronous message-passing systems. In the process of defining what fast means exactly, we faced a number of interesting issues that arise when conventional time metrics are applied to asynchronous implementations. We discovered some gaps in the latency claims made in earlier work on snapshot algorithms, which hamper their comparative time-complexity analysis. We then came up with a new unifying time-complexity analysis that captures the latency of an operation in an asynchronous, long-lived implementation, which allowed us to formally grasp the latency improvements of our solution with respect to state-of-the-art protocols: optimal latency in fault-free runs without contention, short constant latency in fault-free runs with contention, worst-case latency proportional to the number of failures, and constant, close-to-optimal amortized latency.