Towards SLO-Optimized LLM Serving via Automatic Inference Engine Tuning
Ke Cheng, Zhi Wang, Wen Hu, Tiannuo Yang, Jianguo Li, Sheng Zhang. arXiv:2408.04323, 8 Aug 2024.

A service-level objective (SLO) is a target performance metric for a service that cloud vendors aim to meet. Delivering optimized SLOs can enhance user satisfaction and improve the competitiveness of cloud vendors. As large language models (LLMs) gain popularity across various fields, it is important to optimize SLOs for LLM inference services. In this paper, we observe that adjusting the parameters of LLM inference engines can improve service performance, and that the optimal parameter configuration differs from service to service. We therefore propose SCOOT, an automatic performance tuning system that optimizes SLOs for each LLM inference service by tuning the parameters of the inference engine. We first propose a generalized formulation of the tuning problem that handles various objectives and constraints between parameters, and SCOOT exploits Bayesian optimization (BO) to solve the problem via exploration and exploitation. Moreover, SCOOT adopts a random forest to learn hidden constraints during tuning and thereby avoid invalid exploration. To improve tuning efficiency, SCOOT suggests multiple configurations in parallel. Extensive experiments demonstrate that SCOOT significantly outperforms existing tuning techniques in SLO optimization while greatly improving tuning efficiency.
Addressing Model and Data Heterogeneity in Multimodal Large Language Model Training
Zili Zhang, Yinmin Zhong, Ranchen Ming, Hanpeng Hu, Jianjian Sun, Zheng Ge, Yibo Zhu, Xin Jin. arXiv:2408.04275, 8 Aug 2024.

Multimodal large language models (LLMs) have demonstrated significant potential in a wide range of AI applications. Yet training multimodal LLMs suffers from low efficiency and scalability due to the inherent model heterogeneity and data heterogeneity across modalities. We present MMScale, an efficient and adaptive framework for training multimodal LLMs on large-scale clusters. MMScale exploits the system characteristics of multimodal LLM training to achieve high efficiency and scalability. At its core are adaptive resource allocation and data-aware reordering techniques that address model heterogeneity and data heterogeneity, respectively. We also tailor system optimizations for multimodal LLM training to offload certain operations from GPU training. We evaluate MMScale across different sizes of multimodal LLMs on a large-scale production cluster with thousands of GPUs. The experimental results show that MMScale achieves 54.7% Model FLOPs Utilization (MFU) when training a 72B multimodal LLM on 1172 GPUs and outperforms Megatron-LM by up to 2.2x in throughput. The ablation study shows that the main techniques of MMScale are both effective and lightweight.
PeerSwap: A Peer-Sampler with Randomness Guarantees
Rachid Guerraoui, Anne-Marie Kermarrec, Anastasiia Kucherenko, Rafael Pinot, Marijn de Vos. arXiv:2408.03829, 7 Aug 2024.

The ability of a peer-to-peer (P2P) system to effectively host decentralized applications often relies on the availability of a peer-sampling service, which provides each participant with a random sample of other peers. Despite the practical effectiveness of existing peer samplers, their ability to produce random samples within a reasonable time frame remains poorly understood from a theoretical standpoint. This paper contributes to bridging this gap by introducing PeerSwap, a peer-sampling protocol with provable randomness guarantees. We establish execution time bounds for PeerSwap, demonstrating its ability to scale effectively with the network size. We prove that PeerSwap maintains the fixed structure of the communication graph while allowing sequential peer position swaps within this graph. We do so by showing that PeerSwap is a specific instance of an interchange process, a renowned model for particle movement analysis. Leveraging this mapping, we derive execution time bounds, expressed as a function of the network size N. Depending on the network structure, this time can be as low as a polylogarithmic function of N, highlighting the efficiency of PeerSwap. We implement PeerSwap and conduct numerical evaluations using regular graphs with varying connectivity and containing up to 32768 (2^15) peers. Our evaluation demonstrates that PeerSwap quickly provides peers with uniform random samples of other peers.
Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation
Weiqi Feng, Yangrui Chen, Shaoyu Wang, Yanghua Peng, Haibin Lin, Minlan Yu. arXiv:2408.03505, 7 Aug 2024.

Multimodal large language models (MLLMs) have extended the success of large language models (LLMs) to multiple data types, such as image, text, and audio, achieving significant performance in various domains, including multimodal translation, visual question answering, and content generation. Nonetheless, existing systems are inefficient at training MLLMs due to substantial GPU bubbles caused by the heterogeneous modality models and complex data dependencies in 3D parallelism. This paper proposes Optimus, a distributed MLLM training system that reduces end-to-end MLLM training time. Optimus is based on our principled analysis that scheduling the encoder computation within the LLM bubbles can reduce bubbles in MLLM training. To make scheduling encoder computation possible for all GPUs, Optimus searches for separate parallel plans for the encoder and the LLM, and adopts a bubble scheduling algorithm that exploits LLM bubbles without breaking the original data dependencies in the MLLM architecture. We further decompose encoder layer computation into a series of kernels and analyze the common bubble pattern of 3D parallelism to carefully optimize sub-millisecond bubble scheduling, minimizing overall training time. Our experiments in a production cluster show that Optimus accelerates MLLM training by 20.5%-21.3% with ViT-22B and GPT-175B models on 3072 GPUs compared to baselines.
A Blockchain-based Reliable Federated Meta-learning for Metaverse: A Dual Game Framework
Emna Baccour, Aiman Erbad, Amr Mohamed, Mounir Hamdi, Mohsen Guizani. arXiv:2408.03694, 7 Aug 2024.

The metaverse, envisioned as the next digital frontier for avatar-based virtual interaction, involves high-performance models. In this dynamic environment, users' tasks frequently shift, requiring fast model personalization despite limited data. This evolution consumes extensive resources and requires vast data volumes. To address this, meta-learning emerges as an invaluable tool for metaverse users, with federated meta-learning (FML) offering even more tailored solutions owing to its adaptive capabilities. However, the metaverse is characterized by user heterogeneity, with diverse data structures, varied tasks, and uneven sample sizes, potentially undermining global training outcomes due to statistical differences. Given this, there is an urgent need for smart coalition formation that accounts for these disparities. This paper introduces a dual game-theoretic framework for metaverse services involving meta-learners as workers to manage FML. A blockchain-based cooperative coalition formation game is crafted, grounded in a reputation metric, user similarity, and incentives. We also introduce a novel reputation system based on users' historical contributions and potential contributions to present tasks, leveraging correlations between past and new tasks. Finally, a Stackelberg game-based incentive mechanism is presented to attract reliable workers to participate in meta-learning, minimizing users' energy costs, increasing payoffs, boosting FML efficacy, and improving metaverse utility. Results show that our dual game framework outperforms best-effort, random, and non-uniform clustering schemes, improving training performance by up to 10%, cutting completion times by as much as 30%, enhancing metaverse utility by more than 25%, and offering up to a 5% boost in training efficiency over non-blockchain systems, effectively countering misbehaving users.
The State of FaaS: An analysis of public Functions-as-a-Service providers
Nnamdi Ekwe-Ekwe, Lucas Amos. arXiv:2408.03021, 6 Aug 2024.

Serverless computing is a growing and maturing field that is the focus of much research, industry interest, and adoption. Previous works exploring Functions-as-a-Service (FaaS) providers have focused primarily on the most well-known providers (AWS Lambda, Google Cloud Functions, and Microsoft Azure Functions) without exploring other providers in similar detail. In this work, we conduct the first detailed review of ten currently publicly available FaaS platforms, exploring everything from their history to their features, pricing, and place within the overall public FaaS landscape, before making a number of observations on the state of FaaS.
Reinforcement Learning based Workflow Scheduling in Cloud and Edge Computing Environments: A Taxonomy, Review and Future Directions
Amanda Jayanetti, Saman Halgamuge, Rajkumar Buyya. arXiv:2408.02938, 6 Aug 2024.

Deep Reinforcement Learning (DRL) techniques have been successfully applied to complex decision-making and control tasks in multiple fields, including robotics, autonomous driving, healthcare, and natural language processing. The ability of DRL agents to learn from experience and use real-time data for decision making makes DRL an ideal candidate for dealing with the complexities of workflow scheduling in highly dynamic cloud and edge computing environments. Despite its benefits, applying DRL poses multiple challenges, including multi-objectivity, the curse of dimensionality, partial observability, and multi-agent coordination. In this paper, we comprehensively analyze the challenges and opportunities associated with the design and implementation of DRL-oriented solutions for workflow scheduling in cloud and edge computing environments. Based on the identified characteristics, we propose a taxonomy of workflow scheduling with DRL. We map the reviewed works onto the taxonomy to identify their strengths and weaknesses. Based on this taxonomy-driven analysis, we propose future research directions for the field.
A Deep Reinforcement Learning Approach for Cost Optimized Workflow Scheduling in Cloud Computing Environments
Amanda Jayanetti, Saman Halgamuge, Rajkumar Buyya. arXiv:2408.02926, 6 Aug 2024.

Cost optimization is a common goal of workflow schedulers operating in cloud computing environments. The use of spot instances is one means of achieving this goal, as cloud providers offer them at discounted prices compared to their on-demand counterparts in exchange for reduced reliability: spot instances are interrupted when the spare computing capacity used to provision them is needed back owing to demand variations. Moreover, spot prices are not fixed, since pricing depends on long-term supply and demand. The possibility of interruptions and price variations adds a layer of uncertainty to the general problem of workflow scheduling across cloud computing environments. These challenges must be addressed efficiently to realize the cost savings achievable with spot instances without compromising the underlying business requirements. To this end, we use Deep Reinforcement Learning to develop an autonomous agent capable of scheduling workflows cost-efficiently by using an intelligent mix of spot and on-demand instances. The proposed solution is implemented in the open-source, container-native Argo workflow engine, which is widely used for executing industrial workflows. The experimental results demonstrate that the proposed scheduling method outperforms the current benchmarks.
Enabling Practical Transparent Checkpointing for MPI: A Topological Sort Approach
Yao Xu, Gene Cooperman. arXiv:2408.02218, 5 Aug 2024.

MPI is the de facto standard for parallel computing on a cluster of computers. Checkpointing is an important component of any strategy for software resilience and for long-running jobs that must be executed by chaining together time-bounded resource allocations. This work solves an old problem: a practical and general algorithm for transparent checkpointing of MPI that is both efficient and compatible with most of the latest network software. Transparent checkpointing is attractive due to its generality and ease of use for most MPI application developers. Earlier efforts at transparent checkpointing for MPI, one decade ago, suffered from two difficult problems: (i) reliance on a specific MPI implementation tied to a specific network technology; and (ii) failure to demonstrate sufficiently low runtime overhead. Problem (i) (network dependence) was solved in 2019 by MANA's introduction of split processes. Problem (ii) (efficient runtime overhead) is solved in this work. This paper introduces an approach that avoids these limitations, employing a novel topological sort to algorithmically determine a safe future synchronization point. The algorithm is valid for both blocking and non-blocking collective communication in MPI. We demonstrate the efficacy and scalability of our approach through micro-benchmarks and a set of five real-world MPI applications, notably including the widely used VASP (Vienna Ab initio Simulation Package), which is responsible for 11% of the workload on the Perlmutter supercomputer at Lawrence Berkeley National Laboratory. VASP was previously cited as a special challenge for checkpointing, in part due to its multi-algorithm codes.
Asynchronous Latency and Fast Atomic Snapshot
João Paulo Bezerra, Luciano Freitas, Petr Kuznetsov. arXiv:2408.02562, 5 Aug 2024.

The original goal of this paper was a novel, fast atomic-snapshot protocol for asynchronous message-passing systems. In the process of defining what fast means exactly, we faced a number of interesting issues that arise when conventional time metrics are applied to asynchronous implementations. We discovered some gaps in the latency claims made in earlier work on snapshot algorithms, which hamper their comparative time-complexity analysis. We then came up with a new unifying time-complexity analysis that captures the latency of an operation in an asynchronous, long-lived implementation, which allowed us to formally grasp the latency improvements of our solution with respect to state-of-the-art protocols: optimal latency in fault-free runs without contention, short constant latency in fault-free runs with contention, worst-case latency proportional to the number of failures, and constant, close-to-optimal amortized latency.