Federated learning (FL) enables multiple clients to collaboratively train a global model without sharing their local data. Recent studies have highlighted the vulnerability of FL to Byzantine attacks, where malicious clients send poisoned updates to degrade model performance. Notably, many attacks have been developed targeting specific aggregation rules, whereas various defense mechanisms have been designed for dedicated threat models. This paper studies the resilience of an attack-agnostic FL scenario, where the server lacks prior knowledge of both the attackers' strategies and the number of malicious clients involved. We first introduce a hybrid defense against state-of-the-art attacks. Our goal is to identify a general-purpose aggregation rule that performs well on average while also avoiding worst-case vulnerabilities. By adaptively selecting from available defenses, we demonstrate that the server remains robust even when confronted with a substantial proportion of poisoned updates. To better understand this resilience, we then assess the attackers' capability using a proxy called client heterogeneity. We also emphasize that existing FL defenses should not be regarded as secure, as demonstrated through the newly proposed Trapsetter attack. The proposed attack outperforms other state-of-the-art attacks, reducing the model's test accuracy by a further 8-10%. Our findings highlight the ongoing need for the development of Byzantine-resilient aggregation algorithms in FL.
{"title":"Advancing Hybrid Defense for Byzantine Attacks in Federated Learning","authors":"Kai Yue, Richeng Jin, Chau-Wai Wong, Huaiyu Dai","doi":"arxiv-2409.06474","DOIUrl":"https://doi.org/arxiv-2409.06474","url":null,"abstract":"Federated learning (FL) enables multiple clients to collaboratively train a\u0000global model without sharing their local data. Recent studies have highlighted\u0000the vulnerability of FL to Byzantine attacks, where malicious clients send\u0000poisoned updates to degrade model performance. Notably, many attacks have been\u0000developed targeting specific aggregation rules, whereas various defense\u0000mechanisms have been designed for dedicated threat models. This paper studies\u0000the resilience of an attack-agnostic FL scenario, where the server lacks prior\u0000knowledge of both the attackers' strategies and the number of malicious clients\u0000involved. We first introduce a hybrid defense against state-of-the-art attacks.\u0000Our goal is to identify a general-purpose aggregation rule that performs well\u0000on average while also avoiding worst-case vulnerabilities. By adaptively\u0000selecting from available defenses, we demonstrate that the server remains\u0000robust even when confronted with a substantial proportion of poisoned updates.\u0000To better understand this resilience, we then assess the attackers' capability\u0000using a proxy called client heterogeneity. We also emphasize that the existing\u0000FL defenses should not be regarded as secure, as demonstrated through the newly\u0000proposed Trapsetter attack. The proposed attack outperforms other\u0000state-of-the-art attacks by further reducing the model test accuracy by 8-10%.\u0000Our findings highlight the ongoing need for the development of\u0000Byzantine-resilient aggregation algorithms in FL.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
There is a pressing need to optimize the usage of Graphics Processing Units (GPUs), which have arguably become one of the most expensive and most sought-after IT resources. To help with this goal, several current-generation GPUs support a partitioning feature called Multi-Instance GPU (MIG), which allows multiple workloads to share a GPU, albeit with some constraints. In this paper, we investigate how to optimize the placement of Large Language Model (LLM)-based AI inferencing workloads on GPUs. We first identify and present several use cases encountered in practice that require workloads to be efficiently placed on, or migrated to, other GPUs to make room for incoming workloads. The overarching goal is to use as few GPUs as possible and to further minimize memory and compute wastage on the GPUs that are utilized. We have developed two approaches to address this problem: an optimization method and a heuristic method, and we benchmark them against two workload scheduling heuristics for multiple use cases. Our results show up to a 2.85x improvement in the number of GPUs used and up to a 70% reduction in GPU wastage over baseline heuristics. We plan to enable the SRE community to leverage our proposed method in production environments.
{"title":"Optimal Workload Placement on Multi-Instance GPUs","authors":"Bekir Turkkan, Pavankumar Murali, Pavithra Harsha, Rohan Arora, Gerard Vanloo, Chandra Narayanaswami","doi":"arxiv-2409.06646","DOIUrl":"https://doi.org/arxiv-2409.06646","url":null,"abstract":"There is an urgent and pressing need to optimize usage of Graphical\u0000Processing Units (GPUs), which have arguably become one of the most expensive\u0000and sought after IT resources. To help with this goal, several of the current\u0000generation of GPUs support a partitioning feature, called Multi-Instance GPU\u0000(MIG) to allow multiple workloads to share a GPU, albeit with some constraints.\u0000In this paper we investigate how to optimize the placement of Large Language\u0000Model (LLM)-based AI Inferencing workloads on GPUs. We first identify and\u0000present several use cases that are encountered in practice that require\u0000workloads to be efficiently placed or migrated to other GPUs to make room for\u0000incoming workloads. The overarching goal is to use as few GPUs as possible and\u0000to further minimize memory and compute wastage on GPUs that are utilized. We\u0000have developed two approaches to address this problem: an optimization method\u0000and a heuristic method. We benchmark these with two workload scheduling\u0000heuristics for multiple use cases. Our results show up to 2.85x improvement in\u0000the number of GPUs used and up to 70% reduction in GPU wastage over baseline\u0000heuristics. We plan to enable the SRE community to leverage our proposed method\u0000in production environments.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"410 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Ares de Parga, J. R. Bravo, N. Sibuet, J. A. Hernandez, R. Rossi, Stefan Boschert, Enrique S. Quintana-Ortí, Andrés E. Tomás, Cristian Cătălin Tatu, Fernando Vázquez-Novoa, Jorge Ejarque, Rosa M. Badia
The integration of Reduced Order Models (ROMs) with High-Performance Computing (HPC) is critical for developing digital twins, particularly for real-time monitoring and predictive maintenance of industrial systems. This paper describes a comprehensive, HPC-enabled workflow for developing and deploying projection-based ROMs (PROMs). We use PyCOMPSs' parallel framework to efficiently execute ROM training simulations, employing parallel Singular Value Decomposition (SVD) algorithms such as randomized SVD, Lanczos SVD, and full SVD based on Tall-Skinny QR. In addition, we introduce a partitioned version of the hyper-reduction scheme known as the Empirical Cubature Method. Despite the widespread use of HPC for PROMs, there is a significant lack of publications detailing comprehensive workflows for building and deploying end-to-end PROMs in HPC environments. Our workflow is validated through a case study focusing on the thermal dynamics of a motor. The PROM is designed to deliver a real-time prognosis tool that could enable rapid and safe motor restarts post-emergency shutdowns under different operating conditions for further integration into digital twins or control systems. To facilitate deployment, we use the HPC Workflow as a Service strategy and Functional Mock-Up Units to ensure compatibility and ease of integration across HPC, edge, and cloud environments. The outcomes illustrate the efficacy of combining PROMs and HPC, establishing a precedent for scalable, real-time digital twin applications across multiple industries.
{"title":"Parallel Reduced Order Modeling for Digital Twins using High-Performance Computing Workflows","authors":"S. Ares de Parga, J. R. Bravo, N. Sibuet, J. A. Hernandez, R. Rossi, Stefan Boschert, Enrique S. Quintana-Ortí, Andrés E. Tomás, Cristian Cătălin Tatu, Fernando Vázquez-Novoa, Jorge Ejarque, Rosa M. Badia","doi":"arxiv-2409.09080","DOIUrl":"https://doi.org/arxiv-2409.09080","url":null,"abstract":"The integration of Reduced Order Models (ROMs) with High-Performance\u0000Computing (HPC) is critical for developing digital twins, particularly for\u0000real-time monitoring and predictive maintenance of industrial systems. This\u0000paper describes a comprehensive, HPC-enabled workflow for developing and\u0000deploying projection-based ROMs (PROMs). We use PyCOMPSs' parallel framework to\u0000efficiently execute ROM training simulations, employing parallel Singular Value\u0000Decomposition (SVD) algorithms such as randomized SVD, Lanczos SVD, and full\u0000SVD based on Tall-Skinny QR. In addition, we introduce a partitioned version of\u0000the hyper-reduction scheme known as the Empirical Cubature Method. Despite the\u0000widespread use of HPC for PROMs, there is a significant lack of publications\u0000detailing comprehensive workflows for building and deploying end-to-end PROMs\u0000in HPC environments. Our workflow is validated through a case study focusing on\u0000the thermal dynamics of a motor. The PROM is designed to deliver a real-time\u0000prognosis tool that could enable rapid and safe motor restarts post-emergency\u0000shutdowns under different operating conditions for further integration into\u0000digital twins or control systems. To facilitate deployment, we use the HPC\u0000Workflow as a Service strategy and Functional Mock-Up Units to ensure\u0000compatibility and ease of integration across HPC, edge, and cloud environments.\u0000The outcomes illustrate the efficacy of combining PROMs and HPC, establishing a\u0000precedent for scalable, real-time digital twin applications across multiple\u0000industries.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142263652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large-scale scientific simulations generate massive datasets that pose significant challenges for storage and I/O. While traditional lossy compression techniques can improve performance, balancing compression ratio, data quality, and throughput remains difficult. To address this, we propose NeurLZ, a novel cross-field learning-based and error-controlled compression framework for scientific data. By integrating skipping DNN models, cross-field learning, and error control, our framework aims to substantially enhance lossy compression performance. Our contributions are three-fold: (1) We design a lightweight skipping model to provide high-fidelity detail retention, further improving prediction accuracy. (2) We adopt a cross-field learning approach to significantly improve data prediction accuracy, resulting in a substantially improved compression ratio. (3) We develop an error control approach to provide strict error bounds according to user requirements. We evaluated NeurLZ on several real-world HPC application datasets, including Nyx (cosmological simulation), Miranda (large turbulence simulation), and Hurricane (weather simulation). Experiments demonstrate that our framework achieves up to a 90% relative reduction in bit rate under the same data distortion, compared to the best existing approach.
{"title":"NeurLZ: On Systematically Enhancing Lossy Compression Performance for Scientific Data based on Neural Learning with Error Control","authors":"Wenqi Jia, Youyuan Liu, Zhewen Hu, Jinzhen Wang, Boyuan Zhang, Wei Niu, Junzhou Huang, Stavros Kalafatis, Sian Jin, Miao Yin","doi":"arxiv-2409.05785","DOIUrl":"https://doi.org/arxiv-2409.05785","url":null,"abstract":"Large-scale scientific simulations generate massive datasets that pose\u0000significant challenges for storage and I/O. While traditional lossy compression\u0000techniques can improve performance, balancing compression ratio, data quality,\u0000and throughput remains difficult. To address this, we propose NeurLZ, a novel\u0000cross-field learning-based and error-controlled compression framework for\u0000scientific data. By integrating skipping DNN models, cross-field learning, and\u0000error control, our framework aims to substantially enhance lossy compression\u0000performance. Our contributions are three-fold: (1) We design a lightweight\u0000skipping model to provide high-fidelity detail retention, further improving\u0000prediction accuracy. (2) We adopt a cross-field learning approach to\u0000significantly improve data prediction accuracy, resulting in a substantially\u0000improved compression ratio. (3) We develop an error control approach to provide\u0000strict error bounds according to user requirements. We evaluated NeurLZ on\u0000several real-world HPC application datasets, including Nyx (cosmological\u0000simulation), Miranda (large turbulence simulation), and Hurricane (weather\u0000simulation). Experiments demonstrate that our framework achieves up to a 90%\u0000relative reduction in bit rate under the same data distortion, compared to the\u0000best existing approach.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"12 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211118","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Marcel Gregoriadis, Leonhard Balduf, Björn Scheuermann, Johan Pouwelse
Data deduplication has emerged as a powerful solution for reducing storage and bandwidth costs by eliminating redundancies at the level of chunks. This has spurred the development of numerous Content-Defined Chunking (CDC) algorithms over the past two decades. Despite these advancements, the current state of the art remains obscure, as a thorough and impartial analysis and comparison is lacking. We conduct a rigorous theoretical analysis and impartial experimental comparison of several leading CDC algorithms. Using four realistic datasets, we evaluate these algorithms against four key metrics: throughput, deduplication ratio, average chunk size, and chunk-size variance. Our analyses, in many instances, extend the findings of their original publications by reporting new results and putting existing ones into context. Moreover, we highlight limitations that have previously gone unnoticed. Our findings provide valuable insights that inform the selection and optimization of CDC algorithms for practical applications in data deduplication.
{"title":"A Thorough Investigation of Content-Defined Chunking Algorithms for Data Deduplication","authors":"Marcel Gregoriadis, Leonhard Balduf, Björn Scheuermann, Johan Pouwelse","doi":"arxiv-2409.06066","DOIUrl":"https://doi.org/arxiv-2409.06066","url":null,"abstract":"Data deduplication emerged as a powerful solution for reducing storage and\u0000bandwidth costs by eliminating redundancies at the level of chunks. This has\u0000spurred the development of numerous Content-Defined Chunking (CDC) algorithms\u0000over the past two decades. Despite advancements, the current state-of-the-art\u0000remains obscure, as a thorough and impartial analysis and comparison is\u0000lacking. We conduct a rigorous theoretical analysis and impartial experimental\u0000comparison of several leading CDC algorithms. Using four realistic datasets, we\u0000evaluate these algorithms against four key metrics: throughput, deduplication\u0000ratio, average chunk size, and chunk-size variance. Our analyses, in many\u0000instances, extend the findings of their original publications by reporting new\u0000results and putting existing ones into context. Moreover, we highlight\u0000limitations that have previously gone unnoticed. Our findings provide valuable\u0000insights that inform the selection and optimization of CDC algorithms for\u0000practical applications in data deduplication.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"58 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reliable simulations are critical for analyzing and understanding complex systems, but their accuracy depends on correct input data. Incorrect inputs such as invalid or out-of-range values, missing data, and format inconsistencies can cause simulation crashes or unnoticed result distortions, ultimately undermining the validity of the conclusions. This paper presents a methodology for verifying the validity of input data in simulations, a process we term model input verification (MIV). We implement this approach in FabGuard, a toolset that uses established data schema and validation tools for the specific needs of simulation modeling. We introduce a formalism for categorizing MIV patterns and offer a streamlined verification pipeline that integrates into existing simulation workflows. FabGuard's applicability is demonstrated across three diverse domains: conflict-driven migration, disaster evacuation, and disease spread models. We also explore the use of Large Language Models (LLMs) for automating constraint generation and inference. In a case study with a migration simulation, LLMs not only correctly inferred 22 out of 23 developer-defined constraints, but also identified errors in existing constraints and proposed new, valid constraints. Our evaluation demonstrates that MIV is feasible on large datasets, with FabGuard efficiently processing 12,000 input files in 140 seconds and maintaining consistent performance across varying file sizes.
{"title":"Model Input Verification of Large Scale Simulations","authors":"Rumyana Neykova, Derek Groen","doi":"arxiv-2409.05768","DOIUrl":"https://doi.org/arxiv-2409.05768","url":null,"abstract":"Reliable simulations are critical for analyzing and understanding complex\u0000systems, but their accuracy depends on correct input data. Incorrect inputs\u0000such as invalid or out-of-range values, missing data, and format\u0000inconsistencies can cause simulation crashes or unnoticed result distortions,\u0000ultimately undermining the validity of the conclusions. This paper presents a\u0000methodology for verifying the validity of input data in simulations, a process\u0000we term model input verification (MIV). We implement this approach in FabGuard,\u0000a toolset that uses established data schema and validation tools for the\u0000specific needs of simulation modeling. We introduce a formalism for\u0000categorizing MIV patterns and offer a streamlined verification pipeline that\u0000integrates into existing simulation workflows. FabGuard's applicability is\u0000demonstrated across three diverse domains: conflict-driven migration, disaster\u0000evacuation, and disease spread models. We also explore the use of Large\u0000Language Models (LLMs) for automating constraint generation and inference. In a\u0000case study with a migration simulation, LLMs not only correctly inferred 22 out\u0000of 23 developer-defined constraints, but also identified errors in existing\u0000constraints and proposed new, valid constraints. Our evaluation demonstrates\u0000that MIV is feasible on large datasets, with FabGuard efficiently processing\u000012,000 input files in 140 seconds and maintaining consistent performance across\u0000varying file sizes.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shuangwei Gao, Peng Yang, Yuxin Kong, Feng Lyu, Ning Zhang
Artificial Intelligence Generated Content (AIGC) services can efficiently satisfy user-specified content creation demands, but the high computational requirements pose various challenges to supporting mobile users at scale. In this paper, we present our design of an edge-enabled AIGC service provisioning system that properly assigns computing tasks of generative models to edge servers, thereby improving overall user experience and reducing content generation latency. Specifically, once the edge server receives user-requested task prompts, it dynamically assigns appropriate models and allocates computing resources based on the features of each category of prompts. The generated content is then delivered to users. The key to this system is a proposed probabilistic model assignment approach, which estimates the quality score of generated content for each prompt based on category labels. Next, we introduce a heuristic algorithm that enables adaptive configuration of both generation steps and resource allocation, according to the various task requests received by each generative model on the edge. Simulation results demonstrate that the designed system can effectively enhance the quality of generated content by up to 4.7% while reducing response delay by up to 39.1% compared to benchmarks.
{"title":"Joint Model Assignment and Resource Allocation for Cost-Effective Mobile Generative Services","authors":"Shuangwei Gao, Peng Yang, Yuxin Kong, Feng Lyu, Ning Zhang","doi":"arxiv-2409.09072","DOIUrl":"https://doi.org/arxiv-2409.09072","url":null,"abstract":"Artificial Intelligence Generated Content (AIGC) services can efficiently\u0000satisfy user-specified content creation demands, but the high computational\u0000requirements pose various challenges to supporting mobile users at scale. In\u0000this paper, we present our design of an edge-enabled AIGC service provisioning\u0000system to properly assign computing tasks of generative models to edge servers,\u0000thereby improving overall user experience and reducing content generation\u0000latency. Specifically, once the edge server receives user requested task\u0000prompts, it dynamically assigns appropriate models and allocates computing\u0000resources based on features of each category of prompts. The generated contents\u0000are then delivered to users. The key to this system is a proposed probabilistic\u0000model assignment approach, which estimates the quality score of generated\u0000contents for each prompt based on category labels. Next, we introduce a\u0000heuristic algorithm that enables adaptive configuration of both generation\u0000steps and resource allocation, according to the various task requests received\u0000by each generative model on the edge.Simulation results demonstrate that the\u0000designed system can effectively enhance the quality of generated content by up\u0000to 4.7% while reducing response delay by up to 39.1% compared to benchmarks.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142263654","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Arturo Gonzalez-Escribano, Diego García-Álvarez, Jesús Cámara (Universidad de Valladolid, Spain)
We present an assignment for a full Parallel Computing course. Since 2017/2018, we have proposed a different problem each academic year to illustrate various methodologies for approaching the same computational problem using different parallel programming models. They are designed to be parallelized using shared-memory programming with OpenMP, distributed-memory programming with MPI, and GPU programming with CUDA or OpenCL. The problem chosen for this year implements a brute-force solution for exact DNA sequence alignment of multiple patterns. The program searches for exact coincidences of multiple nucleotide strings in a long DNA sequence. The sequential implementation is designed to be clear and understandable to students while offering many opportunities for parallelization and optimization. This assignment addresses key concepts many students find difficult to apply in practical scenarios: race conditions, reductions, collective operations, and point-to-point communications. It also covers the problem of parallel generation of pseudo-random sequences and strategies to notify and stop speculative computations when matches are found. This assignment serves as an exercise that reinforces basic knowledge and prepares students for more complex parallel computing concepts and structures. It has been successfully implemented as a practical assignment in a Parallel Computing course in the third year of a Computer Engineering degree program. Supporting materials for this and previous assignments in this series are publicly available.
{"title":"DNA sequence alignment: An assignment for OpenMP, MPI, and CUDA/OpenCL","authors":"Arturo Gonzalez-EscribanoUniversidad de Valladolid, Spain, Diego García-ÁlvarezUniversidad de Valladolid, Spain, Jesús CámaraUniversidad de Valladolid, Spain","doi":"arxiv-2409.06075","DOIUrl":"https://doi.org/arxiv-2409.06075","url":null,"abstract":"We present an assignment for a full Parallel Computing course. Since\u00002017/2018, we have proposed a different problem each academic year to\u0000illustrate various methodologies for approaching the same computational problem\u0000using different parallel programming models. They are designed to be\u0000parallelized using shared-memory programming with OpenMP, distributed-memory\u0000programming with MPI, and GPU programming with CUDA or OpenCL. The problem\u0000chosen for this year implements a brute-force solution for exact DNA sequence\u0000alignment of multiple patterns. The program searches for exact coincidences of\u0000multiple nucleotide strings in a long DNA sequence. The sequential\u0000implementation is designed to be clear and understandable to students while\u0000offering many opportunities for parallelization and optimization. This\u0000assignment addresses key concepts many students find difficult to apply in\u0000practical scenarios: race conditions, reductions, collective operations, and\u0000point-to-point communications. It also covers the problem of parallel\u0000generation of pseudo-random sequences and strategies to notify and stop\u0000speculative computations when matches are found. This assignment serves as an\u0000exercise that reinforces basic knowledge and prepares students for more complex\u0000parallel computing concepts and structures. It has been successfully\u0000implemented as a practical assignment in a Parallel Computing course in the\u0000third year of a Computer Engineering degree program. Supporting materials for\u0000this and previous assignments in this series are publicly available.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"80 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wangsong Yin, Rongjie Yi, Daliang Xu, Gang Huang, Mengwei Xu, Xuanzhe Liu
On-device Large Language Models (LLMs) are revolutionizing mobile AI, enabling applications such as UI automation while addressing privacy concerns. Currently, the standard approach involves deploying a single, robust LLM as a universal solution for various applications, often referred to as LLM-as-a-Service (LLMaaS). However, this approach faces a significant system challenge: existing LLMs lack the flexibility to accommodate the diverse Service-Level Objectives (SLOs) regarding inference latency across different applications. To address this issue, we introduce ELMS, an on-device LLM service designed to provide elasticity in both the model and prompt dimensions of an LLMaaS. The system includes two components: (i) a one-time neuron reordering technique, which utilizes the inherent permutation consistency within transformer models to create high-quality, elastic sub-models with minimal runtime switching costs; and (ii) a dual-head compact language model, which efficiently refines prompts and coordinates the elastic adaptation between the model and the prompt. We have implemented this elastic on-device LLM service on several commercial off-the-shelf (COTS) smartphones and evaluated ELMS using both standalone NLP/mobile-agent datasets and synthesized end-to-end traces. Across a range of SLOs, ELMS surpasses four strong baselines by up to 16.83% and 11.04% in absolute accuracy on average, with less than 1% Time-To-First-Token (TTFT) switching overhead, comparable memory usage, and fewer than 100 offline GPU hours.
{"title":"ELMS: Elasticized Large Language Models On Mobile Devices","authors":"Wangsong Yin, Rongjie Yi, Daliang Xu, Gang Huang, Mengwei Xu, Xuanzhe Liu","doi":"arxiv-2409.09071","DOIUrl":"https://doi.org/arxiv-2409.09071","url":null,"abstract":"On-device Large Language Models (LLMs) are revolutionizing mobile AI,\u0000enabling applications such as UI automation while addressing privacy concerns.\u0000Currently, the standard approach involves deploying a single, robust LLM as a\u0000universal solution for various applications, often referred to as\u0000LLM-as-a-Service (LLMaaS). However, this approach faces a significant system\u0000challenge: existing LLMs lack the flexibility to accommodate the diverse\u0000Service-Level Objectives (SLOs) regarding inference latency across different\u0000applications. To address this issue, we introduce ELMS, an on-device LLM\u0000service designed to provide elasticity in both the model and prompt dimensions\u0000of an LLMaaS. This system includes: A one-time neuron reordering technique,\u0000which utilizes the inherent permutation consistency within transformer models\u0000to create high-quality, elastic sub-models with minimal runtime switching\u0000costs. A dual-head compact language model, which efficiently refines prompts\u0000and coordinates the elastic adaptation between the model and the prompt. We\u0000have implemented this elastic on-device LLM service on several off-the-shelf\u0000(COTS) smartphones and evaluate ELMS using both standalone NLP/mobile-agent\u0000datasets and synthesized end-to-end traces. Across a range of SLOs, ELMS\u0000surpasses four strong baselines by up to 16.83% and 11.04% in absolute accuracy\u0000on average, with less than 1% Time-To-First-Token (TTFT) switching overhead,\u0000comparable memory usage, and fewer than 100 offline GPU hours.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142263657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jingfeng Wu, Minxian Xu, Yiyuan He, Kejiang Ye, Chengzhong Xu
Cloud-native applications are increasingly popular in modern software design, and employing a microservice-based architecture in these applications is a prevalent strategy that enhances system availability and flexibility. However, cloud-native applications also introduce new challenges, such as frequent inter-service communication and the complexity of managing heterogeneous codebases and hardware, resulting in unpredictable complexity and dynamism. Furthermore, as applications scale, only limited research teams or enterprises possess the resources for large-scale deployment and testing, which impedes progress in the cloud-native domain. To address these challenges, we propose CloudNativeSim, a simulator for cloud-native applications with a microservice-based architecture. CloudNativeSim offers several key benefits: (i) comprehensive and dynamic modeling for cloud-native applications, (ii) an extended simulation framework with new policy interfaces for scheduling cloud-native applications, and (iii) support for customized application scenarios and user feedback based on Quality of Service (QoS) metrics. CloudNativeSim can be easily deployed on standard computers to manage a high volume of requests and services. Its performance was validated through a case study, demonstrating more than 94.5% accuracy in terms of response time. The study further highlights the feasibility of CloudNativeSim by illustrating the effects of various scaling policies.
{"title":"CloudNativeSim: a toolkit for modeling and simulation of cloud-native applications","authors":"Jingfeng Wu, Minxian Xu, Yiyuan He, Kejiang Ye, Chengzhong Xu","doi":"arxiv-2409.05093","DOIUrl":"https://doi.org/arxiv-2409.05093","url":null,"abstract":"Cloud-native applications are increasingly becoming popular in modern\u0000software design. Employing a microservice-based architecture into these\u0000applications is a prevalent strategy that enhances system availability and\u0000flexibility. However, cloud-native applications also introduce new challenges,\u0000such as frequent inter-service communication and the complexity of managing\u0000heterogeneous codebases and hardware, resulting in unpredictable complexity and\u0000dynamism. Furthermore, as applications scale, only limited research teams or\u0000enterprises possess the resources for large-scale deployment and testing, which\u0000impedes progress in the cloud-native domain. To address these challenges, we\u0000propose CloudNativeSim, a simulator for cloud-native applications with a\u0000microservice-based architecture. CloudNativeSim offers several key benefits:\u0000(i) comprehensive and dynamic modeling for cloud-native applications, (ii) an\u0000extended simulation framework with new policy interfaces for scheduling\u0000cloud-native applications, and (iii) support for customized application\u0000scenarios and user feedback based on Quality of Service (QoS) metrics.\u0000CloudNativeSim can be easily deployed on standard computers to manage a high\u0000volume of requests and services. Its performance was validated through a case\u0000study, demonstrating higher than 94.5% accuracy in terms of response time. The\u0000study further highlights the feasibility of CloudNativeSim by illustrating the\u0000effects of various scaling policies.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"14 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}