Federated learning (FL) enables multiple clients to collaboratively train a global model without sharing their local data. Recent studies have highlighted the vulnerability of FL to Byzantine attacks, where malicious clients send poisoned updates to degrade model performance. Notably, many attacks have been developed targeting specific aggregation rules, whereas various defense mechanisms have been designed for dedicated threat models. This paper studies the resilience of an attack-agnostic FL scenario, where the server lacks prior knowledge of both the attackers' strategies and the number of malicious clients involved. We first introduce a hybrid defense against state-of-the-art attacks. Our goal is to identify a general-purpose aggregation rule that performs well on average while also avoiding worst-case vulnerabilities. By adaptively selecting from available defenses, we demonstrate that the server remains robust even when confronted with a substantial proportion of poisoned updates. To better understand this resilience, we then assess the attackers' capability using a proxy called client heterogeneity. We also emphasize that existing FL defenses should not be regarded as secure, as demonstrated through the newly proposed Trapsetter attack. The proposed attack outperforms other state-of-the-art attacks, reducing the model's test accuracy by a further 8-10%. Our findings highlight the ongoing need for the development of Byzantine-resilient aggregation algorithms in FL.
{"title":"Advancing Hybrid Defense for Byzantine Attacks in Federated Learning","authors":"Kai Yue, Richeng Jin, Chau-Wai Wong, Huaiyu Dai","doi":"arxiv-2409.06474","DOIUrl":"https://doi.org/arxiv-2409.06474","url":null,"abstract":"Federated learning (FL) enables multiple clients to collaboratively train a\u0000global model without sharing their local data. Recent studies have highlighted\u0000the vulnerability of FL to Byzantine attacks, where malicious clients send\u0000poisoned updates to degrade model performance. Notably, many attacks have been\u0000developed targeting specific aggregation rules, whereas various defense\u0000mechanisms have been designed for dedicated threat models. This paper studies\u0000the resilience of an attack-agnostic FL scenario, where the server lacks prior\u0000knowledge of both the attackers' strategies and the number of malicious clients\u0000involved. We first introduce a hybrid defense against state-of-the-art attacks.\u0000Our goal is to identify a general-purpose aggregation rule that performs well\u0000on average while also avoiding worst-case vulnerabilities. By adaptively\u0000selecting from available defenses, we demonstrate that the server remains\u0000robust even when confronted with a substantial proportion of poisoned updates.\u0000To better understand this resilience, we then assess the attackers' capability\u0000using a proxy called client heterogeneity. We also emphasize that the existing\u0000FL defenses should not be regarded as secure, as demonstrated through the newly\u0000proposed Trapsetter attack. The proposed attack outperforms other\u0000state-of-the-art attacks by further reducing the model test accuracy by 8-10%.\u0000Our findings highlight the ongoing need for the development of\u0000Byzantine-resilient aggregation algorithms in FL.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
There is a pressing need to optimize the usage of Graphics Processing Units (GPUs), which have arguably become one of the most expensive and most sought-after IT resources. To help with this goal, several current-generation GPUs support a partitioning feature called Multi-Instance GPU (MIG), which allows multiple workloads to share a GPU, albeit with some constraints. In this paper, we investigate how to optimize the placement of Large Language Model (LLM)-based AI inferencing workloads on GPUs. We first identify and present several use cases encountered in practice that require workloads to be efficiently placed on, or migrated to, other GPUs to make room for incoming workloads. The overarching goal is to use as few GPUs as possible and to further minimize memory and compute wastage on the GPUs that are utilized. We have developed two approaches to address this problem: an optimization method and a heuristic method, and we benchmark them against two workload scheduling heuristics for multiple use cases. Our results show up to a 2.85x improvement in the number of GPUs used and up to a 70% reduction in GPU wastage over baseline heuristics. We plan to enable the SRE community to leverage our proposed method in production environments.
{"title":"Optimal Workload Placement on Multi-Instance GPUs","authors":"Bekir Turkkan, Pavankumar Murali, Pavithra Harsha, Rohan Arora, Gerard Vanloo, Chandra Narayanaswami","doi":"arxiv-2409.06646","DOIUrl":"https://doi.org/arxiv-2409.06646","url":null,"abstract":"There is an urgent and pressing need to optimize usage of Graphical\u0000Processing Units (GPUs), which have arguably become one of the most expensive\u0000and sought after IT resources. To help with this goal, several of the current\u0000generation of GPUs support a partitioning feature, called Multi-Instance GPU\u0000(MIG) to allow multiple workloads to share a GPU, albeit with some constraints.\u0000In this paper we investigate how to optimize the placement of Large Language\u0000Model (LLM)-based AI Inferencing workloads on GPUs. We first identify and\u0000present several use cases that are encountered in practice that require\u0000workloads to be efficiently placed or migrated to other GPUs to make room for\u0000incoming workloads. The overarching goal is to use as few GPUs as possible and\u0000to further minimize memory and compute wastage on GPUs that are utilized. We\u0000have developed two approaches to address this problem: an optimization method\u0000and a heuristic method. We benchmark these with two workload scheduling\u0000heuristics for multiple use cases. Our results show up to 2.85x improvement in\u0000the number of GPUs used and up to 70% reduction in GPU wastage over baseline\u0000heuristics. We plan to enable the SRE community to leverage our proposed method\u0000in production environments.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"410 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Ares de Parga, J. R. Bravo, N. Sibuet, J. A. Hernandez, R. Rossi, Stefan Boschert, Enrique S. Quintana-Ortí, Andrés E. Tomás, Cristian Cătălin Tatu, Fernando Vázquez-Novoa, Jorge Ejarque, Rosa M. Badia
The integration of Reduced Order Models (ROMs) with High-Performance Computing (HPC) is critical for developing digital twins, particularly for real-time monitoring and predictive maintenance of industrial systems. This paper describes a comprehensive, HPC-enabled workflow for developing and deploying projection-based ROMs (PROMs). We use PyCOMPSs' parallel framework to efficiently execute ROM training simulations, employing parallel Singular Value Decomposition (SVD) algorithms such as randomized SVD, Lanczos SVD, and full SVD based on Tall-Skinny QR. In addition, we introduce a partitioned version of the hyper-reduction scheme known as the Empirical Cubature Method. Despite the widespread use of HPC for PROMs, there is a significant lack of publications detailing comprehensive workflows for building and deploying end-to-end PROMs in HPC environments. Our workflow is validated through a case study focusing on the thermal dynamics of a motor. The PROM is designed to deliver a real-time prognosis tool that could enable rapid and safe motor restarts post-emergency shutdowns under different operating conditions for further integration into digital twins or control systems. To facilitate deployment, we use the HPC Workflow as a Service strategy and Functional Mock-Up Units to ensure compatibility and ease of integration across HPC, edge, and cloud environments. The outcomes illustrate the efficacy of combining PROMs and HPC, establishing a precedent for scalable, real-time digital twin applications across multiple industries.
{"title":"Parallel Reduced Order Modeling for Digital Twins using High-Performance Computing Workflows","authors":"S. Ares de Parga, J. R. Bravo, N. Sibuet, J. A. Hernandez, R. Rossi, Stefan Boschert, Enrique S. Quintana-Ortí, Andrés E. Tomás, Cristian Cătălin Tatu, Fernando Vázquez-Novoa, Jorge Ejarque, Rosa M. Badia","doi":"arxiv-2409.09080","DOIUrl":"https://doi.org/arxiv-2409.09080","url":null,"abstract":"The integration of Reduced Order Models (ROMs) with High-Performance\u0000Computing (HPC) is critical for developing digital twins, particularly for\u0000real-time monitoring and predictive maintenance of industrial systems. This\u0000paper describes a comprehensive, HPC-enabled workflow for developing and\u0000deploying projection-based ROMs (PROMs). We use PyCOMPSs' parallel framework to\u0000efficiently execute ROM training simulations, employing parallel Singular Value\u0000Decomposition (SVD) algorithms such as randomized SVD, Lanczos SVD, and full\u0000SVD based on Tall-Skinny QR. In addition, we introduce a partitioned version of\u0000the hyper-reduction scheme known as the Empirical Cubature Method. Despite the\u0000widespread use of HPC for PROMs, there is a significant lack of publications\u0000detailing comprehensive workflows for building and deploying end-to-end PROMs\u0000in HPC environments. Our workflow is validated through a case study focusing on\u0000the thermal dynamics of a motor. The PROM is designed to deliver a real-time\u0000prognosis tool that could enable rapid and safe motor restarts post-emergency\u0000shutdowns under different operating conditions for further integration into\u0000digital twins or control systems. To facilitate deployment, we use the HPC\u0000Workflow as a Service strategy and Functional Mock-Up Units to ensure\u0000compatibility and ease of integration across HPC, edge, and cloud environments.\u0000The outcomes illustrate the efficacy of combining PROMs and HPC, establishing a\u0000precedent for scalable, real-time digital twin applications across multiple\u0000industries.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142263652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large-scale scientific simulations generate massive datasets that pose significant challenges for storage and I/O. While traditional lossy compression techniques can improve performance, balancing compression ratio, data quality, and throughput remains difficult. To address this, we propose NeurLZ, a novel cross-field learning-based and error-controlled compression framework for scientific data. By integrating skipping DNN models, cross-field learning, and error control, our framework aims to substantially enhance lossy compression performance. Our contributions are three-fold: (1) We design a lightweight skipping model to provide high-fidelity detail retention, further improving prediction accuracy. (2) We adopt a cross-field learning approach to significantly improve data prediction accuracy, resulting in a substantially improved compression ratio. (3) We develop an error control approach to provide strict error bounds according to user requirements. We evaluated NeurLZ on several real-world HPC application datasets, including Nyx (cosmological simulation), Miranda (large turbulence simulation), and Hurricane (weather simulation). Experiments demonstrate that our framework achieves up to a 90% relative reduction in bit rate under the same data distortion, compared to the best existing approach.
{"title":"NeurLZ: On Systematically Enhancing Lossy Compression Performance for Scientific Data based on Neural Learning with Error Control","authors":"Wenqi Jia, Youyuan Liu, Zhewen Hu, Jinzhen Wang, Boyuan Zhang, Wei Niu, Junzhou Huang, Stavros Kalafatis, Sian Jin, Miao Yin","doi":"arxiv-2409.05785","DOIUrl":"https://doi.org/arxiv-2409.05785","url":null,"abstract":"Large-scale scientific simulations generate massive datasets that pose\u0000significant challenges for storage and I/O. While traditional lossy compression\u0000techniques can improve performance, balancing compression ratio, data quality,\u0000and throughput remains difficult. To address this, we propose NeurLZ, a novel\u0000cross-field learning-based and error-controlled compression framework for\u0000scientific data. By integrating skipping DNN models, cross-field learning, and\u0000error control, our framework aims to substantially enhance lossy compression\u0000performance. Our contributions are three-fold: (1) We design a lightweight\u0000skipping model to provide high-fidelity detail retention, further improving\u0000prediction accuracy. (2) We adopt a cross-field learning approach to\u0000significantly improve data prediction accuracy, resulting in a substantially\u0000improved compression ratio. (3) We develop an error control approach to provide\u0000strict error bounds according to user requirements. We evaluated NeurLZ on\u0000several real-world HPC application datasets, including Nyx (cosmological\u0000simulation), Miranda (large turbulence simulation), and Hurricane (weather\u0000simulation). Experiments demonstrate that our framework achieves up to a 90%\u0000relative reduction in bit rate under the same data distortion, compared to the\u0000best existing approach.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"12 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211118","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Marcel Gregoriadis, Leonhard Balduf, Björn Scheuermann, Johan Pouwelse
Data deduplication has emerged as a powerful solution for reducing storage and bandwidth costs by eliminating redundancies at the level of chunks. This has spurred the development of numerous Content-Defined Chunking (CDC) algorithms over the past two decades. Despite these advancements, the current state of the art remains obscure, as a thorough and impartial analysis and comparison is lacking. We conduct a rigorous theoretical analysis and impartial experimental comparison of several leading CDC algorithms. Using four realistic datasets, we evaluate these algorithms against four key metrics: throughput, deduplication ratio, average chunk size, and chunk-size variance. Our analyses, in many instances, extend the findings of their original publications by reporting new results and putting existing ones into context. Moreover, we highlight limitations that have previously gone unnoticed. Our findings provide valuable insights that inform the selection and optimization of CDC algorithms for practical applications in data deduplication.
{"title":"A Thorough Investigation of Content-Defined Chunking Algorithms for Data Deduplication","authors":"Marcel Gregoriadis, Leonhard Balduf, Björn Scheuermann, Johan Pouwelse","doi":"arxiv-2409.06066","DOIUrl":"https://doi.org/arxiv-2409.06066","url":null,"abstract":"Data deduplication emerged as a powerful solution for reducing storage and\u0000bandwidth costs by eliminating redundancies at the level of chunks. This has\u0000spurred the development of numerous Content-Defined Chunking (CDC) algorithms\u0000over the past two decades. Despite advancements, the current state-of-the-art\u0000remains obscure, as a thorough and impartial analysis and comparison is\u0000lacking. We conduct a rigorous theoretical analysis and impartial experimental\u0000comparison of several leading CDC algorithms. Using four realistic datasets, we\u0000evaluate these algorithms against four key metrics: throughput, deduplication\u0000ratio, average chunk size, and chunk-size variance. Our analyses, in many\u0000instances, extend the findings of their original publications by reporting new\u0000results and putting existing ones into context. Moreover, we highlight\u0000limitations that have previously gone unnoticed. Our findings provide valuable\u0000insights that inform the selection and optimization of CDC algorithms for\u0000practical applications in data deduplication.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"58 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reliable simulations are critical for analyzing and understanding complex systems, but their accuracy depends on correct input data. Incorrect inputs such as invalid or out-of-range values, missing data, and format inconsistencies can cause simulation crashes or unnoticed result distortions, ultimately undermining the validity of the conclusions. This paper presents a methodology for verifying the validity of input data in simulations, a process we term model input verification (MIV). We implement this approach in FabGuard, a toolset that uses established data schema and validation tools for the specific needs of simulation modeling. We introduce a formalism for categorizing MIV patterns and offer a streamlined verification pipeline that integrates into existing simulation workflows. FabGuard's applicability is demonstrated across three diverse domains: conflict-driven migration, disaster evacuation, and disease spread models. We also explore the use of Large Language Models (LLMs) for automating constraint generation and inference. In a case study with a migration simulation, LLMs not only correctly inferred 22 out of 23 developer-defined constraints, but also identified errors in existing constraints and proposed new, valid constraints. Our evaluation demonstrates that MIV is feasible on large datasets, with FabGuard efficiently processing 12,000 input files in 140 seconds and maintaining consistent performance across varying file sizes.
{"title":"Model Input Verification of Large Scale Simulations","authors":"Rumyana Neykova, Derek Groen","doi":"arxiv-2409.05768","DOIUrl":"https://doi.org/arxiv-2409.05768","url":null,"abstract":"Reliable simulations are critical for analyzing and understanding complex\u0000systems, but their accuracy depends on correct input data. Incorrect inputs\u0000such as invalid or out-of-range values, missing data, and format\u0000inconsistencies can cause simulation crashes or unnoticed result distortions,\u0000ultimately undermining the validity of the conclusions. This paper presents a\u0000methodology for verifying the validity of input data in simulations, a process\u0000we term model input verification (MIV). We implement this approach in FabGuard,\u0000a toolset that uses established data schema and validation tools for the\u0000specific needs of simulation modeling. We introduce a formalism for\u0000categorizing MIV patterns and offer a streamlined verification pipeline that\u0000integrates into existing simulation workflows. FabGuard's applicability is\u0000demonstrated across three diverse domains: conflict-driven migration, disaster\u0000evacuation, and disease spread models. We also explore the use of Large\u0000Language Models (LLMs) for automating constraint generation and inference. In a\u0000case study with a migration simulation, LLMs not only correctly inferred 22 out\u0000of 23 developer-defined constraints, but also identified errors in existing\u0000constraints and proposed new, valid constraints. Our evaluation demonstrates\u0000that MIV is feasible on large datasets, with FabGuard efficiently processing\u000012,000 input files in 140 seconds and maintaining consistent performance across\u0000varying file sizes.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shuangwei Gao, Peng Yang, Yuxin Kong, Feng Lyu, Ning Zhang
Artificial Intelligence Generated Content (AIGC) services can efficiently satisfy user-specified content creation demands, but the high computational requirements pose various challenges to supporting mobile users at scale. In this paper, we present our design of an edge-enabled AIGC service provisioning system that properly assigns computing tasks of generative models to edge servers, thereby improving overall user experience and reducing content generation latency. Specifically, once the edge server receives user-requested task prompts, it dynamically assigns appropriate models and allocates computing resources based on the features of each category of prompts. The generated content is then delivered to users. The key to this system is a proposed probabilistic model assignment approach, which estimates the quality score of generated content for each prompt based on category labels. Next, we introduce a heuristic algorithm that enables adaptive configuration of both generation steps and resource allocation, according to the various task requests received by each generative model on the edge. Simulation results demonstrate that the designed system can effectively enhance the quality of generated content by up to 4.7% while reducing response delay by up to 39.1% compared to benchmarks.
{"title":"Joint Model Assignment and Resource Allocation for Cost-Effective Mobile Generative Services","authors":"Shuangwei Gao, Peng Yang, Yuxin Kong, Feng Lyu, Ning Zhang","doi":"arxiv-2409.09072","DOIUrl":"https://doi.org/arxiv-2409.09072","url":null,"abstract":"Artificial Intelligence Generated Content (AIGC) services can efficiently\u0000satisfy user-specified content creation demands, but the high computational\u0000requirements pose various challenges to supporting mobile users at scale. In\u0000this paper, we present our design of an edge-enabled AIGC service provisioning\u0000system to properly assign computing tasks of generative models to edge servers,\u0000thereby improving overall user experience and reducing content generation\u0000latency. Specifically, once the edge server receives user requested task\u0000prompts, it dynamically assigns appropriate models and allocates computing\u0000resources based on features of each category of prompts. The generated contents\u0000are then delivered to users. The key to this system is a proposed probabilistic\u0000model assignment approach, which estimates the quality score of generated\u0000contents for each prompt based on category labels. Next, we introduce a\u0000heuristic algorithm that enables adaptive configuration of both generation\u0000steps and resource allocation, according to the various task requests received\u0000by each generative model on the edge.Simulation results demonstrate that the\u0000designed system can effectively enhance the quality of generated content by up\u0000to 4.7% while reducing response delay by up to 39.1% compared to benchmarks.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142263654","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Arturo Gonzalez-Escribano, Diego García-Álvarez, Jesús Cámara (Universidad de Valladolid, Spain)
We present an assignment for a full Parallel Computing course. Since 2017/2018, we have proposed a different problem each academic year to illustrate various methodologies for approaching the same computational problem using different parallel programming models. They are designed to be parallelized using shared-memory programming with OpenMP, distributed-memory programming with MPI, and GPU programming with CUDA or OpenCL. The problem chosen for this year implements a brute-force solution for exact DNA sequence alignment of multiple patterns. The program searches for exact coincidences of multiple nucleotide strings in a long DNA sequence. The sequential implementation is designed to be clear and understandable to students while offering many opportunities for parallelization and optimization. This assignment addresses key concepts many students find difficult to apply in practical scenarios: race conditions, reductions, collective operations, and point-to-point communications. It also covers the problem of parallel generation of pseudo-random sequences and strategies to notify and stop speculative computations when matches are found. This assignment serves as an exercise that reinforces basic knowledge and prepares students for more complex parallel computing concepts and structures. It has been successfully implemented as a practical assignment in a Parallel Computing course in the third year of a Computer Engineering degree program. Supporting materials for this and previous assignments in this series are publicly available.
{"title":"DNA sequence alignment: An assignment for OpenMP, MPI, and CUDA/OpenCL","authors":"Arturo Gonzalez-EscribanoUniversidad de Valladolid, Spain, Diego García-ÁlvarezUniversidad de Valladolid, Spain, Jesús CámaraUniversidad de Valladolid, Spain","doi":"arxiv-2409.06075","DOIUrl":"https://doi.org/arxiv-2409.06075","url":null,"abstract":"We present an assignment for a full Parallel Computing course. Since\u00002017/2018, we have proposed a different problem each academic year to\u0000illustrate various methodologies for approaching the same computational problem\u0000using different parallel programming models. They are designed to be\u0000parallelized using shared-memory programming with OpenMP, distributed-memory\u0000programming with MPI, and GPU programming with CUDA or OpenCL. The problem\u0000chosen for this year implements a brute-force solution for exact DNA sequence\u0000alignment of multiple patterns. The program searches for exact coincidences of\u0000multiple nucleotide strings in a long DNA sequence. The sequential\u0000implementation is designed to be clear and understandable to students while\u0000offering many opportunities for parallelization and optimization. This\u0000assignment addresses key concepts many students find difficult to apply in\u0000practical scenarios: race conditions, reductions, collective operations, and\u0000point-to-point communications. It also covers the problem of parallel\u0000generation of pseudo-random sequences and strategies to notify and stop\u0000speculative computations when matches are found. This assignment serves as an\u0000exercise that reinforces basic knowledge and prepares students for more complex\u0000parallel computing concepts and structures. It has been successfully\u0000implemented as a practical assignment in a Parallel Computing course in the\u0000third year of a Computer Engineering degree program. Supporting materials for\u0000this and previous assignments in this series are publicly available.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"80 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wangsong Yin, Rongjie Yi, Daliang Xu, Gang Huang, Mengwei Xu, Xuanzhe Liu
On-device Large Language Models (LLMs) are revolutionizing mobile AI, enabling applications such as UI automation while addressing privacy concerns. Currently, the standard approach involves deploying a single, robust LLM as a universal solution for various applications, often referred to as LLM-as-a-Service (LLMaaS). However, this approach faces a significant system challenge: existing LLMs lack the flexibility to accommodate the diverse Service-Level Objectives (SLOs) regarding inference latency across different applications. To address this issue, we introduce ELMS, an on-device LLM service designed to provide elasticity in both the model and prompt dimensions of an LLMaaS. The system includes two components: (i) a one-time neuron reordering technique, which utilizes the inherent permutation consistency within transformer models to create high-quality, elastic sub-models with minimal runtime switching costs; and (ii) a dual-head compact language model, which efficiently refines prompts and coordinates the elastic adaptation between the model and the prompt. We have implemented this elastic on-device LLM service on several commercial off-the-shelf (COTS) smartphones and evaluated ELMS using both standalone NLP/mobile-agent datasets and synthesized end-to-end traces. Across a range of SLOs, ELMS surpasses four strong baselines by up to 16.83% and 11.04% in absolute accuracy on average, with less than 1% Time-To-First-Token (TTFT) switching overhead, comparable memory usage, and fewer than 100 offline GPU hours.
{"title":"ELMS: Elasticized Large Language Models On Mobile Devices","authors":"Wangsong Yin, Rongjie Yi, Daliang Xu, Gang Huang, Mengwei Xu, Xuanzhe Liu","doi":"arxiv-2409.09071","DOIUrl":"https://doi.org/arxiv-2409.09071","url":null,"abstract":"On-device Large Language Models (LLMs) are revolutionizing mobile AI,\u0000enabling applications such as UI automation while addressing privacy concerns.\u0000Currently, the standard approach involves deploying a single, robust LLM as a\u0000universal solution for various applications, often referred to as\u0000LLM-as-a-Service (LLMaaS). However, this approach faces a significant system\u0000challenge: existing LLMs lack the flexibility to accommodate the diverse\u0000Service-Level Objectives (SLOs) regarding inference latency across different\u0000applications. To address this issue, we introduce ELMS, an on-device LLM\u0000service designed to provide elasticity in both the model and prompt dimensions\u0000of an LLMaaS. This system includes: A one-time neuron reordering technique,\u0000which utilizes the inherent permutation consistency within transformer models\u0000to create high-quality, elastic sub-models with minimal runtime switching\u0000costs. A dual-head compact language model, which efficiently refines prompts\u0000and coordinates the elastic adaptation between the model and the prompt. We\u0000have implemented this elastic on-device LLM service on several off-the-shelf\u0000(COTS) smartphones and evaluate ELMS using both standalone NLP/mobile-agent\u0000datasets and synthesized end-to-end traces. Across a range of SLOs, ELMS\u0000surpasses four strong baselines by up to 16.83% and 11.04% in absolute accuracy\u0000on average, with less than 1% Time-To-First-Token (TTFT) switching overhead,\u0000comparable memory usage, and fewer than 100 offline GPU hours.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142263657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jingfeng Wu, Minxian Xu, Yiyuan He, Kejiang Ye, Chengzhong Xu
Cloud-native applications are increasingly popular in modern software design, and employing a microservice-based architecture in these applications is a prevalent strategy that enhances system availability and flexibility. However, cloud-native applications also introduce new challenges, such as frequent inter-service communication and the complexity of managing heterogeneous codebases and hardware, resulting in unpredictable complexity and dynamism. Furthermore, as applications scale, only limited research teams or enterprises possess the resources for large-scale deployment and testing, which impedes progress in the cloud-native domain. To address these challenges, we propose CloudNativeSim, a simulator for cloud-native applications with a microservice-based architecture. CloudNativeSim offers several key benefits: (i) comprehensive and dynamic modeling for cloud-native applications, (ii) an extended simulation framework with new policy interfaces for scheduling cloud-native applications, and (iii) support for customized application scenarios and user feedback based on Quality of Service (QoS) metrics. CloudNativeSim can be easily deployed on standard computers to manage a high volume of requests and services. Its performance was validated through a case study, demonstrating more than 94.5% accuracy in terms of response time. The study further highlights the feasibility of CloudNativeSim by illustrating the effects of various scaling policies.
{"title":"CloudNativeSim: a toolkit for modeling and simulation of cloud-native applications","authors":"Jingfeng Wu, Minxian Xu, Yiyuan He, Kejiang Ye, Chengzhong Xu","doi":"arxiv-2409.05093","DOIUrl":"https://doi.org/arxiv-2409.05093","url":null,"abstract":"Cloud-native applications are increasingly becoming popular in modern\u0000software design. Employing a microservice-based architecture into these\u0000applications is a prevalent strategy that enhances system availability and\u0000flexibility. However, cloud-native applications also introduce new challenges,\u0000such as frequent inter-service communication and the complexity of managing\u0000heterogeneous codebases and hardware, resulting in unpredictable complexity and\u0000dynamism. Furthermore, as applications scale, only limited research teams or\u0000enterprises possess the resources for large-scale deployment and testing, which\u0000impedes progress in the cloud-native domain. To address these challenges, we\u0000propose CloudNativeSim, a simulator for cloud-native applications with a\u0000microservice-based architecture. CloudNativeSim offers several key benefits:\u0000(i) comprehensive and dynamic modeling for cloud-native applications, (ii) an\u0000extended simulation framework with new policy interfaces for scheduling\u0000cloud-native applications, and (iii) support for customized application\u0000scenarios and user feedback based on Quality of Service (QoS) metrics.\u0000CloudNativeSim can be easily deployed on standard computers to manage a high\u0000volume of requests and services. Its performance was validated through a case\u0000study, demonstrating higher than 94.5% accuracy in terms of response time. The\u0000study further highlights the feasibility of CloudNativeSim by illustrating the\u0000effects of various scaling policies.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"14 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}