Lang Xu, Quentin Anthony, Qinghua Zhou, Nawras Alnaasan, Radha R. Gulhane, Aamir Shafi, Hari Subramoni, Dhabaleswar K. Panda
Data Parallelism (DP), Tensor Parallelism (TP), and Pipeline Parallelism (PP) are the three strategies widely adopted to enable fast and efficient Large Language Model (LLM) training. However, these approaches rely on data-intensive communication routines to collect, aggregate, and redistribute gradients, activations, and other important model information, which introduces significant overhead. Co-designed with GPU-based compression libraries, MPI libraries have been proven to reduce message sizes significantly and better leverage interconnect bandwidth, increasing training efficiency while maintaining acceptable accuracy. In this work, we investigate the efficacy of compression-assisted MPI collectives in the context of distributed LLM training with 3D parallelism and ZeRO optimizations, scaling up to 192 V100 GPUs on the Lassen supercomputer. First, we enabled a naïve compression scheme across all collectives and observed a 22.5% increase in TFLOPS per GPU and a 23.6% increase in samples per second for GPT-NeoX-20B training. However, this strategy ignores the sparsity discrepancy among the messages communicated in each parallelism degree, introducing more error and degrading the training loss. We therefore incorporated hybrid compression settings for each parallel dimension, adjusting the compression intensity accordingly. Given their low-rank structure (arXiv:2301.02654), we apply aggressive compression on gradients during the DP All-reduce, and milder compression to preserve precision when communicating activations, optimizer states, and model parameters in TP and PP. With this hybrid compression scheme, we demonstrate a 17.3% increase in TFLOPS per GPU and a 12.7% increase in samples per second while matching baseline loss convergence.
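A minimal sketch of the hybrid scheme's core idea, picking a different compression intensity per parallel dimension. The `CompressionConfig` class, the algorithm label, and the error-bound values are hypothetical placeholders, not the paper's actual settings:

```python
from dataclasses import dataclass

@dataclass
class CompressionConfig:
    algorithm: str      # label for a GPU-side compressor (illustrative)
    error_bound: float  # larger bound = more aggressive compression

def select_compression(parallel_dim: str) -> CompressionConfig:
    """Pick a compression intensity based on which parallel dimension a
    message belongs to: aggressive for the DP gradient All-reduce (gradients
    are low-rank and tolerate more error), milder for TP/PP traffic that
    carries activations, optimizer states, and parameters."""
    if parallel_dim == "DP":
        return CompressionConfig("fixed-rate", error_bound=1e-2)
    elif parallel_dim in ("TP", "PP"):
        return CompressionConfig("fixed-rate", error_bound=1e-4)
    raise ValueError(f"unknown parallel dimension: {parallel_dim}")
```

A runtime built on this idea would consult `select_compression` once per collective call, before handing the buffer to the compression-assisted MPI library.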
"Accelerating Large Language Model Training with Hybrid GPU-based Compression", arXiv:2409.02423, arXiv - CS - Distributed, Parallel, and Cluster Computing, published 2024-09-04.
Computation offloading with low latency and low energy consumption is crucial for resource-limited mobile devices. This paper proposes an offloading decision-making model using federated learning. Based on the task type and the user input, the model predicts whether the task is computationally intensive. If it is, the model then uses network parameters to predict whether to offload the task or execute it locally, and the task is handled accordingly. The proposed method is implemented in a real-time environment, and experimental results show that it achieves above 90% prediction accuracy in offloading decision-making. The results also show that the proposed offloading method reduces the response time and energy consumption of the user device by ~11-31% for computationally intensive tasks. A partial computation offloading method for federated learning is also proposed and implemented, in which devices that are unable to analyse a large number of data samples offload part of their local datasets to the edge server. Cryptography is used for secure data transmission; experiments show that encryption and decryption increase the total time by only 0.05-0.16%. The proposed partial computation offloading method for federated learning achieves a prediction accuracy above 98% for the global model.
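The two-stage decision described above can be sketched as a simple rule. In the paper both predictions come from a federated-learning model; the thresholds and parameter names here are purely illustrative:

```python
def decide_offload(is_compute_intensive: bool,
                   bandwidth_mbps: float,
                   latency_ms: float) -> str:
    """Two-stage offloading decision sketch: lightweight tasks always run
    locally; computationally intensive tasks are offloaded only when the
    network is good enough. Thresholds are illustrative placeholders, not
    values from the paper (which learns the decision instead)."""
    if not is_compute_intensive:          # stage 1: task-type prediction
        return "local"
    if bandwidth_mbps >= 50 and latency_ms <= 20:  # stage 2: network check
        return "offload"
    return "local"
```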
"A Joint Time and Energy-Efficient Federated Learning-based Computation Offloading Method for Mobile Edge Computing" by Anwesha Mukherjee and Rajkumar Buyya, arXiv:2409.02548, published 2024-09-04.
Victor Jarlow, Charalampos Stylianopoulos, Marina Papatriantafilou
The frequent elements problem, a key component in demanding stream-data analytics, involves selecting elements whose occurrence exceeds a user-specified threshold. Fast, memory-efficient $\epsilon$-approximate synopsis algorithms select all frequent elements but may overestimate their counts depending on the user-defined parameter $\epsilon$. Evolving applications demand performance only achievable through parallelization; however, algorithmic guarantees concerning concurrent updates and queries have been overlooked. We propose Query and Parallelism Optimized Space-Saving (QPOPSS), which provides concurrency guarantees. The design includes an implementation of the Space-Saving algorithm supporting fast queries with minimal overlap with concurrent updates. QPOPSS integrates this with distribution of work and fine-grained synchronization among threads, balancing high throughput, high accuracy, and low memory consumption. Our analysis, under various concurrency and data distribution conditions, establishes space and approximation bounds. Our empirical evaluation against representative state-of-the-art methods reveals that QPOPSS's multi-threaded throughput scales linearly while maintaining the highest accuracy, with an orders-of-magnitude smaller memory footprint.
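For context, the sequential Space-Saving synopsis that QPOPSS parallelizes can be sketched in a few lines. This is the textbook algorithm, not QPOPSS's concurrent implementation:

```python
class SpaceSaving:
    """Sequential Space-Saving synopsis: tracks at most k counters; when a
    new element arrives and the table is full, the minimum counter is
    evicted and its count inherited. This bounds overestimation by n/k,
    i.e. epsilon * n for k = 1/epsilon after n updates."""

    def __init__(self, k: int):
        self.k = k
        self.counters: dict = {}

    def update(self, item) -> None:
        if item in self.counters:
            self.counters[item] += 1
        elif len(self.counters) < self.k:
            self.counters[item] = 1
        else:
            # Evict the minimum counter; the newcomer inherits its count.
            victim = min(self.counters, key=self.counters.get)
            count = self.counters.pop(victim)
            self.counters[item] = count + 1

    def query(self, threshold: int):
        """Return all tracked elements whose (over)estimate meets the
        threshold; guaranteed to include every truly frequent element."""
        return [x for x, c in self.counters.items() if c >= threshold]
```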
"QPOPSS: Query and Parallelism Optimized Space-Saving for Finding Frequent Stream Elements", arXiv:2409.01749, published 2024-09-03.
Marina Moran, Javier Balladini, Dolores Rexachs, Emilio Luque
The fault tolerance method currently used in High Performance Computing (HPC) is rollback-recovery based on checkpoints. Like any other fault tolerance method, it adds energy consumption on top of that of the application's execution. The objective of this work is to determine the factors that affect the energy consumption of compute nodes in a homogeneous cluster when performing checkpoint and restart operations on SPMD (Single Program Multiple Data) applications. We focus on the energy behavior of compute nodes under different configurations of hardware and software parameters, studying the effect of processor performance states (P-states) and power states (C-states), application problem size, checkpoint software (DMTCP), and distributed file system (NFS) configuration. The analysis of the results identifies opportunities to reduce the energy consumption of checkpoint and restart operations.
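As a rough illustration of the trade-off being measured: checkpoint energy is average node power times checkpoint duration, so a lower P-state that reduces power but lengthens the checkpoint does not necessarily save energy. The power and duration figures below are hypothetical, not measurements from the paper:

```python
def checkpoint_energy(avg_power_watts: float, duration_s: float) -> float:
    """Energy of one checkpoint operation in joules, modeled as average
    node power times duration. Illustrates the accounting used when
    comparing P-state/C-state configurations; inputs are hypothetical."""
    return avg_power_watts * duration_s

# A lower-frequency P-state draws less power but can stretch the
# checkpoint enough that total energy goes up:
high_freq = checkpoint_energy(220.0, 30.0)  # faster checkpoint, more power
low_freq = checkpoint_energy(150.0, 50.0)   # slower checkpoint, less power
```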
"Checkpoint and Restart: An Energy Consumption Characterization in Clusters", arXiv:2409.02214, published 2024-09-03.
Large Language Models (LLMs) have revolutionized natural language processing by achieving state-of-the-art results across a variety of tasks. However, the computational demands of LLM inference, including high memory consumption and slow processing speeds, pose significant challenges for real-world applications, particularly on resource-constrained devices. Efficient inference is crucial for scaling the deployment of LLMs to a broader range of platforms, including mobile and edge devices. This survey explores contemporary techniques in model compression that address these challenges by reducing the size and computational requirements of LLMs while maintaining their performance. We focus on model-level compression methods, including quantization, knowledge distillation, and pruning, as well as system-level optimizations such as efficient KV-cache design. Each of these methodologies offers a unique approach to optimizing LLMs, from reducing numerical precision to transferring knowledge between models and structurally simplifying neural networks. Additionally, we discuss emerging trends in system-level design that further enhance the efficiency of LLM inference. This survey aims to provide a comprehensive overview of current advancements in model compression and their potential to make LLMs more accessible and practical for diverse applications.
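As a concrete example of one of the surveyed model-level methods, symmetric per-tensor int8 quantization can be sketched as follows. This is a minimal pure-Python illustration of the technique, not any specific library's API:

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: map floats into [-127, 127]
    using a single scale derived from the largest magnitude, so each weight
    is stored in 8 bits instead of 32."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights; error is at most ~scale/2."""
    return [v * scale for v in q]
```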
"Contemporary Model Compression on Large Language Models Inference" by Dong Liu, arXiv:2409.01990, published 2024-09-03.
Niousha Nazemi, Omid Tavallaie, Shuaijun Chen, Anna Maria Mandalario, Kanchana Thilakarathna, Ralph Holz, Hamed Haddadi, Albert Y. Zomaya
Federated Learning (FL) is a promising distributed learning framework designed for privacy-aware applications. FL trains models on client devices without sharing the clients' data and generates a global model on a server by aggregating model updates. Traditional FL approaches risk exposing sensitive client data when plain model updates are transmitted to the server, making them vulnerable to security threats such as model inversion attacks, where the server can infer a client's original training data by monitoring the changes of the trained model across rounds. Google's Secure Aggregation (SecAgg) protocol addresses this threat by employing a double-masking technique, secret sharing, and cryptographic computations in honest-but-curious and adversarial scenarios with client dropouts. However, in scenarios without an active adversary, the computational and communication cost of SecAgg increases significantly as the number of clients grows. To address this issue, we propose ACCESS-FL, a communication- and computation-efficient secure aggregation method designed for honest-but-curious scenarios in stable FL networks with a limited rate of client dropout. ACCESS-FL reduces the computation/communication cost to a constant level (independent of the network size) by generating shared secrets between only two clients and eliminating the need for double masking, secret sharing, and cryptographic computations. We evaluate ACCESS-FL with experiments on the MNIST, FMNIST, and CIFAR datasets. The results demonstrate that our method significantly reduces computation and communication overhead compared to the state-of-the-art methods SecAgg and SecAgg+.
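The pairwise-secret idea can be illustrated with a toy masking scheme in which a pair of clients shares one seed and the resulting masks cancel in the server's sum. This is a sketch of the general cancellation principle, not the ACCESS-FL protocol itself; the modulus, seeds, and function names are illustrative:

```python
import random

P = 2**31 - 1  # illustrative prime modulus for the update field

def masked_update(update, peer_seeds, client_id):
    """Add/subtract a pseudorandom mask per peer, derived from a shared
    seed. The lower-id client of each pair adds the mask and the higher-id
    client subtracts it, so every mask cancels in the server-side sum and
    the server only learns the aggregate."""
    masked = update % P
    for peer_id, seed in peer_seeds.items():
        mask = random.Random(seed).randrange(P)
        if client_id < peer_id:
            masked = (masked + mask) % P
        else:
            masked = (masked - mask) % P
    return masked

# Two clients sharing seed 1234: individually masked, jointly recoverable.
m1 = masked_update(42, {2: 1234}, client_id=1)
m2 = masked_update(58, {1: 1234}, client_id=2)
```

Summing `m1 + m2` modulo `P` recovers `42 + 58 = 100` even though neither masked value reveals its plain update.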
"ACCESS-FL: Agile Communication and Computation for Efficient Secure Aggregation in Stable Federated Learning Networks", arXiv:2409.01722, published 2024-09-03.
Quentin Kniep, Maxime Laval, Jakub Sliwinski, Roger Wattenhofer
This work examines the resilience properties of the Snowball and Avalanche protocols that underlie the popular Avalanche blockchain. We experimentally quantify the resilience of Snowball using a simulation implemented in Rust, where the adversary strategically rebalances the network to delay termination. We show that in a network of $n$ nodes of equal stake, the adversary is able to break liveness when controlling $\Omega(\sqrt{n})$ nodes. Specifically, for $n = 2000$, a simple adversary controlling 5.2% of the stake can successfully attack liveness. When the adversary is given additional information about the state of the network (without any communication or other advantages), the stake needed for a successful attack is as little as 2.8%. We show that the adversary can break safety in time exponentially dependent on its stake and inversely linearly related to the size of the network, e.g. in 265 rounds in expectation when the adversary controls 25% of the stake in a network of 3000 nodes. We conclude that Snowball and Avalanche are akin to Byzantine reliable broadcast protocols rather than consensus protocols.
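For readers unfamiliar with the protocol family, a simplified single-node Snowball loop looks roughly like the following. Here `k`, `alpha`, and `beta` are the usual sample size, quorum, and confidence thresholds; the sketch omits the networking and stake weighting of the real protocol, and `sample_peer` is a caller-supplied stand-in for querying a random peer:

```python
from collections import Counter

def snowball(my_color, sample_peer, k=10, alpha=7, beta=15):
    """Simplified Snowball: repeatedly sample k peers; if at least alpha
    agree on a color, bump that color's confidence and possibly switch
    preference; finalize after beta consecutive agreeing rounds."""
    confidence = Counter()
    preference = my_color
    last = None
    streak = 0
    while streak < beta:
        votes = Counter(sample_peer() for _ in range(k))
        color, n = votes.most_common(1)[0]
        if n >= alpha:
            confidence[color] += 1
            if confidence[color] > confidence[preference]:
                preference = color
            if color == last:
                streak += 1
            else:
                last, streak = color, 1
        else:
            streak = 0  # no quorum this round; reset the streak
    return preference
```

The adversarial strategy studied in the paper works by keeping the network balanced so that the `streak < beta` condition keeps failing, delaying finalization.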
"Quantifying Liveness and Safety of Avalanche's Snowball", arXiv:2409.02217, published 2024-09-03.
Guanzhou Hu, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau
We present a summary of non-transactional consistency levels in the context of distributed data replication protocols. The levels are built upon a practical object pool model and are defined in a unified framework centered around the concept of ordering. We show that each consistency level can be intuitively defined by specifying two types of constraints that determine the validity of the orderings allowed by the level: convergence, which bounds the lineage shape of the ordering, and relationship, which bounds the relative positions of operations in the ordering. We give examples of representative protocols and systems that implement each consistency level. Furthermore, we discuss the availability upper bound of the presented consistency levels.
"A Unified, Practical, and Understandable Summary of Non-transactional Consistency Levels in Distributed Replication", arXiv:2409.01576, published 2024-09-03.
This work introduces ECOLIFE, the first carbon-aware serverless function scheduler to co-optimize carbon footprint and performance. ECOLIFE builds on the key insight that intelligently exploiting multi-generation hardware can achieve high performance with a lower carbon footprint. It introduces multiple novel extensions to Particle Swarm Optimization (PSO) in the context of serverless execution environments to achieve high performance while effectively reducing the carbon footprint.
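For reference, the base PSO loop that ECOLIFE extends can be sketched as follows; ECOLIFE's serverless-specific extensions are not reproduced here, and the hyperparameters are conventional defaults rather than the paper's:

```python
import random

def pso_minimize(f, dim, n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5):
    """Plain Particle Swarm Optimization on a continuous objective f.
    Each particle's velocity blends inertia (w), attraction to its own
    best position (c1), and attraction to the swarm's best (c2)."""
    pos = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val
```

ECOLIFE applies this kind of search to the joint space of hardware generation and keep-alive decisions, where each candidate position scores a carbon/performance trade-off instead of a test function.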
"EcoLife: Carbon-Aware Serverless Function Scheduling for Sustainable Computing" by Yankai Jiang, Rohan Basu Roy, Baolin Li, and Devesh Tiwari, arXiv:2409.02085, published 2024-09-03.
Jiajie Li, Jan-Niklas Schmelzle, Yixiao Du, Simon Heumos, Andrea Guarracino, Giulia Guidi, Pjotr Prins, Erik Garrison, Zhiru Zhang
Computational Pangenomics is an emerging field that studies genetic variation using a graph structure encompassing multiple genomes. Visualizing pangenome graphs is vital for understanding genome diversity. Yet, handling large graphs can be challenging due to the high computational demands of the graph layout process. In this work, we conduct a thorough performance characterization of a state-of-the-art pangenome graph layout algorithm, revealing significant data-level parallelism, which makes GPUs a promising option for compute acceleration. However, irregular data access and the algorithm's memory-bound nature present significant hurdles. To overcome these challenges, we develop a solution implementing three key optimizations: a cache-friendly data layout, coalesced random states, and warp merging. Additionally, we propose a quantitative metric for scalable evaluation of pangenome layout quality. Evaluated on 24 human whole-chromosome pangenomes, our GPU-based solution achieves a 57.3x speedup over the state-of-the-art multithreaded CPU baseline without layout quality loss, reducing execution time from hours to minutes.
"Rapid GPU-Based Pangenome Graph Layout", arXiv:2409.00876, published 2024-09-02.