Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer (arXiv:2408.16978, 2024-08-30)
Jinghan Yao, Sam Ade Jacobs, Masahiro Tanaka, Olatunji Ruwase, Aamir Shafi, Hari Subramoni, Dhabaleswar K. Panda
Large Language Models (LLMs) with long-context capabilities are integral to complex tasks in natural language processing and computational biology, such as text generation and protein sequence analysis. However, training LLMs directly on extremely long contexts demands considerable GPU resources and increased memory, leading to higher costs and greater complexity. Alternative approaches that introduce long-context capabilities via downstream finetuning or adaptations impose significant design limitations. In this paper, we propose the Fully Pipelined Distributed Transformer (FPDT) for training long-context LLMs with extreme hardware efficiency. For GPT and Llama models, FPDT achieves a 16x increase in the sequence length that can be trained on the same hardware compared to current state-of-the-art solutions. With our dedicated sequence-chunk pipeline design, an 8B LLM can now be trained with a 2-million-token sequence length on only 4 GPUs, while also maintaining over 55% MFU. FPDT is agnostic to existing training techniques and is proven to work efficiently across different LLM models.
{"title":"Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer","authors":"Jinghan Yao, Sam Ade Jacobs, Masahiro Tanaka, Olatunji Ruwase, Aamir Shafi, Hari Subramoni, Dhabaleswar K. Panda","doi":"arxiv-2408.16978","DOIUrl":"https://doi.org/arxiv-2408.16978","url":null,"abstract":"Large Language Models (LLMs) with long context capabilities are integral to\u0000complex tasks in natural language processing and computational biology, such as\u0000text generation and protein sequence analysis. However, training LLMs directly\u0000on extremely long contexts demands considerable GPU resources and increased\u0000memory, leading to higher costs and greater complexity. Alternative approaches\u0000that introduce long context capabilities via downstream finetuning or\u0000adaptations impose significant design limitations. In this paper, we propose\u0000Fully Pipelined Distributed Transformer (FPDT) for efficiently training\u0000long-context LLMs with extreme hardware efficiency. For GPT and Llama models,\u0000we achieve a 16x increase in sequence length that can be trained on the same\u0000hardware compared to current state-of-the-art solutions. With our dedicated\u0000sequence chunk pipeline design, we can now train 8B LLM with 2 million sequence\u0000length on only 4 GPUs, while also maintaining over 55% of MFU. Our proposed\u0000FPDT is agnostic to existing training techniques and is proven to work\u0000efficiently across different LLM models.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"14 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Benchmarking the Performance of Large Language Models on the Cerebras Wafer Scale Engine (arXiv:2409.00287, 2024-08-30)
Zuoning Zhang, Dhruv Parikh, Youning Zhang, Viktor Prasanna
Transformer-based Large Language Models (LLMs) have recently reached state-of-the-art performance in the Natural Language Processing (NLP) and Computer Vision (CV) domains. LLMs use the Multi-Headed Self-Attention (MHSA) mechanism to capture long-range global attention relationships among input words or image patches, drastically improving performance over prior deep learning approaches. In this paper, we evaluate the performance of LLMs on the Cerebras Wafer Scale Engine (WSE), a high-performance computing system with 2.6 trillion transistors, 850,000 cores, and 40 GB of on-chip memory. The WSE's Sparse Linear Algebra Compute (SLAC) cores eliminate multiply-by-zero operations, and its 40 GB of on-chip memory is uniformly distributed among the SLAC cores, enabling fast local access to model parameters. Moreover, Cerebras software configures routing between cores at runtime, optimizing communication overhead among cores. As LLMs become more widely used, new hardware architectures are needed to accelerate their training and inference. We benchmark the effectiveness of this hardware architecture at accelerating LLM training and inference. Additionally, we analyze whether the Cerebras WSE can scale the memory wall associated with traditionally memory-bound compute tasks using its 20 PB/s high-bandwidth memory. Furthermore, we examine the performance scalability of the Cerebras WSE through a roofline model: by plotting performance metrics against computational intensity, we assess its effectiveness at handling highly compute-intensive LLM training and inference tasks.
{"title":"Benchmarking the Performance of Large Language Models on the Cerebras Wafer Scale Engine","authors":"Zuoning Zhang, Dhruv Parikh, Youning Zhang, Viktor Prasanna","doi":"arxiv-2409.00287","DOIUrl":"https://doi.org/arxiv-2409.00287","url":null,"abstract":"Transformer based Large Language Models (LLMs) have recently reached state of\u0000the art performance in Natural Language Processing (NLP) and Computer Vision\u0000(CV) domains. LLMs use the Multi-Headed Self-Attention (MHSA) mechanism to\u0000capture long-range global attention relationships among input words or image\u0000patches, drastically improving its performance over prior deep learning\u0000approaches. In this paper, we evaluate the performance of LLMs on the Cerebras\u0000Wafer Scale Engine (WSE). Cerebras WSE is a high performance computing system\u0000with 2.6 trillion transistors, 850,000 cores and 40 GB on-chip memory. Cerebras\u0000WSE's Sparse Linear Algebra Compute (SLAC) cores eliminates multiply-by-zeros\u0000operations and its 40 GB of on-chip memory is uniformly distributed among SLAC\u0000cores, enabling fast local access to model parameters. Moreover, Cerebras\u0000software configures routing between cores at runtime, optimizing communication\u0000overhead among cores. As LLMs are becoming more commonly used, new hardware\u0000architectures are needed to accelerate LLMs training and inference. We\u0000benchmark the effectiveness of this hardware architecture at accelerating LLMs\u0000training and inference. Additionally, we analyze if Cerebras WSE can scale the\u0000memory-wall associated with traditionally memory-bound compute tasks using its\u000020 PB/s high bandwidth memory. Furthermore, we examine the performance\u0000scalability of Cerebras WSE through a roofline model. By plotting performance\u0000metrics against computational intensity, we aim to assess their effectiveness\u0000at handling high compute-intensive LLMs training and inference tasks.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Monadring: A lightweight consensus protocol to offer Validation-as-a-Service to AVS nodes (arXiv:2408.16094, 2024-08-28)
Yu Zhang, Xiao Yan, Gang Tang, Helena Wang
Existing blockchain networks are often large-scale, requiring transactions to be synchronized across the entire network to reach consensus. On-chain computations can be prohibitively expensive, making many CPU-intensive computations infeasible. Inspired by the structure of IBM's Token Ring networks, we propose a lightweight consensus protocol called Monadring to address these issues. Monadring allows nodes within a large blockchain network to form smaller subnetworks, enabling faster and more cost-effective computations while maintaining the security guarantees of the main blockchain network. To further enhance Monadring's security, we introduce a node rotation mechanism based on Verifiable Random Functions (VRF) and blind voting using Fully Homomorphic Encryption (FHE) within the smaller subnetwork. Unlike the common voting-based election of validator nodes, Monadring leverages FHE to conceal voting information, eliminating the advantage of the last mover in the voting process. This paper details the design and implementation of the Monadring protocol and evaluates its performance and feasibility through simulation experiments. Our research contributes to enhancing the practical utility of blockchain technology in large-scale application scenarios.
{"title":"Monadring: A lightweight consensus protocol to offer Validation-as-a-Service to AVS nodes","authors":"Yu Zhang, Xiao Yan, Gang Tang, Helena Wang","doi":"arxiv-2408.16094","DOIUrl":"https://doi.org/arxiv-2408.16094","url":null,"abstract":"Existing blockchain networks are often large-scale, requiring transactions to\u0000be synchronized across the entire network to reach consensus. On-chain\u0000computations can be prohibitively expensive, making many CPU-intensive\u0000computations infeasible. Inspired by the structure of IBM's token ring\u0000networks, we propose a lightweight consensus protocol called Monadring to\u0000address these issues. Monadring allows nodes within a large blockchain network\u0000to form smaller subnetworks, enabling faster and more cost-effective\u0000computations while maintaining the security guarantees of the main blockchain\u0000network. To further enhance Monadring's security, we introduce a node rotation\u0000mechanism based on Verifiable Random Function (VRF) and blind voting using\u0000Fully Homomorphic Encryption (FHE) within the smaller subnetwork. Unlike the\u0000common voting-based election of validator nodes, Monadring leverages FHE to\u0000conceal voting information, eliminating the advantage of the last mover in the\u0000voting process. This paper details the design and implementation of the Monadring protocol\u0000and evaluates its performance and feasibility through simulation experiments.\u0000Our research contributes to enhancing the practical utility of blockchain\u0000technology in large-scale application scenarios.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"28 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

LLMSecCode: Evaluating Large Language Models for Secure Coding (arXiv:2408.16100, 2024-08-28)
Anton Rydén, Erik Näslund, Elad Michael Schiller, Magnus Almgren
The rapid deployment of Large Language Models (LLMs) requires careful consideration of their effect on cybersecurity. Our work aims to improve the selection of LLMs suitable for facilitating Secure Coding (SC). This raises challenging research questions: (RQ1) Which functionality can streamline the LLM evaluation? (RQ2) What should the evaluation measure? (RQ3) How can we attest that the evaluation process is impartial? To address these questions, we introduce LLMSecCode, an open-source evaluation framework designed to assess LLM SC capabilities objectively. We validate the LLMSecCode implementation through experiments: varying parameters and prompts yields a 10% and 9% difference in performance, respectively, and comparisons against results from reliable external actors show a 5% difference. We strive to ensure the ease of use of our open-source framework and encourage further development by external actors. With LLMSecCode, we hope to encourage the standardization and benchmarking of LLMs' capabilities in security-oriented code and tasks.
{"title":"LLMSecCode: Evaluating Large Language Models for Secure Coding","authors":"Anton Rydén, Erik Näslund, Elad Michael Schiller, Magnus Almgren","doi":"arxiv-2408.16100","DOIUrl":"https://doi.org/arxiv-2408.16100","url":null,"abstract":"The rapid deployment of Large Language Models (LLMs) requires careful\u0000consideration of their effect on cybersecurity. Our work aims to improve the\u0000selection process of LLMs that are suitable for facilitating Secure Coding\u0000(SC). This raises challenging research questions, such as (RQ1) Which\u0000functionality can streamline the LLM evaluation? (RQ2) What should the\u0000evaluation measure? (RQ3) How to attest that the evaluation process is\u0000impartial? To address these questions, we introduce LLMSecCode, an open-source\u0000evaluation framework designed to assess LLM SC capabilities objectively. We validate the LLMSecCode implementation through experiments. When varying\u0000parameters and prompts, we find a 10% and 9% difference in performance,\u0000respectively. We also compare some results to reliable external actors, where\u0000our results show a 5% difference. We strive to ensure the ease of use of our open-source framework and\u0000encourage further development by external actors. With LLMSecCode, we hope to\u0000encourage the standardization and benchmarking of LLMs' capabilities in\u0000security-oriented code and tasks.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"58 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Decentralized LLM Inference over Edge Networks with Energy Harvesting (arXiv:2408.15907, 2024-08-28)
Aria Khoshsirat, Giovanni Perin, Michele Rossi
Large language models have significantly transformed multiple fields with their exceptional performance in natural language tasks, but their deployment in resource-constrained environments like edge networks presents an ongoing challenge. Decentralized techniques for inference have emerged, distributing the model blocks among multiple devices to improve flexibility and cost effectiveness. However, energy limitations remain a significant concern for edge devices. We propose a sustainable model for collaborative inference on interconnected, battery-powered edge devices with energy harvesting. A semi-Markov model is developed to describe the states of the devices, considering processing parameters and average green energy arrivals. This informs the design of scheduling algorithms that aim to minimize device downtimes and maximize network throughput. Through empirical evaluations and simulated runs, we validate the effectiveness of our approach, paving the way for energy-efficient decentralized inference over edge networks.
{"title":"Decentralized LLM Inference over Edge Networks with Energy Harvesting","authors":"Aria Khoshsirat, Giovanni Perin, Michele Rossi","doi":"arxiv-2408.15907","DOIUrl":"https://doi.org/arxiv-2408.15907","url":null,"abstract":"Large language models have significantly transformed multiple fields with\u0000their exceptional performance in natural language tasks, but their deployment\u0000in resource-constrained environments like edge networks presents an ongoing\u0000challenge. Decentralized techniques for inference have emerged, distributing\u0000the model blocks among multiple devices to improve flexibility and cost\u0000effectiveness. However, energy limitations remain a significant concern for\u0000edge devices. We propose a sustainable model for collaborative inference on\u0000interconnected, battery-powered edge devices with energy harvesting. A\u0000semi-Markov model is developed to describe the states of the devices,\u0000considering processing parameters and average green energy arrivals. This\u0000informs the design of scheduling algorithms that aim to minimize device\u0000downtimes and maximize network throughput. Through empirical evaluations and\u0000simulated runs, we validate the effectiveness of our approach, paving the way\u0000for energy-efficient decentralized inference over edge networks.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Towards cloud-native scientific workflow management (arXiv:2408.15445, 2024-08-27)
Michal Orzechowski, Bartosz Balis, Krzysztof Janecki
Cloud-native is an approach to building and running scalable applications in modern cloud infrastructures, with the Kubernetes container orchestration platform often considered a fundamental cloud-native building block. In this paper, we evaluate alternative execution models for scientific workflows in Kubernetes. We compare the simplest job-based model and its variant with task clustering, and we propose a cloud-native model based on microservices comprising auto-scalable worker pools. We implement the proposed models in the HyperFlow workflow management system and evaluate them using a large Montage workflow on a Kubernetes cluster. The results indicate that the proposed cloud-native worker-pools execution model achieves the best performance in terms of average cluster utilization, resulting in a nearly 20% improvement in workflow makespan compared to the best-performing job-based model. However, this better performance comes at the cost of significantly higher implementation and maintenance complexity. We believe that our experiments provide valuable insight into the performance, advantages, and disadvantages of alternative cloud-native execution models for scientific workflows.
{"title":"Towards cloud-native scientific workflow management","authors":"Michal Orzechowski, Bartosz Balis, Krzysztof Janecki","doi":"arxiv-2408.15445","DOIUrl":"https://doi.org/arxiv-2408.15445","url":null,"abstract":"Cloud-native is an approach to building and running scalable applications in\u0000modern cloud infrastructures, with the Kubernetes container orchestration\u0000platform being often considered as a fundamental cloud-native building block.\u0000In this paper, we evaluate alternative execution models for scientific\u0000workflows in Kubernetes. We compare the simplest job-based model, its variant\u0000with task clustering, and finally we propose a cloud-native model based on\u0000microservices comprising auto-scalable worker-pools. We implement the proposed\u0000models in the HyperFlow workflow management system, and evaluate them using a\u0000large Montage workflow on a Kubernetes cluster. The results indicate that the\u0000proposed cloud-native worker-pools execution model achieves best performance in\u0000terms of average cluster utilization, resulting in a nearly 20% improvement of\u0000the workflow makespan compared to the best-performing job-based model. However,\u0000better performance comes at the cost of significantly higher complexity of the\u0000implementation and maintenance. We believe that our experiments provide a\u0000valuable insight into the performance, advantages and disadvantages of\u0000alternative cloud-native execution models for scientific workflows.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Bandwidth-Aware and Overlap-Weighted Compression for Communication-Efficient Federated Learning (arXiv:2408.14736, 2024-08-27)
Zichen Tang, Junlin Huang, Rudan Yan, Yuxin Wang, Zhenheng Tang, Shaohuai Shi, Amelie Chi Zhou, Xiaowen Chu
Current data compression methods, such as sparsification in Federated Averaging (FedAvg), effectively enhance the communication efficiency of Federated Learning (FL). However, these methods encounter challenges such as the straggler problem and diminished model performance due to heterogeneous bandwidth and non-IID (non-independently and identically distributed) data. To address these issues, we introduce a bandwidth-aware compression framework for FL, aimed at improving communication efficiency while mitigating the problems associated with non-IID data. First, our strategy dynamically adjusts compression ratios according to bandwidth, enabling clients to upload their models at a similar pace and thus exploiting otherwise wasted time to transmit more data. Second, we identify a non-overlap pattern among the parameters retained after compression, which diminishes client update signals when weights are uniformly averaged. Based on this finding, we propose a parameter mask that adjusts the client-averaging coefficients at the parameter level, thereby more closely approximating the original updates and improving training convergence in heterogeneous environments. Our evaluations reveal that our method significantly boosts model accuracy, with a maximum improvement of 13% over uncompressed FedAvg. Moreover, it achieves a 3.37x speedup in reaching the target accuracy compared to FedAvg with a Top-K compressor, demonstrating its effectiveness in accelerating convergence with compression. The integration of common compression techniques into our framework further establishes its potential as a versatile foundation for future cross-device, communication-efficient FL research, addressing critical challenges in FL and advancing the field of distributed machine learning.
{"title":"Bandwidth-Aware and Overlap-Weighted Compression for Communication-Efficient Federated Learning","authors":"Zichen Tang, Junlin Huang, Rudan Yan, Yuxin Wang, Zhenheng Tang, Shaohuai Shi, Amelie Chi Zhou, Xiaowen Chu","doi":"arxiv-2408.14736","DOIUrl":"https://doi.org/arxiv-2408.14736","url":null,"abstract":"Current data compression methods, such as sparsification in Federated\u0000Averaging (FedAvg), effectively enhance the communication efficiency of\u0000Federated Learning (FL). However, these methods encounter challenges such as\u0000the straggler problem and diminished model performance due to heterogeneous\u0000bandwidth and non-IID (Independently and Identically Distributed) data. To\u0000address these issues, we introduce a bandwidth-aware compression framework for\u0000FL, aimed at improving communication efficiency while mitigating the problems\u0000associated with non-IID data. First, our strategy dynamically adjusts\u0000compression ratios according to bandwidth, enabling clients to upload their\u0000models at a close pace, thus exploiting the otherwise wasted time to transmit\u0000more data. Second, we identify the non-overlapped pattern of retained\u0000parameters after compression, which results in diminished client update signals\u0000due to uniformly averaged weights. Based on this finding, we propose a\u0000parameter mask to adjust the client-averaging coefficients at the parameter\u0000level, thereby more closely approximating the original updates, and improving\u0000the training convergence under heterogeneous environments. Our evaluations\u0000reveal that our method significantly boosts model accuracy, with a maximum\u0000improvement of 13% over the uncompressed FedAvg. Moreover, it achieves a\u0000$3.37times$ speedup in reaching the target accuracy compared to FedAvg with a\u0000Top-K compressor, demonstrating its effectiveness in accelerating convergence\u0000with compression. The integration of common compression techniques into our\u0000framework further establishes its potential as a versatile foundation for\u0000future cross-device, communication-efficient FL research, addressing critical\u0000challenges in FL and advancing the field of distributed machine learning.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"25 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Faster Cycle Detection in the Congested Clique (arXiv:2408.15132, 2024-08-27)
Keren Censor-Hillel, Tomer Even, Virginia Vassilevska Williams
We provide a fast distributed algorithm for detecting $h$-cycles in the Congested Clique model, whose running time decreases as the number of $h$-cycles in the graph increases. In undirected graphs, constant-round algorithms are known for cycles of even length; our algorithm greatly improves upon the state of the art for odd values of $h$. Moreover, our running time also applies to directed graphs, in which case the improvement holds for all values of $h$. Further, our techniques yield a triangle detection algorithm in the quantum variant of this model that is faster than prior work. A key technical contribution behind our fast cycle detection algorithm is a new algorithm for computing the products of many pairs of small matrices in parallel, which may be of independent interest.
{"title":"Faster Cycle Detection in the Congested Clique","authors":"Keren Censor-Hillel, Tomer Even, Virginia Vassilevska Williams","doi":"arxiv-2408.15132","DOIUrl":"https://doi.org/arxiv-2408.15132","url":null,"abstract":"We provide a fast distributed algorithm for detecting $h$-cycles in the\u0000textsf{Congested Clique} model, whose running time decreases as the number of\u0000$h$-cycles in the graph increases. In undirected graphs, constant-round\u0000algorithms are known for cycles of even length. Our algorithm greatly improves\u0000upon the state of the art for odd values of $h$. Moreover, our running time\u0000applies also to directed graphs, in which case the improvement is for all\u0000values of $h$. Further, our techniques allow us to obtain a triangle detection\u0000algorithm in the quantum variant of this model, which is faster than prior\u0000work. A key technical contribution we develop to obtain our fast cycle detection\u0000algorithm is a new algorithm for computing the product of many pairs of small\u0000matrices in parallel, which may be of independent interest.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211017","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Towards observability of scientific applications (arXiv:2408.15439, 2024-08-27)
Bartosz Balis, Konrad Czerepak, Albert Kuzma, Jan Meizner, Lukasz Wronski
As software systems increase in complexity, conventional monitoring methods struggle to provide a comprehensive overview or identify performance issues, often missing unexpected problems. Observability, however, offers a holistic approach, providing methods and tools that gather and analyze detailed telemetry data to uncover hidden issues. Originally developed for cloud-native systems, modern observability is less prevalent in scientific computing, particularly in HPC clusters, due to differences in application architecture, execution environments, and technology stacks. This paper proposes and evaluates an end-to-end observability solution tailored for scientific computing in HPC environments. We address several challenges, including collection of application-level metrics, instrumentation, context propagation, and tracing. We argue that typical dashboards with charts are not sufficient for advanced observability-driven analysis of scientific applications. Consequently, we propose a different approach based on data analysis using DataFrames and a Jupyter environment. The proposed solution is implemented and evaluated on two medical scientific pipelines running on an HPC cluster.
{"title":"Towards observability of scientific applications","authors":"Bartosz Balis, Konrad Czerepak, Albert Kuzma, Jan Meizner, Lukasz Wronski","doi":"arxiv-2408.15439","DOIUrl":"https://doi.org/arxiv-2408.15439","url":null,"abstract":"As software systems increase in complexity, conventional monitoring methods\u0000struggle to provide a comprehensive overview or identify performance issues,\u0000often missing unexpected problems. Observability, however, offers a holistic\u0000approach, providing methods and tools that gather and analyze detailed\u0000telemetry data to uncover hidden issues. Originally developed for cloud-native\u0000systems, modern observability is less prevalent in scientific computing,\u0000particularly in HPC clusters, due to differences in application architecture,\u0000execution environments, and technology stacks. This paper proposes and\u0000evaluates an end-to-end observability solution tailored for scientific\u0000computing in HPC environments. We address several challenges, including\u0000collection of application-level metrics, instrumentation, context propagation,\u0000and tracing. We argue that typical dashboards with charts are not sufficient\u0000for advanced observability-driven analysis of scientific applications.\u0000Consequently, we propose a different approach based on data analysis using\u0000DataFrames and a Jupyter environment. The proposed solution is implemented and\u0000evaluated on two medical scientific pipelines running on an HPC cluster.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"177 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Partition Detection in Byzantine Networks (arXiv:2408.14814, 2024-08-27)
Yérom-David Bromberg (IRISA, UR), Jérémie Decouchant (TU Delft), Manon Sourisseau (IRISA, UR), François Taïani (IRISA, UR)
Detecting and handling network partitions is a fundamental requirement of distributed systems. Although existing partition detection methods in arbitrary graphs tolerate unreliable networks, they either assume that all nodes are correct or that a limited number of nodes might crash. In particular, Byzantine behaviors are out of the scope of these algorithms, despite Byzantine fault tolerance being an active research topic for important problems such as consensus. Moreover, Byzantine-tolerant protocols, such as broadcast or consensus, always rely on the assumption of connected networks. This paper addresses the problem of detecting partitions in Byzantine networks (without connectivity assumptions). We present a novel algorithm, which we call NECTAR, that safely detects partitioned and possibly partitionable networks, and we prove its correctness. NECTAR allows all correct nodes to detect whether a network could suffer from Byzantine nodes. We evaluate NECTAR's performance and compare it to two existing baselines using up to 100 nodes running real code on various realistic topologies. Our results confirm that NECTAR maintains 100% accuracy, while the accuracy of the existing baselines decreases by at least 40% as soon as one participant is Byzantine. Although NECTAR's network cost increases with the number of nodes and decreases with the network's diameter, it remains below roughly 500 KB even in the worst cases.
{"title":"Partition Detection in Byzantine Networks","authors":"Yérom-David BrombergIRISA, UR, Jérémie DecouchantTU Delft, Manon SourisseauIRISA, UR, François TaïaniIRISA, UR","doi":"arxiv-2408.14814","DOIUrl":"https://doi.org/arxiv-2408.14814","url":null,"abstract":"Detecting and handling network partitions is a fundamental requirement of\u0000distributed systems. Although existing partition detection methods in arbitrary\u0000graphs tolerate unreliable networks, they either assume that all nodes are\u0000correct or that a limited number of nodes might crash. In particular, Byzantine\u0000behaviors are out of the scope of these algorithms despite Byzantine fault\u0000tolerance being an active research topic for important problems such as\u0000consensus. Moreover, Byzantinetolerant protocols, such as broadcast or\u0000consensus, always rely on the assumption of connected networks. This paper\u0000addresses the problem of detecting partition in Byzantine networks (without\u0000connectivity assumption). We present a novel algorithm, which we call NECTAR,\u0000that safely detects partitioned and possibly partitionable networks and prove\u0000its correctness. NECTAR allows all correct nodes to detect whether a network\u0000could suffer from Byzantine nodes. We evaluate NECTAR's performance and compare\u0000it to two existing baselines using up to 100 nodes running real code, on\u0000various realistic topologies. Our results confirm that NECTAR maintains a 100%\u0000accuracy while the accuracy of the various existing baselines decreases by at\u0000least 40% as soon as one participant is Byzantine. Although NECTAR's network\u0000cost increases with the number of nodes and decreases with the network's\u0000diameter, it does not go above around 500KB in the worst cases.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"71 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211037","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}