"Analysis of the Performance of the Matrix Multiplication Algorithm on the Cirrus Supercomputer" by Temitayo Adefemi. arXiv:2408.15384 (2024-08-27).

Matrix multiplication is integral to various scientific and engineering disciplines, including machine learning, image processing, and gaming. With the increasing data volumes in areas like machine learning, the demand for efficient parallel processing of large matrices has grown significantly. This study explores the performance of both serial and parallel matrix multiplication on the Cirrus supercomputer at the University of Edinburgh. The results demonstrate the scalability and efficiency of these methods, providing insights for optimizing matrix multiplication in real-world applications.
"A parallel particle cluster algorithm using nearest neighbour graphs and passive target communication" by Matthias Frey, Steven Böing, Rui F. G. Apóstolo. arXiv:2408.15348 (2024-08-27).

We present a parallel cluster algorithm for $N$-body simulations which uses a nearest neighbour search algorithm and one-sided message passing interface (MPI) communication. The nearest neighbour is defined by the Euclidean distance in three-dimensional space. The resulting directed nearest neighbour graphs that are used to define the clusters are split up in an iterative procedure with MPI remote memory access (RMA) communication. The method has been implemented as part of the elliptical parcel-in-cell (EPIC) method targeting geophysical fluid flows. The parallel scalability of the algorithm is discussed by means of an artificial and a standard fluid dynamics test case. The cluster algorithm shows good weak and strong scalability up to 16,384 cores with a parallel weak scaling efficiency of about 80% for balanced workloads. In poorly balanced problems, MPI synchronisation dominates execution of the cluster algorithm and thus drastically worsens its parallel scalability.
"A sparsity-aware distributed-memory algorithm for sparse-sparse matrix multiplication" by Yuxi Hong, Aydin Buluc. arXiv:2408.14558 (2024-08-26).

Multiplying two sparse matrices (SpGEMM) is a common computational primitive used in many areas including graph algorithms, bioinformatics, algebraic multigrid solvers, and randomized sketching. Distributed-memory parallel algorithms for SpGEMM have mainly focused on sparsity-oblivious approaches that use 2D and 3D partitioning. Sparsity-aware 1D algorithms can theoretically reduce communication by not fetching nonzeros of the sparse matrices that do not participate in the multiplication. Here, we present a distributed-memory 1D SpGEMM algorithm and implementation. It uses MPI RDMA operations to mitigate the cost of packing/unpacking submatrices for communication, and it uses a block fetching strategy to avoid excessive fine-grained messaging. Our results show that our 1D implementation outperforms state-of-the-art 2D and 3D implementations within CombBLAS for many configurations, inputs, and use cases, while remaining conceptually simpler.
"Scalable, reproducible, and cost-effective processing of large-scale medical imaging datasets" by Michael E. Kim, Karthik Ramadass, Chenyu Gao, Praitayini Kanakaraj, Nancy R. Newlin, Gaurav Rudravaram, Kurt G. Schilling, Blake E. Dewey, Derek Archer, Timothy J. Hohman, Zhiyuan Li, Shunxing Bao, Bennett A. Landman, Nazirah Mohd Khairi. arXiv:2408.14611 (2024-08-26).

Curating, processing, and combining large-scale medical imaging datasets from national studies is a non-trivial task due to the intense computation and data throughput required, the variability of acquired data, and the associated financial overhead. Existing platforms or tools for large-scale data curation, processing, and storage have difficulty achieving a viable cost-to-scale ratio of computation speed for research purposes, being either too slow or too expensive. Additionally, managing large-scale data processing consistently in a team-driven manner is itself a non-trivial task. We design a BIDS-compliant method for an efficient and robust data processing pipeline for large-scale diffusion-weighted and T1-weighted MRI data compatible with low-cost, high-efficiency computing systems. Our method accomplishes automated querying of data available for processing and runs processes in a consistent and reproducible manner with long-term stability, while using heterogeneous low-cost computational resources and storage systems for efficient processing and data transfer. We demonstrate how our organizational structure permits efficiency in a semi-automated data processing pipeline and show that our method is comparable in processing time to cloud-based computation while being almost 20 times more cost-effective. Our design allows for fast data throughput and low latency to reduce the time for data transfer between storage servers and computation servers, achieving an average of 0.60 Gb/s compared to 0.33 Gb/s when using cloud-based processing methods. The design of our workflow engine permits quick process running while maintaining flexibility to adapt to newly acquired data.
"Resource Efficient Asynchronous Federated Learning for Digital Twin Empowered IoT Network" by Shunfeng Chu, Jun Li, Jianxin Wang, Yiyang Ni, Kang Wei, Wen Chen, Shi Jin. arXiv:2408.14298 (2024-08-26).

As an emerging technology, digital twin (DT) can provide real-time status and dynamic topology mapping for Internet of Things (IoT) devices. However, DT and its implementation within industrial IoT networks necessitate substantial, distributed data support, which often leads to "data silos" and raises privacy concerns. To address these issues, we develop a dynamic resource scheduling algorithm tailored for an asynchronous federated learning (FL)-based, lightweight DT-empowered IoT network. Specifically, our approach aims to minimize a multi-objective function that encompasses both energy consumption and latency by optimizing IoT device selection and transmit power control, subject to FL model performance constraints. We utilize the Lyapunov method to decouple the formulated problem into a series of one-slot optimization problems and develop a two-stage optimization algorithm to obtain the optimal transmit power control and IoT device scheduling strategies. In the first stage, we derive closed-form solutions for the optimal transmit power on the IoT device side. In the second stage, since partial state information is unknown, e.g., the transmit power and computational frequency of each IoT device, the edge server employs a multi-armed bandit (MAB) framework to model the IoT device selection problem and utilizes an efficient online algorithm, namely the client utility-based upper confidence bound (CU-UCB), to address it. Numerical results validate our algorithm's superiority over benchmark schemes, and simulations demonstrate that our algorithm achieves faster training speeds on the Fashion-MNIST and CIFAR-10 datasets within the same training duration.
"Employing Artificial Intelligence to Steer Exascale Workflows with Colmena" by Logan Ward, J. Gregory Pauloski, Valerie Hayot-Sasson, Yadu Babuji, Alexander Brace, Ryan Chard, Kyle Chard, Rajeev Thakur, Ian Foster. arXiv:2408.14434 (2024-08-26).

Computational workflows are a common class of application on supercomputers, yet their loosely coupled and heterogeneous nature often keeps them from taking full advantage of a machine's capabilities. We created Colmena to leverage the massive parallelism of a supercomputer by using Artificial Intelligence (AI) to learn from and adapt a workflow as it executes. Colmena allows scientists to define how their application should respond to events (e.g., task completion) as a series of cooperative agents. In this paper, we describe the design of Colmena, the challenges we overcame while deploying applications on exascale systems, and the science workflows we have enhanced through interweaving AI. The scaling challenges we discuss include developing steering strategies that maximize node utilization, introducing data fabrics that reduce the communication overhead of data-intensive tasks, and implementing workflow tasks that cache costly operations between invocations. These innovations, coupled with a variety of application patterns accessible through our agent-based steering model, have enabled science advances in chemistry, biophysics, and materials science using different types of AI. Our vision is that Colmena will spur creative solutions that harness AI across many domains of scientific computing.
"Sparsity-Preserving Encodings for Straggler-Optimal Distributed Matrix Computations at the Edge" by Anindya Bijoy Das, Aditya Ramamoorthy, David J. Love, Christopher G. Brinton. arXiv:2408.05152 (2024-08-09).

Matrix computations are a fundamental building block of edge computing systems, with a major recent uptick in demand due to their use in AI/ML training and inference procedures. Existing approaches for distributing matrix computations involve allocating coded combinations of submatrices to worker nodes, to build resilience to slower nodes, called stragglers. In the edge learning context, however, these approaches will compromise sparsity properties that are often present in the original matrices found at the edge server. In this study, we consider the challenge of augmenting such approaches to preserve input sparsity when distributing the task across edge devices, thereby retaining the associated computational efficiency enhancements. First, we find a lower bound on the weight of the coding, i.e., the number of submatrices to be combined to obtain coded submatrices, that provides resilience to the maximum possible number of straggler devices (for a given number of devices and their storage constraints). Next, we propose distributed matrix computation schemes which meet this lower bound on the coding weight exactly. Numerical experiments conducted on Amazon Web Services (AWS) validate our assertions regarding straggler mitigation and computation speed for sparse matrices.
"Distributed Augmentation, Hypersweeps, and Branch Decomposition of Contour Trees for Scientific Exploration" by Mingzhe Li, Hamish Carr, Oliver Rübel, Bei Wang, Gunther H. Weber. arXiv:2408.04836 (2024-08-09).

Contour trees describe the topology of level sets in scalar fields and are widely used in topological data analysis and visualization. A main challenge of utilizing contour trees for large-scale scientific data is their computation at scale using high-performance computing. To address this challenge, recent work has introduced distributed hierarchical contour trees for distributed computation and storage of contour trees. However, effective use of these distributed structures in analysis and visualization requires subsequent computation of geometric properties and branch decomposition to support contour extraction and exploration. In this work, we introduce distributed algorithms for augmentation, hypersweeps, and branch decomposition that enable parallel computation of geometric properties, and support the use of distributed contour trees as query structures for scientific exploration. We evaluate the parallel performance of these algorithms and apply them to identify and extract important contours for scientific visualization.
"Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor" by Yiqi Liu, Yuqi Xue, Yu Cheng, Lingxiao Ma, Ziming Miao, Jilong Xue, Jian Huang. arXiv:2408.04808 (2024-08-09).

As AI chips incorporate numerous parallelized cores to scale deep learning (DL) computing, inter-core communication has recently been enabled by employing high-bandwidth, low-latency interconnect links on the chip (e.g., Graphcore IPU). This allows each core to directly access the fast scratchpad memory in other cores, which enables new parallel computing paradigms. However, without proper support for the scalable inter-core connections in current DL compilers, it is hard for developers to exploit the benefits of this new architecture. We present T10, the first DL compiler to exploit the inter-core communication bandwidth and distributed on-chip memory on AI chips. To formulate the computation and communication patterns of tensor operators in this new architecture, T10 introduces a distributed tensor abstraction, rTensor. T10 maps a DNN model to execution plans with a generalized compute-shift pattern, by partitioning DNN computation into sub-operators and mapping them to cores, so that the cores can exchange data following predictable patterns. T10 makes globally optimized trade-offs between on-chip memory consumption and inter-core communication overhead, selects the best execution plan from a vast optimization space, and avoids unnecessary inter-core communications. Our evaluation with a real inter-core connected AI chip, the Graphcore IPU, shows up to a 3.3x performance improvement, and scalability support for larger models, compared to state-of-the-art DL compilers and vendor libraries.
"Partial Experts Checkpoint: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training" by Weilin Cai, Le Qin, Jiayi Huang. arXiv:2408.04307 (2024-08-08).

As large language models continue to scale up, the imperative for fault tolerance in distributed deep learning systems intensifies, becoming a focal area of AI infrastructure research. Checkpointing has emerged as the predominant fault tolerance strategy, with extensive studies dedicated to optimizing its efficiency. However, the advent of the sparse Mixture-of-Experts (MoE) model presents new challenges for traditional checkpoint techniques due to the substantial increase in model size, despite computational demands comparable to dense models. Breaking new ground in the realm of efficient fault tolerance for MoE model training, we introduce a novel Partial Experts Checkpoint (PEC) mechanism alongside a corresponding PEC fault-tolerant system. Our approach strategically checkpoints a selected subset of experts, thereby significantly reducing the checkpoint size for MoE models to a level comparable with that of dense models. The empirical analysis on our 8-expert GPT-MoE model demonstrates that the proposed PEC approach yields a substantial 54.2% decrease in the size of the non-redundant checkpoint (no data-parallel duplication), without compromising the final model quality. Moreover, our PEC fault-tolerant system achieves a 76.9% reduction in checkpoint workload per data-parallel distributed rank, thereby correspondingly diminishing the checkpointing time and facilitating complete overlap with the training process.