"Analysis of the Performance of the Matrix Multiplication Algorithm on the Cirrus Supercomputer" by Temitayo Adefemi. arXiv:2408.15384 (2024-08-27).

Matrix multiplication is integral to various scientific and engineering disciplines, including machine learning, image processing, and gaming. With the increasing data volumes in areas like machine learning, the demand for efficient parallel processing of large matrices has grown significantly. This study explores the performance of both serial and parallel matrix multiplication on the Cirrus supercomputer at the University of Edinburgh. The results demonstrate the scalability and efficiency of these methods, providing insights for optimizing matrix multiplication in real-world applications.
"A parallel particle cluster algorithm using nearest neighbour graphs and passive target communication" by Matthias Frey, Steven Böing, Rui F. G. Apóstolo. arXiv:2408.15348 (2024-08-27).

We present a parallel cluster algorithm for $N$-body simulations which uses a nearest neighbour search algorithm and one-sided message passing interface (MPI) communication. The nearest neighbour is defined by the Euclidean distance in three-dimensional space. The resulting directed nearest neighbour graphs that are used to define the clusters are split up in an iterative procedure with MPI remote memory access (RMA) communication. The method has been implemented as part of the elliptical parcel-in-cell (EPIC) method targeting geophysical fluid flows. The parallel scalability of the algorithm is discussed by means of an artificial and a standard fluid dynamics test case. The cluster algorithm shows good weak and strong scalability up to 16,384 cores with a parallel weak scaling efficiency of about 80% for balanced workloads. In poorly balanced problems, MPI synchronisation dominates execution of the cluster algorithm and thus drastically worsens its parallel scalability.
"A sparsity-aware distributed-memory algorithm for sparse-sparse matrix multiplication" by Yuxi Hong, Aydin Buluc. arXiv:2408.14558 (2024-08-26).

Multiplying two sparse matrices (SpGEMM) is a common computational primitive used in many areas including graph algorithms, bioinformatics, algebraic multigrid solvers, and randomized sketching. Distributed-memory parallel algorithms for SpGEMM have mainly focused on sparsity-oblivious approaches that use 2D and 3D partitioning. Sparsity-aware 1D algorithms can theoretically reduce communication by not fetching nonzeros of the sparse matrices that do not participate in the multiplication. Here, we present a distributed-memory 1D SpGEMM algorithm and implementation. It uses MPI RDMA operations to mitigate the cost of packing/unpacking submatrices for communication, and it uses a block fetching strategy to avoid excessive fine-grained messaging. Our results show that our 1D implementation outperforms state-of-the-art 2D and 3D implementations within CombBLAS for many configurations, inputs, and use cases, while remaining conceptually simpler.
"Scalable, reproducible, and cost-effective processing of large-scale medical imaging datasets" by Michael E. Kim, Karthik Ramadass, Chenyu Gao, Praitayini Kanakaraj, Nancy R. Newlin, Gaurav Rudravaram, Kurt G. Schilling, Blake E. Dewey, Derek Archer, Timothy J. Hohman, Zhiyuan Li, Shunxing Bao, Bennett A. Landman, Nazirah Mohd Khairi. arXiv:2408.14611 (2024-08-26).

Curating, processing, and combining large-scale medical imaging datasets from national studies is a non-trivial task due to the intense computation and data throughput required, the variability of acquired data, and the associated financial overhead. Existing platforms or tools for large-scale data curation, processing, and storage have difficulty achieving a viable cost-to-scale ratio of computation speed for research purposes, being either too slow or too expensive. Additionally, managing large-scale data processing consistently in a team-driven manner is itself a non-trivial task. We design a BIDS-compliant method for an efficient and robust data processing pipeline for large-scale diffusion-weighted and T1-weighted MRI data compatible with low-cost, high-efficiency computing systems. Our method accomplishes automated querying of data available for processing and runs processes in a consistent and reproducible manner with long-term stability, while using heterogeneous low-cost computational resources and storage systems for efficient processing and data transfer. We demonstrate how our organizational structure permits efficiency in a semi-automated data processing pipeline and show that our method is comparable in processing time to cloud-based computation while being almost 20 times more cost-effective. Our design allows for fast data throughput and low latency to reduce the time for data transfer between storage servers and computation servers, achieving an average of 0.60 Gb/s compared to 0.33 Gb/s when using cloud-based processing methods. The design of our workflow engine permits quick process running while maintaining flexibility to adapt to newly acquired data.
"Resource Efficient Asynchronous Federated Learning for Digital Twin Empowered IoT Network" by Shunfeng Chu, Jun Li, Jianxin Wang, Yiyang Ni, Kang Wei, Wen Chen, Shi Jin. arXiv:2408.14298 (2024-08-26).

As an emerging technology, digital twin (DT) can provide real-time status and dynamic topology mapping for Internet of Things (IoT) devices. However, DT and its implementation within industrial IoT networks necessitate substantial, distributed data support, which often leads to "data silos" and raises privacy concerns. To address these issues, we develop a dynamic resource scheduling algorithm tailored for an asynchronous federated learning (FL)-based, lightweight DT-empowered IoT network. Specifically, our approach aims to minimize a multi-objective function that encompasses both energy consumption and latency by optimizing IoT device selection and transmit power control, subject to FL model performance constraints. We utilize the Lyapunov method to decouple the formulated problem into a series of one-slot optimization problems and develop a two-stage optimization algorithm to obtain the optimal transmit power control and IoT device scheduling strategies. In the first stage, we derive closed-form solutions for the optimal transmit power on the IoT device side. In the second stage, since partial state information is unknown, e.g., the transmit power and computational frequency of each IoT device, the edge server employs a multi-armed bandit (MAB) framework to model the IoT device selection problem and utilizes an efficient online algorithm, namely the client utility-based upper confidence bound (CU-UCB), to address it. Numerical results validate our algorithm's superiority over benchmark schemes, and simulations demonstrate that our algorithm achieves faster training speeds on the Fashion-MNIST and CIFAR-10 datasets within the same training duration.
"Employing Artificial Intelligence to Steer Exascale Workflows with Colmena" by Logan Ward, J. Gregory Pauloski, Valerie Hayot-Sasson, Yadu Babuji, Alexander Brace, Ryan Chard, Kyle Chard, Rajeev Thakur, Ian Foster. arXiv:2408.14434 (2024-08-26).

Computational workflows are a common class of application on supercomputers, yet their loosely coupled and heterogeneous nature often keeps them from taking full advantage of a machine's capabilities. We created Colmena to leverage the massive parallelism of a supercomputer by using Artificial Intelligence (AI) to learn from and adapt a workflow as it executes. Colmena allows scientists to define how their application should respond to events (e.g., task completion) as a series of cooperative agents. In this paper, we describe the design of Colmena, the challenges we overcame while deploying applications on exascale systems, and the science workflows we have enhanced through interweaving AI. The scaling challenges we discuss include developing steering strategies that maximize node utilization, introducing data fabrics that reduce the communication overhead of data-intensive tasks, and implementing workflow tasks that cache costly operations between invocations. These innovations, coupled with a variety of application patterns accessible through our agent-based steering model, have enabled science advances in chemistry, biophysics, and materials science using different types of AI. Our vision is that Colmena will spur creative solutions that harness AI across many domains of scientific computing.
"Sparsity-Preserving Encodings for Straggler-Optimal Distributed Matrix Computations at the Edge" by Anindya Bijoy Das, Aditya Ramamoorthy, David J. Love, Christopher G. Brinton. arXiv:2408.05152 (2024-08-09).

Matrix computations are a fundamental building block of edge computing systems, with a major recent uptick in demand due to their use in AI/ML training and inference procedures. Existing approaches for distributing matrix computations involve allocating coded combinations of submatrices to worker nodes, to build resilience to slower nodes, called stragglers. In the edge learning context, however, these approaches will compromise sparsity properties that are often present in the original matrices found at the edge server. In this study, we consider the challenge of augmenting such approaches to preserve input sparsity when distributing the task across edge devices, thereby retaining the associated computational efficiency enhancements. First, we find a lower bound on the weight of the coding, i.e., the number of submatrices to be combined to obtain coded submatrices, that provides resilience to the maximum possible number of straggler devices (for a given number of devices and their storage constraints). Next, we propose distributed matrix computation schemes which meet this lower bound on the coding weight exactly. Numerical experiments conducted on Amazon Web Services (AWS) validate our assertions regarding straggler mitigation and computation speed for sparse matrices.
"Distributed Augmentation, Hypersweeps, and Branch Decomposition of Contour Trees for Scientific Exploration" by Mingzhe Li, Hamish Carr, Oliver Rübel, Bei Wang, Gunther H. Weber. arXiv:2408.04836 (2024-08-09).

Contour trees describe the topology of level sets in scalar fields and are widely used in topological data analysis and visualization. A main challenge of utilizing contour trees for large-scale scientific data is their computation at scale using high-performance computing. To address this challenge, recent work has introduced distributed hierarchical contour trees for distributed computation and storage of contour trees. However, effective use of these distributed structures in analysis and visualization requires subsequent computation of geometric properties and branch decomposition to support contour extraction and exploration. In this work, we introduce distributed algorithms for augmentation, hypersweeps, and branch decomposition that enable parallel computation of geometric properties, and support the use of distributed contour trees as query structures for scientific exploration. We evaluate the parallel performance of these algorithms and apply them to identify and extract important contours for scientific visualization.
"Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor" by Yiqi Liu, Yuqi Xue, Yu Cheng, Lingxiao Ma, Ziming Miao, Jilong Xue, Jian Huang. arXiv:2408.04808 (2024-08-09).

As AI chips incorporate numerous parallelized cores to scale deep learning (DL) computing, inter-core communication has recently been enabled by employing high-bandwidth, low-latency interconnect links on the chip (e.g., Graphcore IPU). This allows each core to directly access the fast scratchpad memory in other cores, which enables new parallel computing paradigms. However, without proper support for the scalable inter-core connections in current DL compilers, it is hard for developers to exploit the benefits of this new architecture. We present T10, the first DL compiler to exploit the inter-core communication bandwidth and distributed on-chip memory on AI chips. To formulate the computation and communication patterns of tensor operators in this new architecture, T10 introduces a distributed tensor abstraction, rTensor. T10 maps a DNN model to execution plans with a generalized compute-shift pattern, by partitioning DNN computation into sub-operators and mapping them to cores, so that the cores can exchange data following predictable patterns. T10 makes globally optimized trade-offs between on-chip memory consumption and inter-core communication overhead, selects the best execution plan from a vast optimization space, and avoids unnecessary inter-core communications. Our evaluation with a real inter-core connected AI chip, the Graphcore IPU, shows up to a 3.3x performance improvement, and scalability support for larger models, compared to state-of-the-art DL compilers and vendor libraries.
"Partial Experts Checkpoint: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training" by Weilin Cai, Le Qin, Jiayi Huang. arXiv:2408.04307 (2024-08-08).

As large language models continue to scale up, the imperative for fault tolerance in distributed deep learning systems intensifies, becoming a focal area of AI infrastructure research. Checkpointing has emerged as the predominant fault tolerance strategy, with extensive studies dedicated to optimizing its efficiency. However, the advent of the sparse Mixture-of-Experts (MoE) model presents new challenges for traditional checkpoint techniques due to the substantial increase in model size, despite computational demands comparable to dense models. Breaking new ground in the realm of efficient fault tolerance for MoE model training, we introduce a novel Partial Experts Checkpoint (PEC) mechanism alongside a corresponding PEC fault-tolerant system. Our approach strategically checkpoints a selected subset of experts, thereby significantly reducing the checkpoint size for MoE models to a level comparable with that of dense models. The empirical analysis on our 8-expert GPT-MoE model demonstrates that the proposed PEC approach yields a substantial 54.2% decrease in the size of the non-redundant checkpoint (no data-parallel duplication), without compromising the final model quality. Moreover, our PEC fault-tolerant system achieves a 76.9% reduction in checkpoint workload per data-parallel distributed rank, thereby correspondingly diminishing the checkpointing time and facilitating complete overlap with the training process.