
Latest publications in IEEE Transactions on Parallel and Distributed Systems

HiHGNN: Accelerating HGNNs Through Parallelism and Data Reusability Exploitation
IF 5.3, CAS Tier 2 (Computer Science), Q1 Computer Science. Pub Date: 2024-04-30. DOI: 10.1109/TPDS.2024.3394841
Runzhen Xue;Dengke Han;Mingyu Yan;Mo Zou;Xiaocheng Yang;Duo Wang;Wenming Li;Zhimin Tang;John Kim;Xiaochun Ye;Dongrui Fan
Heterogeneous graph neural networks (HGNNs) have emerged as powerful algorithms for processing heterogeneous graphs (HetGs), widely used in many critical fields. To capture both structural and semantic information in HetGs, HGNNs first aggregate the neighboring feature vectors for each vertex in each semantic graph and then fuse the aggregated results across all semantic graphs for each vertex. Unfortunately, existing graph neural network accelerators are ill-suited to accelerating HGNNs, because they fail to efficiently tackle the specific execution patterns and to exploit the high-degree parallelism and data reusability inside and across the processing of semantic graphs in HGNNs. In this work, we first quantitatively characterize a set of representative HGNN models on GPU to disclose the execution bound of each stage, the inter-semantic-graph parallelism, and the inter-semantic-graph data reusability in HGNNs. Guided by our findings, we propose a high-performance HGNN accelerator, HiHGNN, to alleviate the execution bound and exploit the newfound parallelism and data reusability in HGNNs. Specifically, we first propose a bound-aware stage-fusion methodology tailored to HGNN acceleration, which fuses and pipelines the execution stages with awareness of their execution bounds. Second, we design an independency-aware parallel execution scheme to exploit the inter-semantic-graph parallelism. Finally, we present a similarity-aware execution scheduling to exploit the inter-semantic-graph data reusability. Compared to the state-of-the-art software framework running on the NVIDIA T4 and A100 GPUs, HiHGNN achieves an average speedup of 40.0× and 8.3×, respectively, and an energy reduction of 99.59% and 99.74%, with one-fifth the memory bandwidth of the A100 GPU.
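The two-stage computation described above (per-semantic-graph neighbor aggregation, then cross-semantic-graph fusion) can be illustrated with a minimal NumPy sketch; the toy adjacency lists, the mean aggregator, and the averaging fusion are illustrative assumptions rather than the paper's exact operators:

```python
import numpy as np

# Toy heterogeneous graph: 4 vertices, 8-dim features, two assumed semantic graphs
# (e.g., metapath-induced graphs such as author-paper-author and author-venue-author).
num_v, dim = 4, 8
feats = np.random.rand(num_v, dim)
semantic_graphs = {
    "APA": {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]},
    "AVA": {0: [3], 1: [2, 3], 2: [1], 3: [0, 1]},
}

def aggregate(adj, feats):
    """Stage 1: aggregate neighbor features inside one semantic graph (mean here)."""
    out = np.zeros_like(feats)
    for v, nbrs in adj.items():
        out[v] = feats[nbrs].mean(axis=0) if nbrs else feats[v]
    return out

# Stage 1 is independent per semantic graph -- the inter-semantic-graph parallelism
# that an accelerator can exploit across its processing units.
per_graph = {name: aggregate(adj, feats) for name, adj in semantic_graphs.items()}

# Stage 2: fuse each vertex's aggregated results across all semantic graphs
# (plain averaging here; real HGNNs typically use learned semantic attention).
fused = np.mean(np.stack(list(per_graph.values())), axis=0)
print(fused.shape)  # (4, 8)
```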
Citations: 0
TeGraph+: Scalable Temporal Graph Processing Enabling Flexible Edge Modifications
IF 5.6, CAS Tier 2 (Computer Science), Q1 Computer Science, Theory & Methods. Pub Date: 2024-04-26. DOI: 10.1109/TPDS.2024.3393914
Chengying Huan;Yongchao Liu;Heng Zhang;Hang Liu;Shiyang Chen;Shuaiwen Leon Song;Yanjun Wu
Temporal graphs are widely used for time-critical applications, which enable the extraction of graph structural information with temporal features but cannot be efficiently supported by static graph computing systems. However, the current state-of-the-art solutions for temporal graph problems are not only ad hoc and suboptimal, but they also exhibit poor scalability, particularly in terms of their inability to scale to evolving graphs with flexible edge modifications (including insertions and deletions) and diverse execution environments. In this article, we present two key observations. First, temporal path problems can be characterized as topological-optimum problems, which can be efficiently resolved using a universal single-scan execution model. Second, data redundancy in transformed temporal graphs can be mitigated by merging superfluous vertices. Building upon these fundamental insights, we propose TeGraph+, a versatile temporal graph computing engine that makes the following contributions: (1) a unified optimization strategy and execution model for temporal graph problems; (2) a novel graph transformation model with a graph redundancy reduction strategy; (3) a spanning tree decomposition (STD) based distributed execution model which uses an efficient transformed graph decomposition strategy to partition the transformed graph into different spanning trees for distributed execution; (4) an efficient mixed imperative and lazy graph update strategy that offers support for evolving graphs with flexible edge modifications; (5) a general system framework with user-friendly APIs and the support of various execution environments, including in-memory, out-of-core, and distributed execution environments. Our extensive evaluation reveals that TeGraph+ can achieve up to 241× speedups over the state-of-the-art counterparts.
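As a concrete instance of the "topological-optimum, single-scan" idea, an earliest-arrival-time query on a temporal graph can be answered with one pass over edges sorted by departure time; the edge list, tuple format, and query below are illustrative assumptions, not TeGraph+'s API:

```python
# Temporal edges as (src, dst, departure_time, arrival_time) tuples (assumed format).
edges = [
    ("a", "b", 1, 2),
    ("b", "c", 3, 4),
    ("a", "c", 2, 7),
    ("c", "d", 5, 6),
]

def earliest_arrival(edges, source):
    """Single scan over time-ordered edges; each edge is examined exactly once."""
    arrival = {source: 0}
    for src, dst, dep, arr in sorted(edges, key=lambda e: e[2]):
        # The edge is usable only if we can reach src before it departs.
        if src in arrival and arrival[src] <= dep:
            arrival[dst] = min(arrival.get(dst, float("inf")), arr)
    return arrival

print(earliest_arrival(edges, "a"))  # {'a': 0, 'b': 2, 'c': 4, 'd': 6}
```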
Citations: 0
SLO-Aware Function Placement for Serverless Workflows With Layer-Wise Memory Sharing
IF 5.3, CAS Tier 2 (Computer Science), Q1 Computer Science. Pub Date: 2024-04-22. DOI: 10.1109/TPDS.2024.3391858
Dazhao Cheng;Kai Yan;Xinquan Cai;Yili Gong;Chuang Hu
Function-as-a-Service (FaaS) is a promising cloud computing model known for its scalability and elasticity. In various application domains, FaaS workflows have been widely adopted to manage user requests and complete computational tasks efficiently. Motivated by the fact that function containers collaboratively use the image layer's memory, so that co-placing functions can leverage memory sharing to reduce the cluster memory footprint, this article studies layer-wise memory sharing for serverless functions. We find that overwhelming memory sharing by placing containers in the same cluster machine may lead to performance deterioration and Service Level Objective (SLO) violations due to the increased CPU pressure. We investigate how to maximally reduce the cluster memory footprint via layer-wise memory sharing for serverless workflows while guaranteeing their SLO. First, we study the container memory sharing problem under serverless workflows with a static Directed Acyclic Graph (DAG) structure. We prove it is NP-hard and propose a 2-approximation algorithm, namely MDP. Then we consider workflows with dynamic DAG structures, where the memory sharing problem is also NP-hard. We design a greedy-based algorithm called GSP to address this issue. We implement a carefully designed prototype on the OpenWhisk platform, and our evaluation results demonstrate that both MDP and GSP achieve a balanced and satisfying state, effectively reducing cache memory usage by up to 63% while guaranteeing the serverless workflow SLO.
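A minimal greedy placement sketch conveys the trade-off the article studies: prefer machines that already cache a function's image layers (to maximize sharing), but only among machines whose CPU load stays under a budget that protects the SLO. The scoring rule, CPU budget, and layer names below are assumptions, not the MDP or GSP algorithms themselves:

```python
# Hypothetical cluster state: cached image layers and current CPU load per machine.
machines = [
    {"id": 0, "layers": set(), "cpu": 0.0},
    {"id": 1, "layers": set(), "cpu": 0.0},
]
functions = [
    {"name": "resize", "layers": {"python3.10", "pillow"}, "cpu": 0.4},
    {"name": "thumb",  "layers": {"python3.10", "pillow"}, "cpu": 0.4},
    {"name": "ocr",    "layers": {"python3.10", "tesseract"}, "cpu": 0.5},
]
CPU_CAP = 0.8  # assumed per-machine CPU budget that keeps latency within the SLO

def place(fn):
    # Only machines with CPU headroom are eligible (SLO guard) ...
    feasible = [m for m in machines if m["cpu"] + fn["cpu"] <= CPU_CAP]
    # ... and among them, pick the one sharing the most image layers (memory saving).
    best = max(feasible, key=lambda m: len(m["layers"] & fn["layers"]))
    best["layers"] |= fn["layers"]
    best["cpu"] += fn["cpu"]
    return best["id"]

for fn in functions:
    print(fn["name"], "-> machine", place(fn))
```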
Citations: 0
Efficient Utilization of Multi-Threading Parallelism on Heterogeneous Systems for Sparse Tensor Contraction
IF 5.3, CAS Tier 2 (Computer Science), Q1 Computer Science. Pub Date: 2024-04-19. DOI: 10.1109/TPDS.2024.3391254
Guoqing Xiao;Chuanghui Yin;Yuedan Chen;Mingxing Duan;Kenli Li
Many fields of scientific simulation, such as chemistry and condensed matter physics, are increasingly eschewing dense tensor contraction in favor of sparse tensor contraction. In this work, we center on binary sparse tensor contraction (SpTC), which poses the challenges of index matching and accumulation. To address these difficulties, we present GSpTC, an efficient element-wise SpTC framework for CPU-GPU heterogeneous systems. GSpTC first introduces a fine-grained partitioning strategy based on element-wise tensor contraction. By analyzing and selecting appropriate dimension partitioning strategies, we can efficiently utilize the multi-threading parallelism on GPUs and optimize the overall performance of GSpTC. In particular, GSpTC leverages multi-threading parallelism on GPUs for the contraction phase and the merging phase, which greatly accelerates the computation phase in sparse tensor contraction computations. Furthermore, GSpTC employs parallel pipeline technology to hide the data transmission time between the host and the device, further enhancing its performance. As a result, GSpTC achieves an average performance improvement of 267% compared to the previous state-of-the-art framework Sparta.
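Index matching and accumulation in a binary SpTC can be shown with a dictionary-of-coordinates sketch of C[i,k] = Σ_j A[i,j]·B[j,k]; the COO layout and the choice of contracted mode are assumptions for illustration, not GSpTC's GPU data structures:

```python
from collections import defaultdict

# Sparse operands in COO form: coordinate tuple -> value (assumed toy data).
A = {(0, 1): 2.0, (1, 0): 3.0, (1, 2): 1.0}   # nonzeros of A[i, j]
B = {(1, 0): 4.0, (2, 1): 5.0, (0, 1): 6.0}   # nonzeros of B[j, k]

# Index matching: bucket B's nonzeros by the contracted mode j so each nonzero of A
# only meets the B entries that share its j index.
B_by_j = defaultdict(list)
for (j, k), val in B.items():
    B_by_j[j].append((k, val))

# Accumulation: every matching (i, j) x (j, k) pair adds into the output entry C[i, k].
C = defaultdict(float)
for (i, j), a_val in A.items():
    for k, b_val in B_by_j[j]:
        C[(i, k)] += a_val * b_val

print(dict(C))  # {(0, 0): 8.0, (1, 1): 23.0}
```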
Citations: 0
Formal Definitions and Performance Comparison of Consistency Models for Parallel File Systems
IF 5.3, CAS Tier 2 (Computer Science), Q1 Computer Science. Pub Date: 2024-04-18. DOI: 10.1109/TPDS.2024.3391058
Chen Wang;Kathryn Mohror;Marc Snir
The semantics of HPC storage systems are defined by the consistency models to which they adhere. Storage consistency models have been less studied than their counterparts in memory systems, with the exception of the POSIX standard and its strict consistency model. The use of POSIX consistency imposes a performance penalty that becomes more significant as the scale of parallel file systems increases and the access time to storage devices, such as node-local solid-state storage devices, decreases. While some efforts have been made to adopt relaxed storage consistency models, these models are often defined informally and ambiguously as by-products of a particular implementation. In this work, we establish a connection between memory consistency models and storage consistency models and revisit the key design choices of storage consistency models from a high-level perspective. Further, we propose a formal and unified framework for defining storage consistency models and a layered implementation that can be used to easily evaluate their relative performance for different I/O workloads. Finally, we conduct a comprehensive performance comparison of two relaxed consistency models on a range of commonly seen parallel I/O workloads, such as checkpoint/restart of scientific applications and random reads of deep learning applications. We demonstrate that for certain I/O scenarios, a weaker consistency model can significantly improve I/O performance. For instance, for the small random reads typically found in deep learning applications, session consistency achieved a 5× improvement in I/O bandwidth over commit consistency, even at small scales.
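A highly simplified visibility model hints at why a relaxed model can be cheaper: a strict POSIX-like model must make every write visible to all readers immediately, whereas a commit-style relaxed model defers visibility until an explicit commit, letting the file system batch or delay synchronization. The class below is a toy abstraction under those assumptions, not the paper's formal definitions:

```python
# Toy visibility model (an assumption-laden simplification of storage consistency).
class SharedFile:
    def __init__(self, strict):
        self.strict, self.visible, self.pending = strict, {}, {}

    def write(self, offset, data):
        # Strict model: visible everywhere at once (forces eager synchronization).
        # Relaxed model: buffered locally until the writer commits.
        (self.visible if self.strict else self.pending)[offset] = data

    def commit(self):
        self.visible.update(self.pending)
        self.pending.clear()

    def read(self, offset):
        # Models what a remote reader is guaranteed to observe.
        return self.visible.get(offset)

relaxed = SharedFile(strict=False)
relaxed.write(0, b"checkpoint-shard")
print(relaxed.read(0))   # None: other processes need not see the write yet
relaxed.commit()
print(relaxed.read(0))   # b'checkpoint-shard'
```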
Citations: 0
Sampling-Based Multi-Job Placement for Heterogeneous Deep Learning Clusters
IF 5.3, CAS Tier 2 (Computer Science), Q1 Computer Science. Pub Date: 2024-04-17. DOI: 10.1109/TPDS.2024.3390109
Kaiyang Liu;Jingrong Wang;Zhiming Huang;Jianping Pan
Heterogeneous deep learning clusters commonly host a variety of distributed learning jobs. In such scenarios, the training efficiency of learning models is negatively affected by the slowest worker. To accelerate the training process, multiple learning jobs may compete for limited computational resources, posing significant challenges to multi-job placement among heterogeneous workers. This article presents a heterogeneity-aware scheduler that solves the multi-job placement problem while taking into account job sizing and load balancing, minimizing the average Job Completion Time (JCT) of deep learning jobs. A novel scheme based on proportional training workload assignment, feasible solution categorization, and matching markets is proposed with theoretical guarantees. To further reduce the computational complexity for low-latency decision-making and improve scheduling fairness, we propose to sparsify the feasible solution categories through sampling, which incurs negligible performance loss in JCT. We evaluate the performance of our design with real-world deep neural network benchmarks on heterogeneous computing clusters. Experimental results show that, compared to existing solutions, the proposed sampling-based scheme can achieve 1) results within 2.04% of the optimal JCT with orders-of-magnitude improvements in algorithm running time, and 2) high scheduling fairness among learning jobs.
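The proportional workload assignment idea, giving each heterogeneous worker a slice of the global batch proportional to its throughput so that no single slow worker stretches the iteration, can be shown in a few lines; the device names and throughputs are assumed numbers, not measurements from the article:

```python
# Assumed per-worker training throughputs (samples/second) for one data-parallel job.
speeds = {"A100": 900.0, "V100": 400.0, "T4": 200.0}
global_batch = 1500

# Assign each worker a share proportional to its speed; per-iteration times equalize,
# so the slowest worker no longer dictates the job's training efficiency.
total = sum(speeds.values())
shares = {w: round(global_batch * s / total) for w, s in speeds.items()}
times = {w: shares[w] / s for w, s in speeds.items()}

print(shares)  # {'A100': 900, 'V100': 400, 'T4': 200}
print(times)   # every worker takes ~1.0 s per iteration
```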
Citations: 0
On Off-Chaining Smart Contract Runtime Protection: A Queuing Model Approach
IF 5.3, CAS Tier 2 (Computer Science), Q1 Computer Science. Pub Date: 2024-04-16. DOI: 10.1109/TPDS.2024.3389153
Isra M. Ali;Mohamed M. Abdallah
The vulnerability of smart contracts has been demonstrated by an increasing number of multi-million-dollar exploitation incidents in public blockchains. Several works propose applying runtime verification to protect smart contracts post-deployment. However, none discusses the induced on-chain overhead that may preclude its deployment, leaving smart contracts unprotected. A prominent solution to the on-chain overhead is outsourcing the analysis off-chain. In this work, we analytically study the potential efficiency of off-chain smart contract runtime verification. We present a generic queueing network model of the off-chain runtime verification and the block generation process. The queueing model approach allows us to efficiently and flexibly capture the non-deterministic behavior of the blockchain, estimating the number of transactions in the pool and their corresponding waiting times. We analyze the on-chain overhead and evaluate off-chain RV, providing numerical indicators of transaction processing latency and throughput.
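A worked example of the kind of estimate such a queueing model yields: with Poisson transaction arrivals and exponential service (an M/M/1 simplification, with made-up rates rather than the paper's parameters), standard queueing formulas give the expected pool occupancy and waiting time:

```python
# Assumed rates for illustration only.
lam = 12.0   # transactions arriving at the pool per second
mu = 15.0    # transactions the chain plus off-chain verifier can process per second

rho = lam / mu                 # utilization; the queue is stable only if rho < 1
L = rho / (1 - rho)            # mean number of transactions in the system (M/M/1)
W = 1 / (mu - lam)             # mean time a transaction spends in the system

print(f"utilization={rho:.2f}, transactions in pool={L:.1f}, latency={W:.2f} s")
# utilization=0.80, transactions in pool=4.0, latency=0.33 s
```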
Citations: 0
HRCM: A Hierarchical Regularizing Mechanism for Sparse and Imbalanced Communication in Whole Human Brain Simulations
IF 5.3, CAS Tier 2 (Computer Science), Q1 Computer Science. Pub Date: 2024-04-12. DOI: 10.1109/TPDS.2024.3387720
Xin Du;Minglong Wang;Zhihui Lu;Qiang Duan;Yuhao Liu;Jianfeng Feng;Huarui Wang
Brain simulation is one of the most important means to understand how information is represented and processed in the brain, and it usually needs to be realized on supercomputers with a large number of interconnected graphics processing units (GPUs). For whole-human-brain simulation, tens of thousands of GPUs are utilized to simulate tens of billions of neurons and tens of trillions of synapses of the living brain to reveal functional connectivity patterns. However, as an instance of the irregular sparse communication problem on a large-scale system, the sparse and imbalanced communication patterns of the human brain make it particularly challenging to design a communication system for supporting large-scale brain simulations. To face this challenge, this paper proposes a hierarchical regularizing communication mechanism, HRCM. HRCM maintains a hierarchical virtual communication topology (HVCT) with a merge-forward algorithm that exploits the sparsity of neuron interactions to regularize inter-process communications in brain simulations. HRCM also provides a neuron-level partition scheme for assigning neurons to simulation processes to balance the communication load while improving resource utilization. In HRCM, neuron partition is formulated as a k-way graph partition problem and solved efficiently by the proposed hybrid multi-constraint greedy (HMCG) algorithm. HRCM has been implemented in human brain simulations at the scale of up to 86 billion neurons running on 10,000 GPUs. Results obtained from extensive simulation experiments verify the effectiveness of HRCM in significantly reducing communication delay, increasing resource usage, and shortening simulation time for large-scale human brain models.
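A toy greedy assignment conveys the flavor of multi-constraint neuron partitioning: place each neuron where most of its synaptic neighbors already live (to keep spike traffic local) while keeping every process under a load bound. The interaction graph, the 20% imbalance bound, and the single greedy pass are assumptions, not the HMCG algorithm itself:

```python
# Assumed neuron interaction graph: neuron id -> neurons it synapses with.
graph = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4}, 4: {3, 5}, 5: {4},
}
k = 2                              # number of simulation processes
cap = len(graph) / k * 1.2         # balance constraint: at most 20% above average load

assignment, loads = {}, [0] * k
for n in sorted(graph, key=lambda x: -len(graph[x])):   # heaviest neurons first
    # Among processes with remaining capacity, pick the one holding the most neighbors,
    # which keeps more spike communication inside a single process.
    feasible = [p for p in range(k) if loads[p] + 1 <= cap]
    best = max(feasible,
               key=lambda p: sum(1 for nb in graph[n] if assignment.get(nb) == p))
    assignment[n] = best
    loads[best] += 1

print(assignment, loads)   # e.g., {2: 0, 0: 0, 1: 0, 3: 1, 4: 1, 5: 1} [3, 3]
```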
Citations: 0
FastTuning: Enabling Fast and Efficient Hyper-Parameter Tuning With Partitioning and Parallelism of Search Space
IF 5.3, CAS Tier 2 (Computer Science), Q1 Computer Science. Pub Date: 2024-04-10. DOI: 10.1109/TPDS.2024.3386939
Xiaqing Li;Qi Guo;Guangyan Zhang;Siwei Ye;Guanhua He;Yiheng Yao;Rui Zhang;Yifan Hao;Zidong Du;Weimin Zheng
Hyper-parameter tuning (HPT) for deep learning (DL) models is prohibitively expensive. Sequential model-based optimization (SMBO) emerges as the state-of-the-art (SOTA) approach to automatically optimize HPT performance due to its heuristic advantages. Unfortunately, focusing on algorithm optimization rather than a large-scale parallel HPT system, existing SMBO-based approaches still cannot effectively remove their strong sequential nature, posing two performance problems: (1) extremely low tuning speed and (2) sub-optimal model quality. In this paper, we propose FastTuning, a fast, scalable, and generic system aiming at parallelly accelerating SMBO-based HPT for large DL/ML models. The key is to partition the highly complex search space into multiple smaller sub-spaces, each of which is assigned to and optimized by a different tuning worker in parallel. However, determining the right level of resource allocation to strike a balance between quality and cost remains a challenge. To address this, we further propose NIMBLE, a dynamic scheduling strategy that is specially designed for FastTuning, including (1) Dynamic Elimination Algorithm, (2) Sub-space Re-division, and (3) Posterior Information Sharing. Finally, we incorporate 6 SOTAs (i.e., 3 tuning algorithms and 3 parallel tuning tools) into FastTuning. Experimental results, on ResNet18, VGG19, ResNet50, and ResNet152, show that FastTuning can consistently offer much faster tuning speed (up to 80×) with better accuracy (up to 4.7% improvement), thereby enabling the application of automatic HPT to real-life DL models.
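The search-space partitioning idea can be sketched by splitting one hyper-parameter axis into sub-spaces and letting an independent tuning worker search each in parallel; the toy objective, the random-search worker (standing in for SMBO), and the split points are assumptions:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def objective(lr, batch):
    """Assumed toy objective; a real HPT run would train and validate a model."""
    return -(lr - 0.01) ** 2 - (batch - 64) ** 2 / 1e4

# Partition the learning-rate axis into sub-spaces, one per parallel tuning worker.
sub_spaces = [(1e-4, 5e-3), (5e-3, 2e-2), (2e-2, 1e-1)]

def tune(space, trials=200):
    lo, hi = space
    candidates = ((random.uniform(lo, hi), random.choice([16, 32, 64, 128]))
                  for _ in range(trials))
    best = max(candidates, key=lambda c: objective(*c))
    return best, objective(*best)

with ThreadPoolExecutor(max_workers=len(sub_spaces)) as pool:
    results = list(pool.map(tune, sub_spaces))

print(max(results, key=lambda r: r[1]))   # best configuration across all sub-spaces
```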
Citations: 0
MPMoE: Memory Efficient MoE for Pre-Trained Models With Adaptive Pipeline Parallelism
IF 5.3, CAS Tier 2 (Computer Science), Q1 Computer Science. Pub Date: 2024-04-08. DOI: 10.1109/TPDS.2024.3385639
Zheng Zhang;Yaqi Xia;Hulin Wang;Donglin Yang;Chuang Hu;Xiaobo Zhou;Dazhao Cheng
In recent years, the Mixture-of-Experts (MoE) technique has gained widespread popularity as a means to scale pre-trained models to exceptionally large sizes. Dynamic activation of experts allows for conditional computation, increasing the number of parameters of neural networks, which is critical for absorbing the vast amounts of knowledge available in many deep learning areas. However, despite the existing system and algorithm optimizations, there are significant challenges to be tackled when it comes to the inefficiencies of communication and memory consumption. In this paper, we present the design and implementation of MPMoE, a high-performance library that accelerates MoE training with adaptive and memory-efficient pipeline parallelism. Inspired by the observation that the MoE training procedure can be divided into multiple independent sub-stages, we design a pipeline parallelism method that reduces communication latency by overlapping it with computation operations. Further, we analyze the memory footprint breakdown of MoE training and identify that activations and temporary buffers are the primary contributors to the overall memory footprint. Toward memory efficiency, we propose memory reuse strategies to reduce memory requirements by eliminating memory redundancies. Finally, to jointly optimize pipeline granularity and memory reuse strategies, we propose a profile-based algorithm and a performance model to determine the configurations of MPMoE at runtime. We implement MPMoE upon PyTorch and evaluate it with common MoE models on two physical clusters, including 64 NVIDIA A100 GPU cards and 16 NVIDIA V100 GPU cards. Compared with the state-of-the-art approach, MPMoE achieves up to 2.3× speedup while reducing more than 30% memory footprint for training large models.
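A back-of-the-envelope timing model shows why overlapping a MoE layer's all-to-all dispatch with expert computation helps: while the compute stream works on one micro-batch, the communication stream already dispatches the next. The per-chunk timings below are placeholder assumptions, not measurements from MPMoE:

```python
# Assumed per-micro-batch costs (ms) and pipeline depth.
dispatch, compute, chunks = 4.0, 3.0, 4

# Without pipelining, every micro-batch pays dispatch + compute back to back.
serial = chunks * (dispatch + compute)

# With pipelining, dispatch of chunk i+1 overlaps computation of chunk i.
comm_free = comp_free = 0.0
for _ in range(chunks):
    d_end = comm_free + dispatch                  # dispatch occupies the comm stream
    comp_free = max(d_end, comp_free) + compute   # compute waits for its own dispatch
    comm_free = d_end                             # comm stream moves to the next chunk
print(f"serial: {serial:.0f} ms, pipelined: {comp_free:.0f} ms")   # 28 ms vs 19 ms
```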
Citations: 0