
Latest Publications: IEEE Transactions on Parallel and Distributed Systems

2024 Reviewers List*
IF 5.6 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-01-08 | DOI: 10.1109/TPDS.2024.3512712
{"title":"2024 Reviewers List*","authors":"","doi":"10.1109/TPDS.2024.3512712","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3512712","url":null,"abstract":"","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 2","pages":"356-360"},"PeriodicalIF":5.6,"publicationDate":"2025-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10834303","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142938326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
HpT: Hybrid Acceleration of Spatio-Temporal Attention Model Training on Heterogeneous Manycore Architectures
IF 5.6 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-01-01 | DOI: 10.1109/TPDS.2024.3522781
Saiman Dahal;Pratyush Dhingra;Krishu Kumar Thapa;Partha Pratim Pande;Ananth Kalyanaraman
Transformer models have become widely popular in numerous applications, and especially for building foundation large language models (LLMs). Recently, there has been a surge in the exploration of transformer-based architectures in non-LLM applications. In particular, the self-attention mechanism within the transformer architecture offers a way to exploit any hidden relations within data, making it widely applicable for a variety of spatio-temporal tasks in scientific computing domains (e.g., weather, traffic, agriculture). Most of these efforts have primarily focused on accelerating the inference phase. However, the computational resources required to train these attention-based models for scientific applications remain a significant challenge to address. Emerging non-volatile memory (NVM)-based processing-in-memory (PIM) architectures can achieve higher performance and better energy efficiency than their GPU-based counterparts. However, the frequent weight updates during training would necessitate write operations to NVM cells, posing a significant barrier for considering stand-alone NVM-based PIM architectures. In this paper, we present HpT, a new hybrid approach to accelerate the training of attention-based models for scientific applications. Our approach is hybrid at two different layers: at the software layer, our approach dynamically switches from a full-parameter training mode to a lower-parameter training mode by incorporating intrinsic dimensionality; and at the hardware layer, our approach harnesses the combined power of GPUs, resistive random-access memory (ReRAM)-based PIM devices, and systolic arrays. This software-hardware co-design approach is aimed at adaptively reducing both runtime and energy costs during the training phase, without compromising on quality. Experiments on four concrete real-world scientific applications demonstrate that our hybrid approach is able to significantly reduce training time (up to $11.9\times$) and energy consumption (up to $12.05\times$), compared to the corresponding full-parameter training executing on only GPUs. Our approach serves as an example for accelerating the training of attention-based models on heterogeneous platforms including ReRAMs.
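The software-layer idea above (switching from full-parameter to lower-parameter training once the optimization appears to live in a low-dimensional subspace) can be pictured with a toy loop. This is only a sketch of the general technique, not the HpT implementation: the intrinsic-dimensionality proxy, the low-rank factorization, and all names and constants below are assumptions made for illustration.

```python
# Toy sketch (not HpT): full-parameter updates until a gradient-rank proxy says the
# problem is effectively low-dimensional, then switch to cheaper low-rank updates.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, lr = 64, 32, 4, 1e-2
X = rng.standard_normal((1024, d_in))
W_true = rng.standard_normal((d_out, rank)) @ rng.standard_normal((rank, d_in))  # low-rank target
Y = X @ W_true.T

W = np.zeros((d_out, d_in))                    # full-parameter weights
A = 0.1 * rng.standard_normal((d_out, rank))   # low-rank correction, effective weights = W + A @ B
B = 0.1 * rng.standard_normal((rank, d_in))
mode = "full"

def intrinsic_dim(grad, energy=0.95):
    """Crude proxy: how many singular values are needed to capture `energy` of the gradient."""
    s = np.linalg.svd(grad, compute_uv=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cum, energy) + 1)

for step in range(200):
    idx = rng.choice(len(X), 128, replace=False)
    xb, yb = X[idx], Y[idx]
    err = xb @ (W + A @ B).T - yb              # dL/dpred for 0.5 * MSE
    grad_W = err.T @ xb / len(xb)
    if mode == "full":
        W -= lr * grad_W
        if step > 20 and intrinsic_dim(grad_W) <= rank:
            mode = "low_rank"                  # gradients now live in a small subspace
    else:                                      # cheaper phase: only A and B are updated
        A -= lr * (grad_W @ B.T)
        B -= lr * (A.T @ grad_W)
    if step % 50 == 0:
        print(step, mode, float(np.mean(err**2)))
```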
{"title":"HpT: Hybrid Acceleration of Spatio-Temporal Attention Model Training on Heterogeneous Manycore Architectures","authors":"Saiman Dahal;Pratyush Dhingra;Krishu Kumar Thapa;Partha Pratim Pande;Ananth Kalyanaraman","doi":"10.1109/TPDS.2024.3522781","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3522781","url":null,"abstract":"Transformer models have become widely popular in numerous applications, and especially for building foundation large language models (LLMs). Recently, there has been a surge in the exploration of transformer-based architectures in non-LLM applications. In particular, the self-attention mechanism within the transformer architecture offers a way to exploit any hidden relations within data, making it widely applicable for a variety of spatio-temporal tasks in scientific computing domains (e.g., weather, traffic, agriculture). Most of these efforts have primarily focused on accelerating the inference phase. However, the computational resources required to train these attention-based models for scientific applications remain a significant challenge to address. Emerging non-volatile memory (NVM)-based processing-in-memory (PIM) architectures can achieve higher performance and better energy efficiency than their GPU-based counterparts. However, the frequent weight updates during training would necessitate write operations to NVM cells, posing a significant barrier for considering stand-alone NVM-based PIM architectures. In this paper, we present <monospace>HpT</monospace>, a new hybrid approach to accelerate the training of attention-based models for scientific applications. Our approach is hybrid at two different layers: at the software layer, our approach dynamically switches from a full-parameter training mode to a lower-parameter training mode by incorporating intrinsic dimensionality; and at the hardware layer, our approach harnesses the combined power of GPUs, resistive random-access memory (ReRAM)-based PIM devices, and systolic arrays. This software-hardware co-design approach is aimed at adaptively reducing both runtime and energy costs during the training phase, without compromising on quality. Experiments on four concrete real-world scientific applications demonstrate that our hybrid approach is able to significantly reduce training time (up to <inline-formula><tex-math>$11.9times$</tex-math></inline-formula>) and energy consumption (up to <inline-formula><tex-math>$12.05times$</tex-math></inline-formula>), compared to the corresponding full-parameter training executing on only GPUs. Our approach serves as an example for accelerating the training of attention-based models on heterogeneous platforms including ReRAMs.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 3","pages":"407-421"},"PeriodicalIF":5.6,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142992862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Sparrow: Expediting Smart Contract Execution for Blockchain Sharding via Inter-Shard Caching
IF 5.6 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-12-26 | DOI: 10.1109/TPDS.2024.3522016
Junyuan Liang;Peiyuan Yao;Wuhui Chen;Zicong Hong;Jianting Zhang;Ting Cai;Min Sun;Zibin Zheng
Sharding is a promising solution to scale blockchain by separating the system into multiple shards to process transactions in parallel. However, due to state separation and shard isolation, it is still challenging to efficiently support smart contracts on a blockchain sharding system where smart contracts can interact with each other, involving states maintained by multiple shards. Specifically, existing sharding systems adopt a costly multi-step collaboration mechanism to execute smart contracts, resulting in long latency and low throughput. This article proposes Sparrow, a blockchain sharding protocol achieving one-step execution for smart contracts. To break shard isolation, inspired by non-local hotspot data caching in traditional databases, we propose a new idea of inter-shard caching, allowing a shard to prefetch and cache frequently accessed contract states of other shards. The miner can thus use the inter-shard cache to pre-execute a pending transaction, retrieve all its contract invocations, and commit it to multiple shards in one step. Particularly, we first propose a speculative dispersal cache synchronization mechanism for efficient and secure cache synchronization across shards in Byzantine environments. Then, we propose a multi-branch exploration mechanism to solve the rollback problem during the optimistic one-step execution of contract invocations with dependencies. We also present a series of conflict resolution mechanisms to decrease the rollback caused by inherent transaction conflicts. We implement prototypes for Sparrow and existing sharding systems, and the evaluation shows that Sparrow improves the throughput by $2.44\times$ and reduces the transaction latency by 30% compared with the existing sharding systems.
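As a rough illustration of the inter-shard caching idea described above, the sketch below shows a shard prefetching hot contract states from other shards so a miner can pre-execute a transaction locally and learn, in one pass, every shard it must commit to. It is not the Sparrow protocol: the classes, the hot-key heuristic, and the fallback behavior are all invented for illustration.

```python
# Toy inter-shard cache: prefetch frequently read remote contract states, then
# pre-execute a transaction to discover all shards involved in a one-step commit.
from collections import Counter

class Shard:
    def __init__(self, shard_id, states):
        self.shard_id = shard_id
        self.states = dict(states)          # contract_key -> value owned by this shard

class InterShardCache:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = {}                   # contract_key -> (owner_shard_id, value)

    def prefetch_hot_keys(self, shards, access_log):
        """Cache the most frequently accessed remote keys seen in recent blocks."""
        for key, _ in Counter(access_log).most_common(self.capacity):
            for s in shards:
                if key in s.states:
                    self.entries[key] = (s.shard_id, s.states[key])

    def get(self, key):
        return self.entries.get(key)

def pre_execute(tx_reads, local, cache):
    """Resolve every state a transaction reads; return values plus the shards to commit to."""
    values, involved = {}, {local.shard_id}
    for key in tx_reads:
        if key in local.states:
            values[key] = local.states[key]
        else:
            owner, value = cache.get(key) or (None, None)
            if owner is None:
                raise KeyError(f"{key} not cached; fall back to multi-step execution")
            values[key] = value
            involved.add(owner)
    return values, involved                  # commit to `involved` shards in one step

shards = [Shard(0, {"a": 1}), Shard(1, {"b": 2}), Shard(2, {"c": 3})]
cache = InterShardCache()
cache.prefetch_hot_keys(shards[1:], access_log=["b", "b", "c"])
print(pre_execute(["a", "b", "c"], shards[0], cache))
```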
{"title":"Sparrow: Expediting Smart Contract Execution for Blockchain Sharding via Inter-Shard Caching","authors":"Junyuan Liang;Peiyuan Yao;Wuhui Chen;Zicong Hong;Jianting Zhang;Ting Cai;Min Sun;Zibin Zheng","doi":"10.1109/TPDS.2024.3522016","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3522016","url":null,"abstract":"Sharding is a promising solution to scale blockchain by separating the system into multiple shards to process transactions in parallel. However, due to state separation and shard isolation, it is still challenging to efficiently support smart contracts on a blockchain sharding system where smart contracts can interact with each other, involving states maintained by multiple shards. Specifically, existing sharding systems adopt a costly multi-step collaboration mechanism to execute smart contracts, resulting in long latency and low throughput. This article proposes <small>Sparrow</small>, a blockchain sharding protocol achieving one-step execution for smart contracts. To break shard isolation, inspired by non-local hotspot data caching in traditional databases, we propose a new idea of <i>inter-shard caching</i>, allowing a shard to prefetch and cache frequently accessed contract states of other shards. The miner can thus use the inter-shard cache to pre-execute a pending transaction, retrieve all its contract invocations, and commit it to multiple shards in one step. Particularly, we first propose a speculative dispersal cache synchronisation mechanism for efficient and secure cache synchronization across shards in Byzantine environments. Then, we propose a multi-branch exploration mechanism to solve the rollback problem during the optimistic one-step execution of contract invocations with dependencies. We also present a series of conflict resolution mechanisms to decrease the rollback caused by inherent transaction conflicts. We implement prototypes for <small>Sparrow</small> and existing sharding systems, and the evaluation shows that <small>Sparrow</small> improves the throughput by <inline-formula><tex-math>$2.44times$</tex-math></inline-formula> and reduces the transaction latency by 30% compared with the existing sharding systems.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 3","pages":"377-390"},"PeriodicalIF":5.6,"publicationDate":"2024-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142992863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
CAT: Cellular Automata on Tensor Cores
IF 5.6 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-12-20 | DOI: 10.1109/TPDS.2024.3520395
Cristóbal A. Navarro;Felipe A. Quezada;Enzo Meneses;Héctor Ferrada;Nancy Hitschfeld
Cellular automata (CA) are simulation models that can produce complex emergent behaviors from simple local rules. Although state-of-the-art GPU solutions are already fast due to their data-parallel nature, their performance can rapidly degrade in CA with a large neighborhood radius. With the inclusion of tensor cores across the entire GPU ecosystem, interest has grown in finding ways to leverage these fast units outside the field of artificial intelligence, which was their original purpose. In this work, we present CAT, a GPU tensor core approach that can accelerate CA in which the cell transition function acts on a weighted summation of its neighborhood. CAT is evaluated theoretically, using an extended PRAM cost model, as well as empirically using the Larger Than Life (LTL) family of CA as case studies. The results confirm that the cost model is accurate, showing that CAT exhibits constant time throughout the entire radius range $1 \leq r \leq 16$, and its theoretical speedups agree with the empirical results. At low radius $r=1,2$, CAT is competitive and is only surpassed by the fastest state-of-the-art GPU solution. Starting from $r=3$, CAT progressively outperforms all other approaches, reaching speedups of up to $101\times$ over a GPU baseline and up to $\sim\!14\times$ over the fastest state-of-the-art GPU approach. In terms of energy efficiency, CAT is competitive in the range $1 \leq r \leq 4$ and from $r \geq 5$ it is the most energy efficient approach. As for performance scaling across GPU architectures, CAT shows a promising trend that, if it continues for future generations, would increase its performance at a higher rate than classical GPU solutions. A CPU version of CAT was also explored, using the recently introduced AMX instructions. Although its performance is still below GPU tensor cores, it is a promising approach as it can still outperform some GPU approaches at large radius. The results obtained in this work put CAT as an approach with great potential for scientists who need to study emerging phenomena in CA with a large neighborhood radius, both in the GPU and in the CPU.
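The core observation, that a radius-r neighborhood summation can be cast as matrix multiplications (the operation tensor cores accelerate), can be sketched as follows. This is a plain NumPy illustration of the general idea, not the CAT kernel; the Larger-than-Life thresholds and radius below are arbitrary assumptions.

```python
# Sketch: express the radius-r box-sum of a CA grid as two banded matrix products,
# then apply made-up birth/survival thresholds on the resulting neighborhood counts.
import numpy as np

def banded_ones(n, r):
    """K[i, j] = 1 if |i - j| <= r else 0; multiplying by K sums over a 2r+1 window."""
    idx = np.arange(n)
    return (np.abs(idx[:, None] - idx[None, :]) <= r).astype(np.float32)

def ltl_step(grid, r=3, birth=(14, 19), survive=(14, 28)):
    n = grid.shape[0]
    K = banded_ones(n, r)
    neigh = K @ grid @ K - grid                # box-sum via two matmuls, minus the cell itself
    born = (grid == 0) & (neigh >= birth[0]) & (neigh <= birth[1])
    keep = (grid == 1) & (neigh >= survive[0]) & (neigh <= survive[1])
    return (born | keep).astype(np.float32)

rng = np.random.default_rng(1)
g = (rng.random((128, 128)) < 0.3).astype(np.float32)
for _ in range(10):
    g = ltl_step(g)
print("live cells:", int(g.sum()))
```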
{"title":"CAT: Cellular Automata on Tensor Cores","authors":"Cristóbal A. Navarro;Felipe A. Quezada;Enzo Meneses;Héctor Ferrada;Nancy Hitschfeld","doi":"10.1109/TPDS.2024.3520395","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3520395","url":null,"abstract":"Cellular automata (CA) are simulation models that can produce complex emergent behaviors from simple local rules. Although state-of-the-art GPU solutions are already fast due to their data-parallel nature, their performance can rapidly degrade in CA with a large neighborhood radius. With the inclusion of tensor cores across the entire GPU ecosystem, interest has grown in finding ways to leverage these fast units outside the field of artificial intelligence, which was their original purpose. In this work, we present CAT, a GPU tensor core approach that can accelerate CA in which the cell transition function acts on a weighted summation of its neighborhood. CAT is evaluated theoretically, using an extended PRAM cost model, as well as empirically using the Larger Than Life (LTL) family of CA as case studies. The results confirm that the cost model is accurate, showing that CAT exhibits constant time throughout the entire radius range \u0000<inline-formula><tex-math>$1 leq r leq 16$</tex-math></inline-formula>\u0000, and its theoretical speedups agree with the empirical results. At low radius \u0000<inline-formula><tex-math>$r=1,2$</tex-math></inline-formula>\u0000, CAT is competitive and is only surpassed by the fastest state-of-the-art GPU solution. Starting from \u0000<inline-formula><tex-math>$r=3$</tex-math></inline-formula>\u0000, CAT progressively outperforms all other approaches, reaching speedups of up to \u0000<inline-formula><tex-math>$101times$</tex-math></inline-formula>\u0000 over a GPU baseline and up to \u0000<inline-formula><tex-math>$sim !14times$</tex-math></inline-formula>\u0000 over the fastest state-of-the-art GPU approach. In terms of energy efficiency, CAT is competitive in the range \u0000<inline-formula><tex-math>$1 leq r leq 4$</tex-math></inline-formula>\u0000 and from \u0000<inline-formula><tex-math>$r geq 5$</tex-math></inline-formula>\u0000 it is the most energy efficient approach. As for performance scaling across GPU architectures, CAT shows a promising trend that, if continues for future generations, it would increase its performance at a higher rate than classical GPU solutions. A CPU version of CAT was also explored, using the recently introduced AMX instructions. Although its performance is still below GPU tensor cores, it is a promising approach as it can still outperform some GPU approaches at large radius. The results obtained in this work put CAT as an approach with great potential for scientists who need to study emerging phenomena in CA with a large neighborhood radius, both in the GPU and in the CPU.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 2","pages":"341-355"},"PeriodicalIF":5.6,"publicationDate":"2024-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142938327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
UMPIPE: Unequal Microbatches-Based Pipeline Parallelism for Deep Neural Network Training
IF 5.6 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-12-11 | DOI: 10.1109/TPDS.2024.3515804
Guangyao Zhou;Wenhong Tian;Rajkumar Buyya;Kui Wu
The increasing need for large-scale deep neural networks (DNN) has made parallel training an area of intensive focus. One effective method, microbatch-based pipeline parallelism (notably GPipe), accelerates parallel training in various architectures. However, existing parallel training architectures normally use equal data partitioning (EDP), where each layer's process maintains identical microbatch-sizes. EDP may hinder training speed because different processes often require varying optimal microbatch-sizes. To address this, we introduce UMPIPE, a novel framework for unequal microbatches-based pipeline parallelism. UMPIPE enables unequal data partitions (UEDP) across processes to optimize resource utilization. We develop a recurrence formula to calculate the time cost in UMPIPE by considering both computation and communication processes. To further enhance UMPIPE's efficiency, we propose the Dual-Chromosome Genetic Algorithm for UMPIPE (DGAP) that accounts for the independent time costs of forward and backward propagation. Furthermore, we present TiDGAP, a two-level improvement on DGAP. TiDGAP accelerates the process by simultaneously calculating the end time for multiple individuals and microbatches using matrix operations. Our extensive experiments validate the dual-chromosome strategy's optimization benefits and TiDGAP's acceleration capabilities. TiDGAP can achieve better training schemes than baselines, such as the local greedy algorithm and the global greedy-based dynamic programming. Compared to (GPipe, PipeDream), UMPIPE achieves increases in training speed: $(13.89, 11.09)\%$ for GPT1-14, $(17.11, 7.96)\%$ for VGG16 and $\geq (170, 100)\%$ for simulation networks.
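The recurrence mentioned above can be pictured with a deliberately simplified model: microbatch m finishes on stage s only after it finishes on stage s-1 (plus a transfer cost) and after stage s finishes microbatch m-1. The sketch below is not UMPIPE's published formula (it ignores forward/backward separation and the realignment of unequal microbatch boundaries), and the linear per-stage cost model and every constant are invented.

```python
# Simplified pipeline-schedule recurrence; per-stage costs depend on the microbatch
# size that stage chooses, which is the knob UMPIPE tunes per layer.
def pipeline_finish_time(per_microbatch_cost, comm_cost, n_micro):
    """per_microbatch_cost[s]: compute time of one microbatch on stage s."""
    S = len(per_microbatch_cost)
    finish = [[0.0] * (n_micro + 1) for _ in range(S + 1)]   # finish[s][m], 1-indexed
    for m in range(1, n_micro + 1):
        for s in range(1, S + 1):
            ready = max(finish[s - 1][m] + comm_cost, finish[s][m - 1])
            finish[s][m] = ready + per_microbatch_cost[s - 1]
    return finish[S][n_micro]

def stage_cost(micro_size, per_sample_time, overhead=0.5):
    return overhead + per_sample_time * micro_size           # invented linear cost model

per_sample = [0.02, 0.04, 0.01, 0.03]                        # hypothetical per-stage costs
equal_plan = [stage_cost(64, t) for t in per_sample]         # every stage uses microbatch 64
unequal_plan = [stage_cost(b, t) for b, t in zip([32, 96, 32, 64], per_sample)]
print("equal  :", pipeline_finish_time(equal_plan, comm_cost=0.1, n_micro=8))
print("unequal:", pipeline_finish_time(unequal_plan, comm_cost=0.1, n_micro=8))
```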
{"title":"UMPIPE: Unequal Microbatches-Based Pipeline Parallelism for Deep Neural Network Training","authors":"Guangyao Zhou;Wenhong Tian;Rajkumar Buyya;Kui Wu","doi":"10.1109/TPDS.2024.3515804","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3515804","url":null,"abstract":"The increasing need for large-scale deep neural networks (DNN) has made parallel training an area of intensive focus. One effective method, microbatch-based pipeline parallelism (notably GPipe), accelerates parallel training in various architectures. However, existing parallel training architectures normally use equal data partitioning (EDP), where each layer's process maintains identical microbatch-sizes. EDP may hinder training speed because different processes often require varying optimal microbatch-sizes. To address this, we introduce UMPIPE, a novel framework for unequal microbatches-based pipeline parallelism. UMPIPE enables unequal data partitions (UEDP) across processes to optimize resource utilization. We develop a recurrence formula to calculate the time cost in UMPIPE by considering both computation and communication processes. To further enhance UMPIPE's efficiency, we propose the Dual-Chromosome Genetic Algorithm for UMPIPE (DGAP) that accounts for the independent time costs of forward and backward propagation. Furthermore, we present TiDGAP, a two-level improvement on DGAP. TiDGAP accelerates the process by simultaneously calculating the end time for multiple individuals and microbatches using matrix operations. Our extensive experiments validate the dual-chromosome strategy's optimization benefits and TiDGAP's acceleration capabilities. TiDGAP can achieve better training schemes than baselines, such as the local greedy algorithm and the global greedy-based dynamic programming. Compared to (GPipe, PipeDream), UMPIPE achieves increases in training speed: \u0000<inline-formula><tex-math>$(13.89,11.09)%$</tex-math></inline-formula>\u0000 for GPT1-14, \u0000<inline-formula><tex-math>$(17.11, 7.96)%$</tex-math></inline-formula>\u0000 for VGG16 and \u0000<inline-formula><tex-math>$geq (170,100)%$</tex-math></inline-formula>\u0000 for simulation networks.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 2","pages":"293-307"},"PeriodicalIF":5.6,"publicationDate":"2024-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142890166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Fine-Grained QoS Control via Tightly-Coupled Bandwidth Monitoring and Regulation for FPGA-Based Heterogeneous SoCs
IF 5.6 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-12-09 | DOI: 10.1109/TPDS.2024.3513416
Giacomo Valente;Gianluca Brilli;Tania Di Mascio;Alessandro Capotondi;Paolo Burgio;Paolo Valente;Andrea Marongiu
Commercial embedded systems increasingly rely on heterogeneous architectures that integrate general-purpose, multi-core processors, and various hardware accelerators on the same chip. This provides the high performance required by modern applications at a low cost and low power consumption, but at the same time poses new challenges. Hardware resource sharing at various levels, and in particular at the main memory controller level, results in slower execution time for the application tasks, ultimately making the system unpredictable from the point of view of timing. To enable the adoption of heterogeneous systems-on-chip (SoCs) in the domain of timing-critical applications, several hardware and software approaches have been proposed, bandwidth regulation based on monitoring and throttling being one of the most widely adopted. Existing solutions, however, are either too coarse-grained, limiting the control over computing engines' activities, or strongly platform-dependent, addressing the problem only for specific SoCs. This article proposes an innovative approach that can accurately control main memory bandwidth usage in FPGA-based heterogeneous SoCs. In particular, it controls system bandwidth by connecting a runtime bandwidth regulation component to FPGA-based accelerators. Our solution offers dynamically configurable, fine-grained bandwidth regulation – to adapt to the varying requirements of the application over time – at a very low overhead. Furthermore, it is entirely platform-independent, capable of integration with any FPGA-based accelerator. Developed at the register-transfer level using a reference SoC platform, it is designed for easy compatibility with any FPGA-based SoC. Experimental results conducted on the Xilinx Zynq UltraScale+ platform demonstrate that our approach (i) is more than $100\times$ faster than loosely-coupled, software-controlled regulators; (ii) is capable of exploiting the system bandwidth 28.7% more efficiently than tightly-coupled hardware regulators (e.g., ARM CoreLink QoS-400, where available); (iii) enables task co-scheduling solutions not feasible with state-of-the-art bandwidth regulation methods.
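The monitor-and-throttle mechanism described above can be approximated in software by a token-bucket style regulator: an accelerator's burst proceeds only if enough byte credits remain in the current regulation window, and the budget can be reconfigured at runtime. This is an illustration of the general technique, not the authors' RTL design; the class, the window scheme, and all numbers are assumptions.

```python
# Toy per-accelerator bandwidth regulator: refill credits every window, grant a
# burst only while credits remain, stall otherwise.
class BandwidthRegulator:
    def __init__(self, budget_bytes_per_window, window_us):
        self.budget = budget_bytes_per_window
        self.window_us = window_us
        self.credits = budget_bytes_per_window
        self.window_start = 0.0

    def set_budget(self, budget_bytes_per_window):
        """Runtime reconfiguration: tighten or relax the per-window budget on the fly."""
        self.budget = budget_bytes_per_window

    def request(self, now_us, burst_bytes):
        """Return True if the burst may proceed now; False means the master is stalled."""
        if now_us - self.window_start >= self.window_us:     # new window: refill credits
            self.window_start = now_us
            self.credits = self.budget
        if burst_bytes <= self.credits:
            self.credits -= burst_bytes
            return True
        return False

reg = BandwidthRegulator(budget_bytes_per_window=4096, window_us=10.0)
granted = sum(reg.request(t * 1.0, 512) for t in range(30))  # one 512 B burst per microsecond
print("bursts granted in 30 us:", granted)                   # 8 per 10 us window with a 4 KiB budget
```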
{"title":"Fine-Grained QoS Control via Tightly-Coupled Bandwidth Monitoring and Regulation for FPGA-Based Heterogeneous SoCs","authors":"Giacomo Valente;Gianluca Brilli;Tania Di Mascio;Alessandro Capotondi;Paolo Burgio;Paolo Valente;Andrea Marongiu","doi":"10.1109/TPDS.2024.3513416","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3513416","url":null,"abstract":"Commercial embedded systems increasingly rely on heterogeneous architectures that integrate general-purpose, multi-core processors, and various hardware accelerators on the same chip. This provides the high performance required by modern applications at a low cost and low power consumption, but at the same time poses new challenges. Hardware resource sharing at various levels, and in particular at the main memory controller level, results in slower execution time for the application tasks, ultimately making the system unpredictable from the point of view of timing. To enable the adoption of heterogeneous systems-on-chip (System on Chips (SoCs)) in the domain of timing-critical applications several hardware and software approaches have been proposed, bandwidth regulation based on monitoring and throttling being one of the most widely adopted. Existing solutions, however, are either too coarse-grained, limiting the control over computing engines activities, or strongly platform-dependent, addressing the problem only for specific SoCs. This article proposes an innovative approach that can accurately control main memory bandwidth usage in FPGA-based heterogeneous SoCs. In particular, it controls system bandwidth by connecting a runtime bandwidth regulation component to FPGA-based accelerators. Our solution offers dynamically configurable, fine-grained bandwidth regulation – to adapt to the varying requirements of the application over time – at a very low overhead. Furthermore, it is entirely platform-independent, capable of integration with any FPGA-based accelerator. Developed at the register-transfer level using a reference SoC platform, it is designed for easy compatibility with any FPGA-based SoC. Experimental results conducted on the Xilinx Zynq UltraScale+ platform demonstrate that our approach (i) is more than \u0000<inline-formula><tex-math>$100times$</tex-math></inline-formula>\u0000 faster than loosely-coupled, software controlled regulators; (ii) is capable of exploiting the system bandwidth 28.7% more efficiently than tightly-coupled hardware regulators (e.g., ARM CoreLink QoS-400, where available); (iii) enables task co-scheduling solutions not feasible with state-of-the-art bandwidth regulation methods.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 2","pages":"326-340"},"PeriodicalIF":5.6,"publicationDate":"2024-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142938328","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
TOP: Task-Based Operator Parallelism for Asynchronous Deep Learning Inference on GPU
IF 5.6 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-12-05 | DOI: 10.1109/TPDS.2024.3511543
Changyao Lin;Zhenming Chen;Ziyang Zhang;Jie Liu
Current deep learning compilers have made significant strides in optimizing computation graphs for single- and multi-model scenarios. However, they lack specific optimizations for asynchronous multi-task inference systems. In such systems, tasks arrive dynamically, leading to diverse inference progress for each model. This renders traditional optimization strategies based solely on the original computation graph suboptimal or even invalid. Furthermore, existing operator scheduling methods do not account for parallel task pipelines involving the same model. Task pipelines present additional opportunities for optimization. Therefore, we propose Task-based Operator Parallelism (TOP). TOP incorporates an understanding of the impact of task arrival patterns on the inference progress of each model. It leverages the multi-agent reinforcement learning algorithm MADDPG to cooperatively optimize the task launcher and model scheduler, generating an optimal pair of dequeue frequency and computation graph. The objective of TOP is to enhance resource utilization, increase throughput, and allocate resources judiciously to prevent task backlog. To expedite the optimization process in TOP, we introduce a novel stage partition method using the GNN-based Policy Gradient (GPG) algorithm. Through extensive experiments on various devices, we demonstrate the efficacy of TOP. It outperforms the state-of-the-art in operator scheduling for both single- and multi-model task processing scenarios. Benefiting from TOP, we can significantly enhance the throughput of a single model by increasing its concurrency or batch size, thereby achieving self-acceleration.
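One of the two knobs TOP optimizes, the dequeue frequency of the task launcher, can be illustrated with a small queueing simulation: tasks arrive asynchronously, and the launcher drains the queue every 1/f seconds into one batched inference. The sketch is not the TOP system (no MADDPG, no computation-graph scheduling); the arrival pattern, service-time model, and constants are invented.

```python
# Toy simulation of how the launcher's dequeue frequency trades batching against latency.
import random
from collections import deque

def simulate(dequeue_freq_hz, arrival_rate_hz, batch_overhead=0.004, per_task=0.002, horizon=5.0):
    random.seed(0)
    interval = 1.0 / dequeue_freq_hz
    arrivals, t_arrival = [], 0.0
    while t_arrival < horizon:                               # Poisson-like arrival stream
        t_arrival += random.expovariate(arrival_rate_hz)
        arrivals.append(t_arrival)
    queue, latencies = deque(), []
    t, i, busy_until = 0.0, 0, 0.0
    while t < horizon + 1.0:                                 # run past the horizon to drain
        while i < len(arrivals) and arrivals[i] <= t:
            queue.append(arrivals[i])
            i += 1
        if queue and t >= busy_until:                        # launch one batched inference
            batch = [queue.popleft() for _ in range(len(queue))]
            busy_until = t + batch_overhead + per_task * len(batch)
            latencies += [busy_until - a for a in batch]
        t += interval
    return sum(latencies) / len(latencies)

for freq in (20, 100, 500):
    print(f"dequeue {freq:>3} Hz -> mean latency {simulate(freq, 200):.4f} s")
```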
{"title":"TOP: Task-Based Operator Parallelism for Asynchronous Deep Learning Inference on GPU","authors":"Changyao Lin;Zhenming Chen;Ziyang Zhang;Jie Liu","doi":"10.1109/TPDS.2024.3511543","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3511543","url":null,"abstract":"Current deep learning compilers have made significant strides in optimizing computation graphs for single- and multi-model scenarios. However, they lack specific optimizations for asynchronous multi-task inference systems. In such systems, tasks arrive dynamically, leading to diverse inference progress for each model. This renders traditional optimization strategies based solely on the original computation graph suboptimal or even invalid. Furthermore, existing operator scheduling methods do not account for parallel task pipelines involving the same model. Task pipelines present additional opportunities for optimization. Therefore, we propose Task-based Operator Parallelism (TOP). TOP incorporates an understanding of the impact of task arrival patterns on the inference progress of each model. It leverages the multi-agent reinforcement learning algorithm MADDPG to cooperatively optimize the task launcher and model scheduler, generating an optimal pair of dequeue frequency and computation graph. The objective of TOP is to enhance resource utilization, increase throughput, and allocate resources judiciously to prevent task backlog. To expedite the optimization process in TOP, we introduce a novel stage partition method using the GNN-based Policy Gradient (GPG) algorithm. Through extensive experiments on various devices, we demonstrate the efficacy of TOP. It outperforms the state-of-the-art in operator scheduling for both single- and multi-model task processing scenarios. Benefiting from TOP, we can significantly enhance the throughput of a single model by increasing its concurrency or batch size, thereby achieving self-acceleration.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 2","pages":"266-281"},"PeriodicalIF":5.6,"publicationDate":"2024-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142890355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
An Efficient GPU Algorithm for Lattice Boltzmann Method on Sparse Complex Geometries
IF 5.6 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-12-04 | DOI: 10.1109/TPDS.2024.3510810
Zhangrong Qin;Xusheng Lu;Long Lv;Zhongxiang Tang;Binghai Wen
Many fluid flow problems, such as porous media, arterial blood flow, and tissue fluid, contain sparse complex geometries. Although the lattice Boltzmann method is good at dealing with complex boundaries, these sparse complex geometries cause low computational performance and high memory consumption when the graphics processing unit (GPU) is used to accelerate the numerical computation. These problems can be addressed by a compact memory layout, sophisticated memory access, and enhanced thread utilization. This paper proposes a GPU-based algorithm to improve lattice Boltzmann simulations with sparse complex geometries. An access pattern for a single set of distribution functions together with semi-direct addressing is adopted to reduce memory consumption, while a collected structure of arrays is employed to enhance memory access efficiency. Furthermore, an address index array and a node classification coding scheme are employed to improve the GPU thread utilization ratio and reduce GPU global memory access, respectively. The accuracy and mesh independence have been verified by numerical simulations of Poiseuille flow and porous media flow with face-centered filled spheres. The present algorithm has a significantly lower memory consumption than those based on direct or indirect addressing schemes, and it improves the computational performance by several times compared to other algorithms on common GPU hardware.
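The semi-direct addressing scheme described above can be sketched in a few lines: a dense index array maps every lattice cell to either -1 (solid) or its slot in a compact array that stores distribution functions for fluid cells only, so streaming reads neighbors through the index. This is a NumPy toy, not the paper's CUDA implementation; the geometry, the D2Q5 layout, and the treatment of solid neighbors are simplifying assumptions.

```python
# Toy semi-direct addressing for a sparse lattice: dense index array, compact storage.
import numpy as np

nx, ny = 64, 64
rng = np.random.default_rng(2)
solid = rng.random((nx, ny)) < 0.7            # sparse geometry: ~70% solid obstacles

index = np.full((nx, ny), -1, dtype=np.int32)
fluid_coords = np.argwhere(~solid)
index[tuple(fluid_coords.T)] = np.arange(len(fluid_coords))   # cell -> compact slot

Q = 5                                          # D2Q5: rest + 4 axis directions
f = np.ones((len(fluid_coords), Q), dtype=np.float32)         # only fluid cells stored
velocities = np.array([[0, 0], [1, 0], [-1, 0], [0, 1], [0, -1]])

def stream(f):
    """Pull streaming: each fluid cell reads f_q from its upwind neighbour if it is fluid."""
    f_new = f.copy()
    for q, (cx, cy) in enumerate(velocities):
        src = (fluid_coords - [cx, cy]) % [nx, ny]             # periodic upwind neighbour
        src_idx = index[src[:, 0], src[:, 1]]
        ok = src_idx >= 0                                      # solid neighbours skipped (bounce-back omitted)
        f_new[ok, q] = f[src_idx[ok], q]
    return f_new

f = stream(f)
print("fluid cells:", len(fluid_coords),
      "memory vs dense:", f"{f.nbytes / (nx * ny * Q * 4):.2f}x")
```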
{"title":"An Efficient GPU Algorithm for Lattice Boltzmann Method on Sparse Complex Geometries","authors":"Zhangrong Qin;Xusheng Lu;Long Lv;Zhongxiang Tang;Binghai Wen","doi":"10.1109/TPDS.2024.3510810","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3510810","url":null,"abstract":"Many fluid flow problems, such as the porous media, arterial blood flow and tissue fluid, contain sparse complex geometries. Although the lattice Boltzmann method is good at dealing with the complex boundaries, these sparse complex geometries cause the low computational performance and high memory consumption when the graphics processing unit (GPU) is used to accelerate the numerical computation. These problems would be addressed by compact memory layout, sophisticated memory access and enhanced thread utilization. This paper proposes a GPU-based algorithm to improve the lattice Boltzmann simulations with sparse complex geometries. An access pattern for a single set of distribution functions together with a semi-direct addressing is adopted to reduce memory consumption, while a collected structure of arrays is employed to enhance memory access efficiency. Furthermore, an address index array and a node classification coding scheme are employed to improve the GPU thread utilization ratio and reduce the GPU global memory access, respectively. The accuracy and mesh-independence has been verified by the numerical simulations of Poiseuille flow and porous media flow with face-centered filled spheres. The present algorithm has a significantly lower memory consumption than those based on direct or indirect addressing schemes. It improves the computational performance by several times compared to the other algorithms on the common GPU hardware.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 2","pages":"239-252"},"PeriodicalIF":5.6,"publicationDate":"2024-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142890357","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Object Proxy Patterns for Accelerating Distributed Applications
IF 5.6 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-12-04 | DOI: 10.1109/TPDS.2024.3511347
J. Gregory Pauloski;Valerie Hayot-Sasson;Logan Ward;Alexander Brace;André Bauer;Kyle Chard;Ian Foster
Workflow and serverless frameworks have empowered new approaches to distributed application design by abstracting compute resources. However, their typically limited or one-size-fits-all support for advanced data flow patterns leaves optimization to the application programmer—optimization that becomes more difficult as data become larger. The transparent object proxy, which provides wide-area references that can resolve to data regardless of location, has been demonstrated as an effective low-level building block in such situations. Here we propose three high-level proxy-based programming patterns—distributed futures, streaming, and ownership—that make the power of the proxy pattern usable for more complex and dynamic distributed program structures. We motivate these patterns via careful review of application requirements and describe implementations of each pattern. We evaluate our implementations through a suite of benchmarks and by applying them in three meaningful scientific applications, in which we demonstrate substantial improvements in runtime, throughput, and memory usage.
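A minimal version of the transparent object proxy building block reads as follows: the proxy carries a factory for wide-area data and resolves it lazily on first use, so it can be passed around like the object it stands for. This is a hedged sketch of the pattern, not the authors' actual API; the in-memory STORE, the put helper, and the resolution rules are invented for illustration.

```python
# Toy transparent proxy: cheap to pass around, fetches the real object on first use.
STORE = {}                                        # stands in for a remote object store

class Proxy:
    def __init__(self, factory):
        self._factory = factory                   # how to fetch the real object
        self._target = None                       # filled in on first dereference

    def _resolve(self):
        if self._target is None:
            self._target = self._factory()        # data moves only now
        return self._target

    def __getattr__(self, name):                  # called only when normal lookup fails
        return getattr(self._resolve(), name)

    def __getitem__(self, key):
        return self._resolve()[key]

def put(key, obj):
    """Store the object 'remotely' and hand back a lazy, pass-by-reference proxy."""
    STORE[key] = obj
    return Proxy(lambda: STORE[key])

ref = put("weights", {"layer1": [0.1, 0.2], "layer2": [0.3]})
# `ref` is cheap to ship to a worker; the dict is fetched only when dereferenced.
print(ref["layer1"], list(ref.keys()))
```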
{"title":"Object Proxy Patterns for Accelerating Distributed Applications","authors":"J. Gregory Pauloski;Valerie Hayot-Sasson;Logan Ward;Alexander Brace;André Bauer;Kyle Chard;Ian Foster","doi":"10.1109/TPDS.2024.3511347","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3511347","url":null,"abstract":"Workflow and serverless frameworks have empowered new approaches to distributed application design by abstracting compute resources. However, their typically limited or one-size-fits-all support for advanced data flow patterns leaves optimization to the application programmer—optimization that becomes more difficult as data become larger. The transparent object proxy, which provides wide-area references that can resolve to data regardless of location, has been demonstrated as an effective low-level building block in such situations. Here we propose three high-level proxy-based programming patterns—distributed futures, streaming, and ownership—that make the power of the proxy pattern usable for more complex and dynamic distributed program structures. We motivate these patterns via careful review of application requirements and describe implementations of each pattern. We evaluate our implementations through a suite of benchmarks and by applying them in three meaningful scientific applications, in which we demonstrate substantial improvements in runtime, throughput, and memory usage.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 2","pages":"253-265"},"PeriodicalIF":5.6,"publicationDate":"2024-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142890356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms
IF 5.6 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-11-28 | DOI: 10.1109/TPDS.2024.3507814
Zhongyi Lin;Ning Sun;Pallab Bhattacharya;Xizhou Feng;Louis Feng;John D. Owens
Characterizing and predicting the training performance of modern machine learning (ML) workloads on compute systems with compute and communication spread between CPUs, GPUs, and network devices is not only the key to optimization and planning but also a complex goal to achieve. The primary challenges include the complexity of synchronization and load balancing between CPUs and GPUs, the variance in input data distribution, and the use of different communication devices and topologies (e.g., NVLink, PCIe, network cards) that connect multiple compute devices, coupled with the desire for flexible training configurations. Built on top of our prior work for single-GPU platforms, we address these challenges and enable multi-GPU performance modeling by incorporating (1) data-distribution-aware performance models for embedding table lookup, and (2) data movement prediction of communication collectives, into our upgraded performance modeling pipeline equipped with inter- and intra-rank synchronization for ML workloads trained on multi-GPU platforms. Beyond accurately predicting the per-iteration training time of deep learning recommendation (DLRM) models with random configurations with a geomean error of 5.21% on two multi-GPU platforms, our prediction pipeline generalizes well to other types of ML workloads, such as Transformer-based natural language processing (NLP) models with a geomean error of 3.00%. Moreover, even without actually running ML workloads like DLRMs on the hardware, it is capable of generating insights such as quickly selecting the fastest embedding table sharding configuration (with a success rate of 85%).
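The compositional flavor of such a performance model can be conveyed with a back-of-the-envelope sketch that combines an embedding-lookup term, a dense-compute term, and alpha-beta style communication terms into a per-iteration estimate. It is not the paper's learned, data-distribution-aware model; every formula, overlap assumption, and constant below is invented.

```python
# Toy additive/critical-path model of one DLRM-style training iteration on n GPUs.
def embedding_lookup_time(lookups_per_table, row_dim=128, bw_gbps=600):
    """Bytes gathered across all tables divided by assumed memory bandwidth."""
    bytes_moved = sum(n * row_dim * 4 for n in lookups_per_table)   # fp32 rows
    return bytes_moved / (bw_gbps * 1e9)

def collective_time(msg_bytes, n_gpus, link_gbps=300, latency_us=10):
    """Simple alpha-beta model for an all-to-all / all-reduce across n_gpus."""
    return latency_us * 1e-6 + (msg_bytes * (n_gpus - 1) / n_gpus) / (link_gbps * 1e9)

def dense_compute_time(flops, peak_tflops=150, efficiency=0.45):
    return flops / (peak_tflops * 1e12 * efficiency)

def per_iteration_time(batch, n_gpus):
    emb = embedding_lookup_time([batch * 4] * 8)                 # 8 tables, 4 pooled lookups each
    mlp = dense_compute_time(flops=2 * batch * 5e6)              # made-up MLP FLOP count
    a2a = collective_time(msg_bytes=batch * 8 * 128 * 4, n_gpus=n_gpus)
    grad = collective_time(msg_bytes=50e6, n_gpus=n_gpus)        # dense-gradient all-reduce
    return max(emb, mlp) + a2a + grad                            # assume lookups overlap with compute

for gpus in (2, 4, 8):
    print(f"{gpus} GPUs: ~{per_iteration_time(batch=2048, n_gpus=gpus) * 1e3:.2f} ms/iteration")
```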
{"title":"Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms","authors":"Zhongyi Lin;Ning Sun;Pallab Bhattacharya;Xizhou Feng;Louis Feng;John D. Owens","doi":"10.1109/TPDS.2024.3507814","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3507814","url":null,"abstract":"Characterizing and predicting the training performance of modern machine learning (ML) workloads on compute systems with compute and communication spread between CPUs, GPUs, and network devices is not only the key to optimization and planning but also a complex goal to achieve. The primary challenges include the complexity of synchronization and load balancing between CPUs and GPUs, the variance in input data distribution, and the use of different communication devices and topologies (e.g., NVLink, PCIe, network cards) that connect multiple compute devices, coupled with the desire for flexible training configurations. Built on top of our prior work for single-GPU platforms, we address these challenges and enable multi-GPU performance modeling\u0000<sup>1</sup>\u0000 by incorporating (1) data-distribution-aware performance models for embedding table lookup, and (2) data movement prediction of communication collectives, into our upgraded performance modeling pipeline equipped with inter-and intra-rank synchronization for ML workloads trained on multi-GPU platforms. Beyond accurately predicting the per-iteration training time of deep learning recommendation models (DLRM) models with random configurations with a geomean error of 5.21% on two multi-GPU platforms, our prediction pipeline generalizes well to other types of ML workloads, such as Transformer-based natural language processing (NLP) models with a geomean error of 3.00%. Moreover, even without actually running ML workloads like DLRMs on the hardware, it is capable of generating insights such as quickly selecting the fastest embedding table sharding configuration (with a success rate of 85%).","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 2","pages":"226-238"},"PeriodicalIF":5.6,"publicationDate":"2024-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142890381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0