Ran Yan, Youhe Jiang, Wangcheng Tao, Xiaonan Nie, Bin Cui, Binhang Yuan
{"title":"FlashFlex:适应异构环境下的大型语言模型训练","authors":"Ran Yan, Youhe Jiang, Wangcheng Tao, Xiaonan Nie, Bin Cui, Binhang Yuan","doi":"arxiv-2409.01143","DOIUrl":null,"url":null,"abstract":"Training large language model (LLM) is a computationally intensive task,\nwhich is typically conducted in data centers with homogeneous high-performance\nGPUs. This paper explores an alternative approach by deploying the training\ncomputation across heterogeneous GPUs to enable better flexibility and\nefficiency for heterogeneous resource utilization. To achieve this goal, we\npropose a novel system, FlashFlex, that can flexibly support an asymmetric\npartition of the parallel training computations across the scope of data-,\npipeline-, and tensor model parallelism. We further formalize the allocation of\nasymmetric partitioned training computations over a set of heterogeneous GPUs\nas a constrained optimization problem and propose an efficient solution based\non a hierarchical graph partitioning algorithm. Our approach can adaptively\nallocate asymmetric training computations across GPUs, fully leveraging the\navailable computational power. We conduct extensive empirical studies to\nevaluate the performance of FlashFlex, where we find that when training LLMs at\ndifferent scales (from 7B to 30B), FlashFlex can achieve comparable training\nMFU when running over a set of heterogeneous GPUs compared with the state of\nthe art training systems running over a set of homogeneous high-performance\nGPUs with the same amount of total peak FLOPS. The achieved smallest gaps in\nMFU are 11.61% and 0.30%, depending on whether the homogeneous setting is\nequipped with and without RDMA. Our implementation is available at\nhttps://github.com/Relaxed-System-Lab/FlashFlex.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"17 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"FlashFlex: Accommodating Large Language Model Training over Heterogeneous Environment\",\"authors\":\"Ran Yan, Youhe Jiang, Wangcheng Tao, Xiaonan Nie, Bin Cui, Binhang Yuan\",\"doi\":\"arxiv-2409.01143\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Training large language model (LLM) is a computationally intensive task,\\nwhich is typically conducted in data centers with homogeneous high-performance\\nGPUs. This paper explores an alternative approach by deploying the training\\ncomputation across heterogeneous GPUs to enable better flexibility and\\nefficiency for heterogeneous resource utilization. To achieve this goal, we\\npropose a novel system, FlashFlex, that can flexibly support an asymmetric\\npartition of the parallel training computations across the scope of data-,\\npipeline-, and tensor model parallelism. We further formalize the allocation of\\nasymmetric partitioned training computations over a set of heterogeneous GPUs\\nas a constrained optimization problem and propose an efficient solution based\\non a hierarchical graph partitioning algorithm. Our approach can adaptively\\nallocate asymmetric training computations across GPUs, fully leveraging the\\navailable computational power. 
We conduct extensive empirical studies to\\nevaluate the performance of FlashFlex, where we find that when training LLMs at\\ndifferent scales (from 7B to 30B), FlashFlex can achieve comparable training\\nMFU when running over a set of heterogeneous GPUs compared with the state of\\nthe art training systems running over a set of homogeneous high-performance\\nGPUs with the same amount of total peak FLOPS. The achieved smallest gaps in\\nMFU are 11.61% and 0.30%, depending on whether the homogeneous setting is\\nequipped with and without RDMA. Our implementation is available at\\nhttps://github.com/Relaxed-System-Lab/FlashFlex.\",\"PeriodicalId\":501422,\"journal\":{\"name\":\"arXiv - CS - Distributed, Parallel, and Cluster Computing\",\"volume\":\"17 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Distributed, Parallel, and Cluster Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.01143\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Distributed, Parallel, and Cluster Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.01143","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
FlashFlex: Accommodating Large Language Model Training over Heterogeneous Environment
Training large language models (LLMs) is a computationally intensive task,
which is typically conducted in data centers with homogeneous high-performance
GPUs. This paper explores an alternative approach: deploying the training
computation across heterogeneous GPUs to enable more flexible and efficient
utilization of heterogeneous resources. To achieve this goal, we
propose a novel system, FlashFlex, that can flexibly support an asymmetric
partition of the parallel training computations across the scope of data-,
pipeline-, and tensor model parallelism. We further formalize the allocation of
asymmetrically partitioned training computations over a set of heterogeneous GPUs
as a constrained optimization problem and propose an efficient solution based
on a hierarchical graph partitioning algorithm. Our approach can adaptively
allocate asymmetric training computations across GPUs, fully leveraging the
available computational power. We conduct extensive empirical studies to
evaluate the performance of FlashFlex, where we find that when training LLMs at
different scales (from 7B to 30B), FlashFlex running over a set of heterogeneous
GPUs achieves training MFU comparable to state-of-the-art training systems
running over a set of homogeneous high-performance GPUs with the same total peak
FLOPS. The smallest achieved gaps in MFU are 11.61% and 0.30%, depending on
whether the homogeneous setting is equipped with RDMA or not. Our implementation
is available at
https://github.com/Relaxed-System-Lab/FlashFlex.
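
To give a rough feel for the hierarchical allocation idea described above, the
following toy sketch (a hedged illustration, not the FlashFlex algorithm; the GPU
names, capacities, and greedy heuristics are hypothetical) treats the problem as a
two-level partition: first group heterogeneous GPUs into pipeline-parallel groups
with roughly balanced aggregate peak FLOPS, then give each group an asymmetric
share of the model's layers proportional to its compute. The real system additionally
models communication bandwidth and solves a constrained optimization via hierarchical
graph partitioning, which this sketch ignores.

    # Toy two-level allocation sketch (hypothetical, not the FlashFlex implementation).
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class GPU:
        name: str
        peak_tflops: float  # peak compute of this device

    def balanced_groups(gpus: List[GPU], num_groups: int) -> List[List[GPU]]:
        """Level 1: greedily place each GPU into the currently lightest group."""
        groups: List[List[GPU]] = [[] for _ in range(num_groups)]
        loads = [0.0] * num_groups
        for gpu in sorted(gpus, key=lambda g: g.peak_tflops, reverse=True):
            i = loads.index(min(loads))
            groups[i].append(gpu)
            loads[i] += gpu.peak_tflops
        return groups

    def assign_layers(groups: List[List[GPU]], num_layers: int) -> List[int]:
        """Level 2: asymmetric split, layers per group proportional to group compute."""
        totals = [sum(g.peak_tflops for g in grp) for grp in groups]
        grand = sum(totals)
        layers = [round(num_layers * t / grand) for t in totals]
        layers[-1] += num_layers - sum(layers)  # absorb rounding drift
        return layers

    if __name__ == "__main__":
        cluster = [GPU("A100-0", 312), GPU("A100-1", 312),
                   GPU("3090-0", 71), GPU("3090-1", 71),
                   GPU("3090-2", 71), GPU("3090-3", 71)]
        groups = balanced_groups(cluster, num_groups=2)
        layers = assign_layers(groups, num_layers=32)
        for grp, n in zip(groups, layers):
            print([g.name for g in grp], "->", n, "layers")

Running the sketch on the example cluster yields two mixed A100/3090 groups with
equal aggregate compute, each assigned 16 of the 32 layers; with a more skewed
cluster the layer shares become asymmetric, which is the flexibility the abstract
refers to.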