WallFacer: Guiding Transformer Model Training Out of the Long-Context Dark Forest with N-body Problem

Ziming Liu, Shaoyu Wang, Shenggan Cheng, Zhongkai Zhao, Yang Bai, Xuanlei Zhao, James Demmel, Yang You

arXiv:2407.00611 (CS - Distributed, Parallel, and Cluster Computing), published June 30, 2024
Citations: 0
Abstract
In recent years, Transformer-based Large Language Models (LLMs) have garnered significant attention due to their exceptional performance across a variety of tasks. However, training these models on long sequences presents a substantial challenge in terms of efficiency and scalability. Current methods are constrained either by the number of attention heads, which limits scalability, or by excessive communication overhead. In this paper, we propose the insight that attention computation can be considered a special case of the n-body problem with direct interactions. Based on this concept, this paper introduces WallFacer, an efficient long-sequence training system with a novel multi-dimensional ring sequence parallelism, which provides an efficient communication paradigm and additional tuning space for arranging communication. Through comprehensive experiments under diverse environments and model settings, we demonstrate that WallFacer significantly surpasses the state-of-the-art method that supports near-infinite sequence lengths, achieving performance improvements of up to 77.12%.
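To make the n-body analogy concrete, the sketch below shows blockwise attention computed as a direct-interaction problem: each "rank" owns one query block and accumulates contributions from every key/value block as the K/V blocks would circulate around a ring. This is a minimal single-process NumPy simulation for illustration only, not the WallFacer implementation or its multi-dimensional parallelism; all names (`ring_attention`, the block count, the online-softmax accumulators) are assumptions made for this example.

```python
# Minimal sketch (not the WallFacer system): attention as an n-body-style
# direct-interaction computation. Each simulated "rank" holds one query block
# and folds in every key/value block over successive "ring steps", using an
# online softmax so only one K/V block is needed at a time.
import numpy as np

def ring_attention(q_blocks, k_blocks, v_blocks):
    """Compute softmax(QK^T / sqrt(d)) V block by block, streaming K/V blocks."""
    n = len(q_blocks)
    outputs = []
    for i in range(n):                      # each "rank" owns query block i
        q = q_blocks[i]
        m = np.full(q.shape[0], -np.inf)    # running row-wise max (for stability)
        l = np.zeros(q.shape[0])            # running softmax denominator
        acc = np.zeros((q.shape[0], v_blocks[0].shape[1]))
        for step in range(n):               # n ring steps: receive the next K/V block
            j = (i + step) % n
            s = q @ k_blocks[j].T / np.sqrt(q.shape[1])   # pairwise "interactions"
            m_new = np.maximum(m, s.max(axis=1))
            p = np.exp(s - m_new[:, None])
            scale = np.exp(m - m_new)       # rescale previous partial results
            l = l * scale + p.sum(axis=1)
            acc = acc * scale[:, None] + p @ v_blocks[j]
            m = m_new
        outputs.append(acc / l[:, None])
    return np.vstack(outputs)

# Usage: split a sequence of 8 tokens into 4 blocks and compare with dense attention.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 16)) for _ in range(3))
split4 = lambda X: np.split(X, 4)
out = ring_attention(split4(Q), split4(K), split4(V))

S = Q @ K.T / np.sqrt(16)
ref = np.exp(S - S.max(axis=1, keepdims=True))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ V
assert np.allclose(out, ref)
```

In a real distributed setting, the inner loop's block index `j` would correspond to a K/V block arriving from a neighboring device, so communication can overlap with the per-block computation; arranging those transfers across more than one ring dimension is the tuning space the abstract refers to.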