Benchmarking the Performance of Large Language Models on the Cerebras Wafer Scale Engine

arXiv - CS - Distributed, Parallel, and Cluster Computing. Pub Date: 2024-08-30. DOI: arxiv-2409.00287

Zuoning Zhang, Dhruv Parikh, Youning Zhang, Viktor Prasanna


Citations: 0

Abstract

Transformer-based Large Language Models (LLMs) have recently reached state-of-the-art performance in the Natural Language Processing (NLP) and Computer Vision (CV) domains. LLMs use the Multi-Headed Self-Attention (MHSA) mechanism to capture long-range, global attention relationships among input words or image patches, drastically improving their performance over prior deep learning approaches. In this paper, we evaluate the performance of LLMs on the Cerebras Wafer Scale Engine (WSE). The Cerebras WSE is a high-performance computing system with 2.6 trillion transistors, 850,000 cores, and 40 GB of on-chip memory. Its Sparse Linear Algebra Compute (SLAC) cores eliminate multiply-by-zero operations, and its 40 GB of on-chip memory is uniformly distributed among the SLAC cores, enabling fast local access to model parameters. Moreover, Cerebras software configures the routing between cores at runtime to reduce communication overhead among them. As LLMs become more widely used, new hardware architectures are needed to accelerate LLM training and inference. We benchmark the effectiveness of this hardware architecture at accelerating LLM training and inference. Additionally, we analyze whether the Cerebras WSE can scale the memory wall associated with traditionally memory-bound compute tasks using its 20 PB/s high-bandwidth memory. Furthermore, we examine the performance scalability of the Cerebras WSE through a roofline model. By plotting performance metrics against computational intensity, we aim to assess its effectiveness at handling highly compute-intensive LLM training and inference tasks.
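As a rough illustration of the roofline analysis mentioned in the abstract, the sketch below computes the standard roofline bound: attainable performance is the smaller of peak compute and memory bandwidth multiplied by computational intensity. The 20 PB/s bandwidth figure comes from the abstract; the peak-compute value and the function names are hypothetical placeholders chosen only for illustration, since the abstract does not state them. This is not the authors' benchmarking code.

```python
def attainable_flops(intensity, peak_flops, mem_bandwidth):
    """Roofline bound in FLOP/s.

    intensity:     computational intensity in FLOP per byte moved
    peak_flops:    peak compute throughput in FLOP/s
    mem_bandwidth: memory bandwidth in bytes/s
    """
    if intensity < 0:
        raise ValueError("intensity must be non-negative")
    # Below the ridge point the memory term is smaller (memory-bound);
    # above it the compute ceiling applies (compute-bound).
    return min(peak_flops, mem_bandwidth * intensity)


def ridge_point(peak_flops, mem_bandwidth):
    """Intensity (FLOP/byte) at which the two roofline segments meet."""
    return peak_flops / mem_bandwidth


# 20 PB/s is the on-chip memory bandwidth quoted in the abstract.
MEM_BANDWIDTH = 20e15  # bytes/s
# ASSUMPTION: placeholder peak throughput, NOT a figure from the paper.
PEAK_FLOPS = 1e15  # FLOP/s

if __name__ == "__main__":
    ridge = ridge_point(PEAK_FLOPS, MEM_BANDWIDTH)
    for intensity in (0.01, 0.1, 1.0, 10.0):
        perf = attainable_flops(intensity, PEAK_FLOPS, MEM_BANDWIDTH)
        regime = "memory-bound" if intensity < ridge else "compute-bound"
        print(f"intensity={intensity:>6} FLOP/byte  "
              f"attainable={perf:.3e} FLOP/s  ({regime})")
```

With a very high memory bandwidth, the ridge point moves toward low intensities, so more kernels fall in the compute-bound region; whether that holds for LLM workloads on the WSE is the question the paper sets out to measure.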



