Efficient scaling of large language models with mixture of experts and 3D analog in-memory computing

Nature Computational Science (IF 12; JCR Q1, Computer Science, Interdisciplinary Applications). Pub Date: 2025-01-08. DOI: 10.1038/s43588-024-00753-x

Julian Büchel, Athanasios Vasilopoulos, William Andrew Simon, Irem Boybat, HsinYu Tsai, Geoffrey W. Burr, Hernan Castro, Bill Filipiak, Manuel Le Gallo, Abbas Rahimi, Vijay Narayanan, Abu Sebastian

Nature Computational Science, vol. 5, no. 1, pp. 13-26. https://www.nature.com/articles/s43588-024-00753-x


Abstract

Large language models (LLMs), with their remarkable generative capacities, have greatly impacted a range of fields, but they face scalability challenges due to their large parameter counts, which result in high costs for training and inference. The trend of increasing model sizes is exacerbating these challenges, particularly in terms of memory footprint, latency and energy consumption. Here we explore the deployment of ‘mixture of experts’ (MoEs) networks—networks that use conditional computing to keep computational demands low despite having many parameters—on three-dimensional (3D) non-volatile memory (NVM)-based analog in-memory computing (AIMC) hardware. When combined with the MoE architecture, this hardware, utilizing stacked NVM devices arranged in a crossbar array, offers a solution to the parameter-fetching bottleneck typical in traditional models deployed on conventional von-Neumann-based architectures. By simulating the deployment of MoEs on an abstract 3D AIMC system, we demonstrate that, due to their conditional compute mechanism, MoEs are inherently better suited to this hardware than conventional, dense model architectures. Our findings suggest that MoEs, in conjunction with emerging 3D NVM-based AIMC, can substantially reduce the inference costs of state-of-the-art LLMs, making them more accessible and energy-efficient. This study shows a viable pathway to the efficient deployment of state-of-the-art large language models using mixture of experts on 3D analog in-memory computing hardware.
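The conditional computing idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes top-k routing, a common MoE formulation that the abstract does not confirm as the authors' exact setup. The class `ToyMoELayer`, its parameters (`dim`, `num_experts`, `top_k`) and the random weights are hypothetical and purely illustrative; nothing here models the 3D AIMC hardware itself.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToyMoELayer:
    """Illustrative top-k mixture-of-experts layer (hypothetical, not the paper's model)."""

    def __init__(self, dim, num_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.top_k = top_k
        # Router: one score row per expert.
        self.router = [[rng.gauss(0, 1) for _ in range(dim)]
                       for _ in range(num_experts)]
        # Each expert is a single dim x dim weight matrix (assumption for brevity).
        self.experts = [[[rng.gauss(0, 1) for _ in range(dim)]
                         for _ in range(dim)]
                        for _ in range(num_experts)]

    def forward(self, token):
        # Score every expert, but evaluate only the top_k highest-scoring ones.
        logits = matvec(self.router, token)
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        chosen = ranked[:self.top_k]
        gates = softmax([logits[i] for i in chosen])
        output = [0.0] * self.dim
        for gate, idx in zip(gates, chosen):
            expert_out = matvec(self.experts[idx], token)
            output = [o + gate * y for o, y in zip(output, expert_out)]
        return output, chosen

    def param_counts(self):
        # Expert parameters stored in total versus those used for one token.
        per_expert = self.dim * self.dim
        return len(self.experts) * per_expert, self.top_k * per_expert


layer = ToyMoELayer(dim=4, num_experts=8, top_k=2)
output, chosen = layer.forward([1.0, 0.5, -0.5, 2.0])
total_params, active_params = layer.param_counts()
print(f"experts used: {chosen}; active/total expert params: {active_params}/{total_params}")
```

In this toy setting the layer stores eight experts but computes with only two per token, so the per-token compute stays at a quarter of the stored expert parameters. That gap between stored and active parameters is the property the abstract refers to as keeping computational demands low despite having many parameters.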


