Efficient scaling of large language models with mixture of experts and 3D analog in-memory computing

Nature Computational Science 5(1), 13–26 · Pub Date: 2025-01-08 · DOI: 10.1038/s43588-024-00753-x · IF 12.0 · Q1, Computer Science, Interdisciplinary Applications
Julian Büchel, Athanasios Vasilopoulos, William Andrew Simon, Irem Boybat, HsinYu Tsai, Geoffrey W. Burr, Hernan Castro, Bill Filipiak, Manuel Le Gallo, Abbas Rahimi, Vijay Narayanan, Abu Sebastian
{"title":"Efficient scaling of large language models with mixture of experts and 3D analog in-memory computing","authors":"Julian Büchel, Athanasios Vasilopoulos, William Andrew Simon, Irem Boybat, HsinYu Tsai, Geoffrey W. Burr, Hernan Castro, Bill Filipiak, Manuel Le Gallo, Abbas Rahimi, Vijay Narayanan, Abu Sebastian","doi":"10.1038/s43588-024-00753-x","DOIUrl":null,"url":null,"abstract":"Large language models (LLMs), with their remarkable generative capacities, have greatly impacted a range of fields, but they face scalability challenges due to their large parameter counts, which result in high costs for training and inference. The trend of increasing model sizes is exacerbating these challenges, particularly in terms of memory footprint, latency and energy consumption. Here we explore the deployment of ‘mixture of experts’ (MoEs) networks—networks that use conditional computing to keep computational demands low despite having many parameters—on three-dimensional (3D) non-volatile memory (NVM)-based analog in-memory computing (AIMC) hardware. When combined with the MoE architecture, this hardware, utilizing stacked NVM devices arranged in a crossbar array, offers a solution to the parameter-fetching bottleneck typical in traditional models deployed on conventional von-Neumann-based architectures. By simulating the deployment of MoEs on an abstract 3D AIMC system, we demonstrate that, due to their conditional compute mechanism, MoEs are inherently better suited to this hardware than conventional, dense model architectures. Our findings suggest that MoEs, in conjunction with emerging 3D NVM-based AIMC, can substantially reduce the inference costs of state-of-the-art LLMs, making them more accessible and energy-efficient. This study shows a viable pathway to the efficient deployment of state-of-the-art large language models using mixture of experts on 3D analog in-memory computing hardware.","PeriodicalId":74246,"journal":{"name":"Nature computational science","volume":"5 1","pages":"13-26"},"PeriodicalIF":12.0000,"publicationDate":"2025-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Nature computational science","FirstCategoryId":"1085","ListUrlMain":"https://www.nature.com/articles/s43588-024-00753-x","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
引用次数: 0

Abstract

Large language models (LLMs), with their remarkable generative capacities, have greatly impacted a range of fields, but they face scalability challenges due to their large parameter counts, which result in high costs for training and inference. The trend of increasing model sizes is exacerbating these challenges, particularly in terms of memory footprint, latency and energy consumption. Here we explore the deployment of ‘mixture of experts’ (MoEs) networks—networks that use conditional computing to keep computational demands low despite having many parameters—on three-dimensional (3D) non-volatile memory (NVM)-based analog in-memory computing (AIMC) hardware. When combined with the MoE architecture, this hardware, utilizing stacked NVM devices arranged in a crossbar array, offers a solution to the parameter-fetching bottleneck typical in traditional models deployed on conventional von-Neumann-based architectures. By simulating the deployment of MoEs on an abstract 3D AIMC system, we demonstrate that, due to their conditional compute mechanism, MoEs are inherently better suited to this hardware than conventional, dense model architectures. Our findings suggest that MoEs, in conjunction with emerging 3D NVM-based AIMC, can substantially reduce the inference costs of state-of-the-art LLMs, making them more accessible and energy-efficient. This study shows a viable pathway to the efficient deployment of state-of-the-art large language models using mixture of experts on 3D analog in-memory computing hardware.
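The conditional-compute mechanism referenced in the abstract is, at its core, a learned router that sends each token through only a small subset of expert sub-networks, so most of the model's parameters are untouched for any given token. The sketch below is a minimal illustration of top-k gating for a single token (Python/NumPy; the function and variable names are illustrative, not taken from the paper). On the 3D AIMC hardware the paper considers, each expert's weights would remain stationary in its own crossbar tile, so experts that the router does not select never incur a parameter fetch.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def moe_layer(x, router_w, expert_weights, top_k=2):
    """Route one token through only top_k of the available experts.

    x              : (d_model,) input token activation
    router_w       : (d_model, n_experts) router (gating) weights
    expert_weights : list of (d_model, d_model) expert weight matrices;
                     on 3D AIMC hardware each would sit in its own
                     stationary crossbar tile, so unselected experts
                     are never fetched from memory.
    """
    gate_logits = x @ router_w                  # (n_experts,)
    gate_probs = softmax(gate_logits)
    top = np.argsort(gate_probs)[-top_k:]       # indices of the selected experts
    # Renormalize the gate probabilities over the selected experts and
    # combine only their outputs; all other experts are skipped entirely.
    weights = gate_probs[top] / gate_probs[top].sum()
    return sum(w * (x @ expert_weights[i]) for w, i in zip(weights, top))

# Toy usage: 8 experts, only 2 of which are evaluated per token.
rng = np.random.default_rng(0)
d_model, n_experts = 16, 8
x = rng.standard_normal(d_model)
router_w = rng.standard_normal((d_model, n_experts))
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
y = moe_layer(x, router_w, experts, top_k=2)
print(y.shape)  # (16,)
```

In a dense model, every weight matrix participates in every forward pass; in the MoE layer above, compute and (on von Neumann hardware) weight traffic scale with top_k rather than with the total number of experts, which is why the architecture pairs naturally with in-memory computing.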
