Optically Connected Multi-Stack HBM Modules for Large Language Model Training and Inference

IEEE Computer Architecture Letters; IF 1.4; Zone 3 (Computer Science); Q4 (Computer Science, Hardware & Architecture); Pub Date: 2025-02-18; DOI: 10.1109/LCA.2025.3540058

Yanghui Ou;Hengrui Zhang;Austin Rovinski;David Wentzlaff;Christopher Batten

Citations: 0

Abstract

Large language models (LLMs) have grown exponentially in size, presenting significant challenges to traditional memory architectures. Current high bandwidth memory (HBM) systems are constrained by chiplet I/O bandwidth and the limited number of HBM stacks that can be integrated due to packaging constraints. In this letter, we propose a novel memory system architecture that leverages silicon photonic interconnects to increase memory capacity and bandwidth for compute devices. By introducing optically connected multi-stack HBM modules, we extend the HBM memory system off the compute chip, significantly increasing the number of HBM stacks. Our evaluations show that this architecture can improve training efficiency for a trillion-parameter model by 1.4× compared to a modeled A100 baseline, while also enhancing inference performance by 4.2× if the L2 is modified to provide sufficient bandwidth.
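
The capacity and bandwidth argument in the abstract can be illustrated with a first-order sketch. This is not the authors' evaluation model: the `MemorySystem` class, the `devices_for_weights` helper, and every number below (stack count, per-stack capacity and bandwidth, link and L2 bandwidth) are hypothetical values chosen only to show the reasoning. The sketch assumes that effective bandwidth is set by the slowest stage between the HBM stacks and the compute cores, which is one way to read the abstract's caveat about the L2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemorySystem:
    """Hypothetical first-order model of an HBM-based memory system."""

    num_stacks: int            # HBM stacks reachable by one compute device
    stack_capacity_gb: int     # capacity per stack, in GB (assumed)
    stack_bandwidth_gbs: int   # bandwidth per stack, in GB/s (assumed)
    link_bandwidth_gbs: int    # total chip I/O or optical link bandwidth (assumed)
    l2_bandwidth_gbs: int      # bandwidth the L2 can sustain (assumed)

    def capacity_gb(self) -> int:
        """Total capacity grows linearly with the number of stacks."""
        return self.num_stacks * self.stack_capacity_gb

    def effective_bandwidth_gbs(self) -> int:
        """The slowest stage on the path bounds the usable bandwidth."""
        return min(
            self.num_stacks * self.stack_bandwidth_gbs,
            self.link_bandwidth_gbs,
            self.l2_bandwidth_gbs,
        )


def devices_for_weights(num_params: int, bytes_per_param: int, capacity_gb: int) -> int:
    """Minimum devices needed just to hold the weights (ceiling division)."""
    total_bytes = num_params * bytes_per_param
    capacity_bytes = capacity_gb * 10**9
    return -(-total_bytes // capacity_bytes)


# Hypothetical on-package configuration: few stacks, limited by chip I/O.
on_package = MemorySystem(
    num_stacks=5, stack_capacity_gb=16, stack_bandwidth_gbs=400,
    link_bandwidth_gbs=2000, l2_bandwidth_gbs=4000,
)

# Hypothetical optically connected configuration: more stacks off the chip.
optical = MemorySystem(
    num_stacks=20, stack_capacity_gb=16, stack_bandwidth_gbs=400,
    link_bandwidth_gbs=8000, l2_bandwidth_gbs=4000,
)

print(on_package.capacity_gb(), on_package.effective_bandwidth_gbs())  # 80 2000
print(optical.capacity_gb(), optical.effective_bandwidth_gbs())        # 320 4000

# A trillion parameters at 2 bytes each is 2 TB of weights alone.
print(devices_for_weights(10**12, 2, on_package.capacity_gb()))  # 25
print(devices_for_weights(10**12, 2, optical.capacity_gb()))     # 7
```

With these assumed numbers, quadrupling the stack count quadruples capacity, but effective bandwidth only doubles because the unmodified L2 becomes the bottleneck. The sketch counts weights only and ignores activations, optimizer state, and KV caches.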

Source journal: IEEE Computer Architecture Letters (Computer Science, Hardware & Architecture)

CiteScore: 4.60

Self-citation rate: 4.30%

About the journal: IEEE Computer Architecture Letters is a rigorously peer-reviewed forum for publishing early, high-impact results in the areas of uni- and multiprocessor computer systems, computer architecture, microarchitecture, workload characterization, performance evaluation and simulation techniques, and power-aware computing. Submissions are welcomed on any topic in computer architecture, especially but not limited to: microprocessor and multiprocessor systems, microarchitecture and ILP processors, workload characterization, performance evaluation and simulation techniques, compiler-hardware and operating system-hardware interactions, interconnect architectures, memory and cache systems, power and thermal issues at the architecture level, I/O architectures and techniques, independent validation of previously published results, analysis of unsuccessful techniques, domain-specific processor architectures (e.g., embedded, graphics, network), real-time and high-availability architectures, and reconfigurable systems.
