
Latest Publications in IEEE Computer Architecture Letters

LLMServingSim2.0: A Unified Simulator for Heterogeneous Hardware and Serving Techniques in LLM Infrastructure
IF 1.4 · CAS Tier 3, Computer Science · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-11-03 · DOI: 10.1109/LCA.2025.3628325
Jaehong Cho;Hyunmin Choi;Jongse Park
This letter introduces LLMServingSim2.0, a system simulator designed for exploring heterogeneous hardware in large-scale LLM serving systems. LLMServingSim2.0 addresses two key limitations of its predecessor: (1) integrating hardware models into system-level simulators is non-trivial due to the lack of a clear abstraction, and (2) existing simulators support only a narrow subset of serving techniques, leaving no infrastructure that captures the breadth of approaches in modern LLM serving. To overcome these issues, LLMServingSim2.0 adopts trace-driven performance modeling, accompanied by an operator-level latency profiler, enabling the integration of new accelerators with a single command. It further embeds up-to-date serving techniques while exposing flexible interfaces for request routing, cache management, and scheduling policies. In a TPU case study, our profiler requires 18.5× fewer LoC and outperforms the predecessor’s hardware-simulator integration, demonstrating LLMServingSim2.0’s low-effort hardware extensibility. Our experiments further show that LLMServingSim2.0 reproduces GPU-based LLM serving with 1.9% error, while maintaining practical simulation time, making it a comprehensive platform for both hardware developers and LLM service providers.
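To make the trace-driven modeling idea concrete, the following minimal Python sketch replays an operator-level latency profile over a layer trace to estimate decode-step latency. All operator names and latency numbers are hypothetical placeholders, and the snippet does not reflect LLMServingSim2.0's actual interfaces.

```python
# Minimal sketch of trace-driven performance modeling: an operator-level
# latency profile (hypothetical numbers, microseconds) is replayed against
# a model's operator trace to estimate per-step latency. Profile format and
# operator names are illustrative, not LLMServingSim2.0's API.
OP_LATENCY_US = {                     # produced once per accelerator by a profiler
    ("gemm", 4096, 4096): 180.0,
    ("attention", 4096): 95.0,
    ("layernorm", 4096): 6.0,
}

def estimate_layer_latency(trace):
    """Sum profiled latencies over the operator trace of one layer."""
    return sum(OP_LATENCY_US[op] for op in trace)

layer_trace = [("layernorm", 4096), ("attention", 4096),
               ("gemm", 4096, 4096), ("layernorm", 4096)]
num_layers = 32
print(f"estimated decode-step latency: "
      f"{num_layers * estimate_layer_latency(layer_trace) / 1e3:.2f} ms")
```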
Citations: 0
StreamDQ: HBM-Integrated On-the-Fly DeQuantization via Memory Load for Large Language Models
IF 1.4 · CAS Tier 3, Computer Science · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-10-31 · DOI: 10.1109/LCA.2025.3626929
Minki Jeong;Daegun Yoon;Soohong Ahn;Seungyong Lee;Jooyoung Kim;Jinuk Jeon;Joonseop Sim;Youngpyo Joo;Hoshik Kim
As large language models (LLMs) scale, their memory and computation demands increase, making weight-only quantization a widely adopted technique to reduce memory footprint with minimal accuracy loss. However, CUDA core–based dequantization introduces significant instruction overhead, memory traffic, and pipeline stalls across all batch sizes, and critically remains a persistent bottleneck even in large-batch, cloud-scale LLM serving. To address these challenges, we propose StreamDQ, a lightweight architectural enhancement for cloud-scale LLM inference that enables on-the-fly dequantization within the memory subsystem by integrating compact DeQuantization Blocks (DQBs) into the base-die of high-bandwidth memory (HBM). StreamDQ leverages reserved address bits for control signaling, requiring no modifications to the ISA or compiler. Our design minimizes data movement, offloads computation from CUDA cores, and delivers dequantized weights directly to tensor cores for general matrix multiplication (GEMM) execution. StreamDQ achieves up to 83.57% reduction in inference latency and 5.15× improvement in tokens-per-second throughput, with only 0.013 mm² area and 0.17 W power overhead per DQB. The design is scalable, software-transparent, and well-suited for high-throughput LLM inference on modern HBM-enabled GPU platforms.
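The snippet below illustrates the dequantization arithmetic that weight-only quantization requires on every weight load (per-group INT4 with a floating-point scale). It is a plain NumPy sketch of the general technique; StreamDQ's contribution, moving this step into the HBM base die, is not modeled here.

```python
import numpy as np

# Weight-only quantization sketch: 4-bit weights with a per-group scale.
# This shows the dequantization step that StreamDQ offloads from CUDA cores
# into the HBM base die; here it runs in NumPy purely for illustration.
GROUP = 128
rng = np.random.default_rng(0)
w = rng.standard_normal((4096,)).astype(np.float16)

# Quantize: per-group symmetric INT4 (levels -8..7).
w_groups = w.reshape(-1, GROUP).astype(np.float32)
scales = np.abs(w_groups).max(axis=1, keepdims=True) / 7.0
q = np.clip(np.round(w_groups / scales), -8, 7).astype(np.int8)

# Dequantize on load (the work a CUDA-core kernel would otherwise perform).
w_dequant = (q.astype(np.float32) * scales).reshape(-1).astype(np.float16)
print("max abs dequantization error:", float(np.abs(w - w_dequant).max()))
```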
Citations: 0
MoSKA: Mixture of Shared KV Attention for Efficient Long-Sequence LLM Inference
IF 1.4 · CAS Tier 3, Computer Science · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-10-31 · DOI: 10.1109/LCA.2025.3627539
Myunghyun Rhee;Sookyung Choi;Euiseok Kim;Joonseop Sim;Youngpyo Joo;Hoshik Kim
The escalating context length in Large Language Models (LLMs) creates a severe performance bottleneck around the Key-Value (KV) cache, whose memory-bound nature leads to significant GPU under-utilization. This paper introduces Mixture of Shared KV Attention (MoSKA), an architecture that addresses this challenge by exploiting the heterogeneity of context data. It differentiates between per-request unique and massively reused shared sequences. The core of MoSKA is a novel Shared KV Attention mechanism that transforms the attention on shared data from a series of memory-bound GEMV operations into a single, compute-bound GEMM by batching concurrent requests. This is supported by an MoE-inspired sparse attention strategy that prunes the search space and a tailored Disaggregated Infrastructure that specializes hardware for unique and shared data. This comprehensive approach demonstrates a throughput increase of up to 538.7× over baselines in workloads with high context sharing, offering a clear architectural path toward scalable LLM inference.
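The GEMV-to-GEMM transformation can be illustrated in a few lines of NumPy: scoring B concurrent queries against one shared K block as a single matrix-matrix product instead of B matrix-vector products. The shapes are arbitrary, and softmax, the V product, and per-request unique KV are omitted.

```python
import numpy as np

# Sketch of the GEMV-to-GEMM idea: B concurrent requests attend to the same
# shared KV block. Per request, scores = K @ q is a memory-bound GEMV that
# re-reads K; batching the queries turns it into one compute-dense GEMM
# that reads K once. Softmax and the value product are omitted for brevity.
B, T, d = 16, 8192, 128                      # requests, shared tokens, head dim
rng = np.random.default_rng(0)
K_shared = rng.standard_normal((T, d)).astype(np.float32)
Q = rng.standard_normal((B, d)).astype(np.float32)

scores_gemv = np.stack([K_shared @ Q[b] for b in range(B)])  # B separate GEMVs
scores_gemm = Q @ K_shared.T                                 # one GEMM
assert np.allclose(scores_gemv, scores_gemm, atol=1e-4)
```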
Citations: 0
Characterizing the System Overhead of Discrete Noise Generation for Differential Privacy
IF 1.4 · CAS Tier 3, Computer Science · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-10-31 · DOI: 10.1109/LCA.2025.3627101
SeokHyeon Kong;Donghwan Kim;Euiseong Seo;Kiwan Maeng
Differential privacy (DP) has become a de facto standard in restricting data leakage when releasing statistical information. It is widely used in data-centric applications, ranging from the US Census Bureau's population data collection to modern deep learning training. Recent studies have shown that using floating-point in implementing DP can significantly degrade its mathematical guarantees and strongly advised to use an integer-based implementation instead. However, nearly all popular DP libraries currently use floating-point. In this paper, we characterize the performance of a recent integer-based DP library from Google. Our study reveals that the noise generation is significantly slower (by 187–296×) when using an integer-based implementation, and that noise sampling can become a non-negligible overhead in real applications. Our findings highlight an overlooked but important overhead in realizing high-privacy DP and call for greater focus from the community.
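As a rough illustration of why integer-based noise generation can be slower, the sketch below samples two-sided geometric ("discrete Laplace") noise via a textbook integer construction and times it against NumPy's floating-point Laplace sampler. This is not Google's library, and the measured gap will not match the letter's 187–296× figures.

```python
import math
import time
import numpy as np

# One textbook integer-only noise construction for DP: two-sided geometric
# ("discrete Laplace") noise as the difference of two geometric draws,
# timed against NumPy's vectorized floating-point Laplace sampler.
# Illustrative only; not the integer-based library characterized in the letter.
def discrete_laplace(scale: float) -> int:
    p = 1.0 - math.exp(-1.0 / scale)          # success probability per trial
    return int(np.random.geometric(p)) - int(np.random.geometric(p))

N, scale = 100_000, 10.0
t0 = time.perf_counter()
_ = [discrete_laplace(scale) for _ in range(N)]
t1 = time.perf_counter()
_ = np.random.laplace(0.0, scale, size=N)     # floating-point baseline
t2 = time.perf_counter()
print(f"integer discrete noise: {t1 - t0:.3f} s   float Laplace: {t2 - t1:.3f} s")
```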
Citations: 0
In-Depth Characterization of Machine Learning on an Optimized Multi-Party Computing Library
IF 1.4 · CAS Tier 3, Computer Science · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-10-23 · DOI: 10.1109/LCA.2025.3624787
Jinyu Liu;Kiwan Maeng
Secure multi-party computation (MPC) allows multiple parties to collaboratively run machine learning (ML) training and inference without each party revealing its secret data or model weights. Prior works characterized popular MPC-based ML libraries, such as Meta’s CrypTen, to reveal their system overheads and built optimizations guided by the observations. However, we found potential concerns in this process. Through a careful inspection of the CrypTen library, we discovered several inefficient implementations that could overshadow fundamental MPC-related overheads. Furthermore, we observed that the characteristics can vary significantly depending on several factors, such as the model type, batch size, sequence length, and network conditions, many of which prior works do not vary during their evaluation. Our results indicate that focusing solely on a narrow experimental setup and/or relying on characterization without a deep understanding can misguide researchers and call for a more mature framework and standardized evaluation methodology.
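For readers unfamiliar with the primitive underneath CrypTen-style libraries, the following sketch shows additive secret sharing over a 64-bit ring, where only the sum of all shares reveals a value and local addition is communication-free. It is illustrative only and omits fixed-point encoding, Beaver triples, and networking, none of which the abstract describes.

```python
import random

# Minimal additive secret-sharing sketch over a 64-bit ring -- the primitive
# underlying MPC ML libraries such as CrypTen. Each party holds one random
# share; only the sum of all shares (mod the ring size) reveals the value.
RING = 2**64
random.seed(0)

def share(x: int, n_parties: int = 3):
    shares = [random.randrange(RING) for _ in range(n_parties - 1)]
    shares.append((x - sum(shares)) % RING)   # last share makes the sum work out
    return shares

def reveal(shares):
    return sum(shares) % RING

assert reveal(share(42)) == 42
# Addition is "free": each party adds its local shares, no communication needed.
assert reveal([a + b for a, b in zip(share(10), share(32))]) == 42
```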
Citations: 0
A Multiple-Aspect Optimal CNN Accelerator in Top1 Accuracy, Performance, and Power Efficiency
IF 1.4 · CAS Tier 3, Computer Science · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-10-22 · DOI: 10.1109/LCA.2025.3624004
Xianghong Hu;Yuanmiao Lin;Xueming Li;Ruidian Zhan;Jie Cao;Dayong Zhu;Shuting Cai;Xin Zheng;Xiaoming Xiong
The customization of accelerators for low-bit and mixed-bit convolutional neural networks (CNNs) has been a promising approach to enhance the computing efficiency of CNNs. However, current low-bit and mixed-bit accelerators sacrifice some network accuracy to achieve higher performance and power efficiency, especially for lightweight CNNs like MobileNets containing depth-wise convolution (DW-CONV). These accelerators for low-bit or mixed-bit CNNs tend to achieve good results in only one aspect, such as performance, power efficiency, or Top1 accuracy, and it is difficult for them to achieve good experimental results in all aspects. In this work, we propose an accelerator that performs well across multiple aspects, including performance, power efficiency, and Top1 accuracy. First, an arbitrary-basis quantization (ABQ) method is used to enhance Top1 accuracy, and a dedicated ABQ-based processing element (PE) is proposed to improve performance. Then, an adaptive data flow is presented to support standard convolution (SD-CONV) and depth-wise convolution (DW-CONV) efficiently without increasing the hardware consumption of the ABQ-based PE. Implemented on the Zynq ZC706 platform and compared with other works, the proposed accelerator is the first to achieve good experimental results in all aspects, achieving 1.28×–5.76× power efficiency and 1.11×–5.81× performance while delivering the best Top1 accuracy.
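A back-of-envelope MAC count, sketched below, shows why DW-CONV warrants a separate data flow from SD-CONV: for the same output size it performs far fewer multiply-accumulates per fetched operand. The layer shape is an illustrative MobileNet-like example, not one taken from the letter.

```python
# Back-of-envelope MAC counts showing why depth-wise convolution (DW-CONV)
# needs a different data flow than standard convolution (SD-CONV): for the
# same feature-map size it has far fewer MACs, hence much lower arithmetic
# intensity per weight/activation fetched. Shapes are illustrative.
H = W = 56          # output feature-map height and width
C_in = C_out = 128  # input and output channels
K = 3               # kernel size

macs_standard  = H * W * C_out * C_in * K * K   # each output sums over all C_in
macs_depthwise = H * W * C_in * K * K           # one filter per input channel

print(f"SD-CONV MACs: {macs_standard:,}")
print(f"DW-CONV MACs: {macs_depthwise:,}")
print(f"ratio: {macs_standard / macs_depthwise:.0f}x fewer MACs for DW-CONV")
```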
Citations: 0
PNM Meets Sparse Attention: Enabling Multi-Million Tokens Inference at Scale
IF 1.4 · CAS Tier 3, Computer Science · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-10-22 · DOI: 10.1109/LCA.2025.3624272
Sookyung Choi;Myunghyun Rhee;Euiseok Kim;Kwangsik Shin;Youngpyo Joo;Hoshik Kim
Processing multi-million tokens for advanced Large Language Models (LLMs) poses a significant memory bottleneck for existing AI systems. This bottleneck stems from a fundamental resource imbalance, where enormous memory capacity and bandwidth are required, yet the computational load is minimal. We propose NELSSA (Processing Near Memory for Extremely Long Sequences with Sparse Attention), an architectural platform that synergistically combines the high-capacity Processing Near Memory (PNM) with the principles of dynamic sparse attention to address this issue. This approach enables capacity scaling without performance degradation, and our evaluation shows that NELSSA can process up to 20M-token sequences on a single node (Llama-2-70B), achieving an 11× to 40× speedup over a representative DIMM-based PNM system. The proposed architecture radically resolves existing inefficiencies, enabling previously impractical multi-million-token processing and thus laying the foundation for next-generation AI applications.
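The sketch below shows the generic dynamic sparse-attention pattern that such a platform exploits: score the cached keys but aggregate values only over the top-k tokens, so only a small fraction of the KV data is streamed for the weighted sum. The exact top-k selection used here is an assumption for illustration, not NELSSA's policy.

```python
import numpy as np

# Generic dynamic sparse attention: compute scores over cached keys, but run
# softmax and value aggregation only over the top-k tokens. The exact top-k
# selection is illustrative and is not NELSSA's policy; it just shows why
# only a small fraction of the KV cache needs full-bandwidth access.
def sparse_attention(q, K, V, k=256):
    scores = K @ q / np.sqrt(q.shape[0])          # (T,) similarity scores
    top = np.argpartition(scores, -k)[-k:]        # indices of the k largest
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()
    return w @ V[top]                             # only k rows of V are touched

T, d = 100_000, 128                               # cached tokens, head dim
rng = np.random.default_rng(0)
K = rng.standard_normal((T, d), dtype=np.float32)
V = rng.standard_normal((T, d), dtype=np.float32)
q = rng.standard_normal(d).astype(np.float32)
out = sparse_attention(q, K, V, k=256)
print(out.shape)    # (128,) -- aggregated from only 256 of 100k value rows
```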
Citations: 0
Reimagining RDMA Through the Lens of ML
IF 1.4 · CAS Tier 3, Computer Science · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-10-22 · DOI: 10.1109/LCA.2025.3624158
Ertza Warraich;Ali Imran;Annus Zulfiqar;Shay Vargaftik;Sonia Fahmy;Muhammad Shahbaz
As distributed machine learning (ML) workloads scale to thousands of GPUs connected by ultra-high-speed interconnects, tail latency in collective communication has emerged as a primary bottleneck. Prior RDMA designs, like RoCE, IRN, and SRNIC, enforce strict reliability and in-order delivery, relying on retransmissions and packet sequencing to ensure correctness. While effective for general-purpose workloads, these mechanisms introduce complexity and latency that scale poorly, where even rare packet losses or delays can consistently degrade system performance. We introduce Celeris, a domain-specific RDMA transport that revisits traditional reliability guarantees based on ML’s tolerance for lost or partial data. Celeris removes retransmissions and in-order delivery from the RDMA NIC, enabling best-effort transport that exploits the robustness of ML workloads. It retains congestion control (e.g., DCQCN) and manages communication with software-level mechanisms such as adaptive timeouts and data prioritization, while shifting loss recovery to the ML pipeline (e.g., using the Hadamard Transform). Early results show that Celeris reduces 99th-percentile latency by up to 2.3×, cuts BRAM usage by 67%, and nearly doubles NIC resilience to faults—delivering a resilient, scalable transport tailored for ML at cluster scale.
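The abstract mentions shifting loss recovery to the ML pipeline, for example via the Hadamard Transform. The sketch below illustrates that general idea only, assuming nothing about Celeris's actual protocol: the transform spreads each gradient value across all coefficients, so randomly dropped coefficients become dispersed noise rather than holes in specific gradient entries.

```python
import numpy as np

# Why a Hadamard transform helps tolerate lost packets: it spreads each
# gradient value across all transform coefficients, so dropping a random
# subset of coefficients turns into noise spread evenly over all entries
# after the inverse transform, instead of zeroing specific gradient values.
def fwht(x):
    """Fast Walsh-Hadamard transform (unnormalized); len(x) must be 2^k."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x

rng = np.random.default_rng(0)
n = 1 << 12
grad = rng.standard_normal(n)
coeffs = fwht(grad)
coeffs[rng.random(n) < 0.05] = 0.0           # 5% of coefficients "lost"
recovered = fwht(coeffs) / n                 # inverse transform = forward / n
print("relative L2 error:", np.linalg.norm(recovered - grad) / np.linalg.norm(grad))
```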
Citations: 0
A Partial Tag–Data Decoupled Architecture for Last-Level Cache Optimization
IF 1.4 · CAS Tier 3, Computer Science · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-10-20 · DOI: 10.1109/LCA.2025.3623137
Honghui Liu;Xian Lin;Xin Zheng;Qiancheng Liu;Huaien Gao;Shuting Cai;Xiaoming Xiong
Modern processors rely on the last-level cache to bridge the growing latency gap between the CPU core and main memory. However, the memory access patterns of contemporary applications exhibit increasing complexity, characterized by significant temporal locality, irregular reuse, and high conflict rates. We propose a partial tag-data decoupling architecture that leverages temporal locality without modifying the main cache structure or replacement policy. A lightweight auxiliary tag path is introduced, where data is allocated only upon reuse confirmation, thus minimizing resource waste caused by low-reuse blocks. The experimental results show that the proposed design achieves an average IPC improvement of 1.55% and a 5.33% reduction in MPKI without prefetching. With prefetching enabled, IPC improves by 1.96% and MPKI is further reduced by 10.91%, while overall storage overhead is decreased by approximately 2.59%.
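A toy model of the reuse-confirmation idea is sketched below: an auxiliary tag-only structure remembers blocks that have been seen once, and a block earns a data entry only when it is touched again. The structure sizes and LRU policy are arbitrary assumptions, not the paper's design.

```python
from collections import OrderedDict

# Toy model of reuse confirmation via an auxiliary tag path: a block's data
# is allocated only after its tag has been seen before, so single-use blocks
# never occupy data storage. Sizes and LRU eviction are arbitrary choices,
# not the paper's structure or replacement policy.
class ReuseFilteredCache:
    def __init__(self, data_entries=4, tag_entries=16):
        self.data = OrderedDict()   # tag -> data, allocated only on reuse
        self.tags = OrderedDict()   # tag-only entries awaiting reuse proof
        self.data_entries, self.tag_entries = data_entries, tag_entries

    def access(self, tag):
        if tag in self.data:                 # hit in the data array
            self.data.move_to_end(tag)
            return "data-hit"
        if tag in self.tags:                 # reuse confirmed: allocate data
            del self.tags[tag]
            self.data[tag] = object()
            if len(self.data) > self.data_entries:
                self.data.popitem(last=False)
            return "promoted"
        self.tags[tag] = None                # first touch: tag only, no data
        if len(self.tags) > self.tag_entries:
            self.tags.popitem(last=False)
        return "miss"

c = ReuseFilteredCache()
print([c.access(t) for t in ["A", "B", "A", "A", "C", "B"]])
# ['miss', 'miss', 'promoted', 'data-hit', 'miss', 'promoted']
```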
Citations: 0
Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System
IF 1.4 · CAS Tier 3, Computer Science · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-10-17 · DOI: 10.1109/LCA.2025.3622724
Yunhua Fang;Rui Xie;Asad Ul Haq;Linsen Ma;Kaoutar El Maghraoui;Naigang Wang;Meng Wang;Liu Liu;Tong Zhang
Large Language Model (LLM) inference is increasingly constrained by memory bandwidth, with frequent access to the key-value (KV) cache dominating data movement. While attention sparsity reduces some memory traffic, the relevance of past tokens varies over time, requiring the full KV cache to remain accessible and sustaining pressure on both bandwidth and capacity. With advances in interconnects such as NVLink and LPDDR5X, modern AI hardware now integrates high-bandwidth memory (HBM) with high-speed off-package DRAM, making heterogeneous memory systems a practical solution. This work investigates dynamic KV cache placement across such systems to maximize aggregated bandwidth utilization under capacity constraints. Rather than proposing a specific scheduling policy, we formulate the placement problem mathematically and derive a theoretical upper bound, revealing substantial headroom for runtime optimization. To our knowledge, this is the first formal treatment of dynamic KV cache scheduling in heterogeneous memory systems for LLM inference.
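To make the placement problem concrete, the sketch below uses a simple hotness-greedy heuristic, which is not the letter's formulation or its upper bound: fill HBM with the most frequently accessed KV blocks per unit size, spill the rest to off-package DRAM, and estimate the streaming time per decode step. All capacities, bandwidths, and block statistics are illustrative placeholders.

```python
# Greedy KV-cache placement sketch (not the letter's formulation): fill HBM
# with the hottest blocks per GB, spill the rest to off-package DRAM, and
# estimate the time to stream all accessed bytes from each tier. Capacities,
# bandwidths, and block statistics are illustrative placeholders.
HBM_GBps, DRAM_GBps = 3000.0, 500.0
HBM_CAP_GB = 80.0

blocks = [                      # (block id, size in GB, accesses per step)
    ("layer0", 2.0, 40), ("layer1", 2.0, 35), ("layer2", 2.0, 30),
    ("old_ctx", 60.0, 5), ("cold_ctx", 40.0, 1),
]

def greedy_placement(blocks, hbm_cap):
    used, hbm, dram = 0.0, [], []
    # hottest first, measured as accesses per GB of capacity consumed
    for name, size, freq in sorted(blocks, key=lambda b: b[2] / b[1], reverse=True):
        if used + size <= hbm_cap:
            hbm.append((name, size, freq)); used += size
        else:
            dram.append((name, size, freq))
    return hbm, dram

def step_time_s(hbm, dram):
    t_hbm = sum(size * freq for _, size, freq in hbm) / HBM_GBps
    t_dram = sum(size * freq for _, size, freq in dram) / DRAM_GBps
    return max(t_hbm, t_dram)   # tiers stream concurrently; the slower one dominates

hbm, dram = greedy_placement(blocks, HBM_CAP_GB)
print("HBM:", [b[0] for b in hbm], " DRAM:", [b[0] for b in dram])
print(f"estimated step time: {step_time_s(hbm, dram):.3f} s")
```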
Citations: 0