
Latest publications in IEEE Computer Architecture Letters

Disaggregated Speculative Decoding for Carbon-Efficient LLM Serving
IF 1.4 · CAS Tier 3, Computer Science · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-11-06 · DOI: 10.1109/LCA.2025.3630094
Tianyao Shi;Yanran Wu;Sihang Liu;Yi Ding
Large language models (LLMs) are increasingly deployed in practice but incur significant computational costs and environmental impacts. Disaggregated serving techniques, particularly decoupling prefill and decoding (DPD) across GPUs, have been introduced to improve performance and reduce carbon emissions. However, DPD suffers from high bandwidth overhead due to frequent large KV cache transfers. To address this, we present disaggregated speculative decoding (DSD), which leverages speculative decoding by assigning draft models to older GPUs and target models to newer GPUs, requiring only token and probability distribution transfers. Building on this insight, we introduce GreenLLM, an SLO- and bandwidth-aware framework that unifies DPD and DSD, profiles workload characteristics, and dynamically selects the most carbon-efficient configuration. Across diverse benchmarks, GreenLLM reduces carbon emissions by up to 40.6% while meeting latency SLOs.
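The core idea, shipping only draft tokens and their probability distributions between GPU tiers, is easiest to see next to the standard speculative-sampling accept/reject rule. Below is a toy Python sketch of one such round: draft_step and target_step stand in for real models on the older and newer GPUs, and the vocabulary size, the number of draft tokens, and the distributions are invented for illustration, not taken from GreenLLM.

```python
# Minimal sketch of the token/probability hand-off that disaggregated speculative
# decoding (DSD) relies on: the draft "GPU" ships only proposed token ids and their
# probability vectors, never a KV cache. Models and distributions are toy stand-ins.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 64          # toy vocabulary size (assumption)
GAMMA = 4           # number of draft tokens proposed per round (assumption)

def draft_step(ctx):
    """Stand-in for the draft model on the older GPU: returns a distribution."""
    return rng.dirichlet(np.ones(VOCAB))

def target_step(ctx):
    """Stand-in for the target model on the newer GPU."""
    return rng.dirichlet(np.ones(VOCAB))

def speculative_round(ctx):
    """One DSD round: the draft proposes GAMMA tokens, the target verifies them.
    Only token ids and probability vectors cross the GPU boundary."""
    proposals, draft_probs = [], []
    local_ctx = list(ctx)
    for _ in range(GAMMA):
        q = draft_step(local_ctx)
        t = int(rng.choice(VOCAB, p=q))
        proposals.append(t)
        draft_probs.append(q)
        local_ctx.append(t)

    accepted = []
    for t, q in zip(proposals, draft_probs):
        p = target_step(ctx + accepted)
        if rng.random() < min(1.0, p[t] / q[t]):       # standard accept test
            accepted.append(t)
        else:
            resid = np.maximum(p - q, 0.0)              # resample from the residual
            resid /= resid.sum()
            accepted.append(int(rng.choice(VOCAB, p=resid)))
            break
    return accepted

print(speculative_round([1, 2, 3]))
```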
Citations: 0
Enhancing DCIM Efficiency with Multi-Storage-Row Architecture for Edge AI Workloads
IF 1.4 · CAS Tier 3, Computer Science · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-11-05 · DOI: 10.1109/LCA.2025.3629390
Xiaoyu Sun;Haruki Mori;Wei-Chang Zhao;Je-Min Hung;Hidehiro Fujiwara;Brian Crafton;Bo Zhang;Win-San Khwa;Yu-Der Chih;Tsung-Yung Jonathan Chang;Kerem Akarvardar
Digital Compute-in-Memory (DCIM) has emerged as a promising solution for accelerating Artificial Intelligence (AI) workloads, especially at the edge. However, DCIM macros can suffer inefficiencies from the frequent weight reloads required by the block-wise (tiled) computation (BWC) scheme, which is itself imposed by limited accumulator buffer capacity. To address this challenge, we leverage a multi-storage-row (MSR) architecture that integrates multiple storage cells per multiplier within a DCIM macro. This approach transforms costly weight buffer accesses into highly efficient local row-switching (RS) operations, significantly reducing weight reloading overhead and enhancing accelerator performance. Using technology parameters extracted from our 3 nm SRAM-based MSR-DCIM testchip, coupled with architectural statistics from our in-house analytical framework, we evaluated power, performance, and area (PPA) across several Deep Neural Network (DNN) workloads. Our findings on four edge DNN workloads, including convolutional neural networks (CNNs) and large language models (LLMs), demonstrate that the MSR architecture enables BWC-based execution with lower latency and energy consumption than a baseline DCIM macro. A case study reveals that an MSR-DCIM macro with 16 storage rows achieves 15%–55% savings in energy-delay product (EDP) across the selected workloads, underscoring its potential for efficient edge AI acceleration.
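A back-of-envelope model helps show where the savings come from: with S storage rows per multiplier, up to S weight tiles stay resident and are revisited via cheap row switches rather than buffer reloads. The sketch below is a simplified analytical estimate, not the paper's model; the tile counts, reuse rounds, and per-access energies are illustrative placeholders rather than the 3 nm testchip data.

```python
# Back-of-envelope model of why multi-storage-row (MSR) DCIM cuts weight traffic.
# All energy numbers are arbitrary units chosen for illustration only.

def weight_move_energy(n_tiles, reuse_rounds, storage_rows,
                       e_buffer_reload=10.0, e_row_switch=0.5):
    """Energy spent moving weights for one layer under block-wise computation.

    n_tiles      : weight tiles the layer is split into
    reuse_rounds : how many times each tile is revisited because the
                   accumulator buffer cannot hold the full output
    storage_rows : storage cells per multiplier (1 = baseline DCIM)
    """
    resident = min(storage_rows, n_tiles)
    # Resident tiles are loaded once and revisited via local row switches;
    # the remaining tiles must be re-fetched from the buffer every round.
    reloads  = resident + (n_tiles - resident) * reuse_rounds
    switches = resident * (reuse_rounds - 1) if storage_rows > 1 else 0
    return reloads * e_buffer_reload + switches * e_row_switch

baseline = weight_move_energy(n_tiles=64, reuse_rounds=8, storage_rows=1)
msr16    = weight_move_energy(n_tiles=64, reuse_rounds=8, storage_rows=16)
print(f"baseline: {baseline:.0f}  MSR-16: {msr16:.0f}  saving: {1 - msr16/baseline:.1%}")
```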
Citations: 0
I/O-ETEM: An I/O-Aware Approach for Estimating Execution Time of Machine Learning Workloads
IF 1.4 · CAS Tier 3, Computer Science · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-11-04 · DOI: 10.1109/LCA.2025.3620629
Elham Adibi;Mohammadamin Ajdari;Pouria Arefijamal;Amirsaeed Ahmadi-Tonekaboni;Hossein Asadi
Training Machine Learning (ML) models commonly relies on High-Performance Computing (HPC) centers or cloud servers that accommodate compute nodes with powerful resources. Users tend to enhance the accuracy of ML applications through continuous training, after model refinements and dataset size increases. With such application changes and the heterogeneity of HPC nodes, knowing each job's execution time in advance (i.e., prediction) is necessary for efficient job scheduling. We observe that I/O accesses highly influence the execution time of modern ML applications. Unfortunately, existing studies on estimating job execution time either (a) rely on overestimated user-declared time, or (b) predict execution time mainly based on compute resources (ignoring I/O or storage effects), and use complex deep learning models for this purpose. In this paper, we propose a simple, yet effective method for predicting the execution time of ML training. Our approach explicitly accounts for I/O accesses as a critical factor. Our method combines (a) partial application execution & monitoring, (b) analytical modeling leveraging ML application characteristics, (c) dynamic re-estimation, and (d) simplified history-based analysis. Our evaluation on a number of Convolutional Neural Networks (CNNs) and Transformer models shows that our proposed method predicts the execution time accurately (i.e., with error less than 8% in most cases) compared to actual execution.
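The prediction recipe, profiling a short slice of execution, accounting for I/O separately, extrapolating, and re-estimating as the job runs, can be illustrated in a few lines. The sketch below is a minimal interpretation of that idea, not the authors' implementation; the per-iteration timings and iteration count are synthetic.

```python
# Minimal sketch of I/O-aware execution-time estimation: measure a few iterations,
# split each into compute and I/O components, extrapolate, then re-estimate as
# more measurements arrive. All numbers are synthetic placeholders.
from statistics import mean

def estimate_total_time(io_samples, compute_samples, total_iters):
    """Extrapolate total runtime from profiled iterations, with I/O counted explicitly."""
    per_iter = mean(io_samples) + mean(compute_samples)
    return per_iter * total_iters

# Synthetic per-iteration measurements (seconds) from a short profiling run.
io_times      = [0.021, 0.019, 0.024, 0.020]   # data loading / storage reads
compute_times = [0.055, 0.054, 0.056, 0.055]   # forward + backward passes

total_iters = 50_000
initial = estimate_total_time(io_times, compute_times, total_iters)
print(f"initial estimate: {initial/3600:.2f} h")

# Dynamic re-estimation: fold in new measurements as the job runs, so drift in
# I/O behaviour (cache warm-up, shared-storage contention) is captured.
io_times      += [0.035, 0.033]
compute_times += [0.055, 0.056]
updated = estimate_total_time(io_times, compute_times, total_iters)
print(f"updated estimate: {updated/3600:.2f} h")
```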
Citations: 0
Rethinking In-Memory Hash Table Design for CXL-Based Main Memory Compression
IF 1.4 · CAS Tier 3, Computer Science · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-11-04 · DOI: 10.1109/LCA.2025.3628805
Teresa Zhang
Main memory compression has re-emerged as a viable solution to DRAM cost challenges, enabled by CXL memory devices that support hardware-managed, coarse-grained compression. This shift decouples logical memory space usage from physical consumption, opening new opportunities to re-think the design and implementation of in-memory data structures. This letter presents case studies on blocked variants of chained and Cuckoo hashing that align naturally with the memory compression granularity. Targeting large-scale in-memory data stores, blocked hash tables trade logical sparsity for implementation efficiency while leveraging compression to avoid physical memory waste. We develop mathematical formulations to enable theoretical analysis of key metrics, including memory usage saving and operational throughput. Our results highlight the potential of compression-aware data structures to better leverage modern memory hierarchies and motivate further exploration in this largely untapped design space.
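A toy version of a blocked chained hash table makes the trade-off concrete: buckets are chains of fixed-size blocks sized to a (hypothetical) compression granule, so empty slots inflate logical but not physical memory on a compressing CXL device. The sketch below is an illustrative data-structure skeleton, not the letter's implementation; BLOCK_SLOTS and the bucket count are assumptions.

```python
# Toy blocked chained hash table: each bucket is a chain of fixed-size blocks
# aligned to an assumed compression granule, so sparsely filled blocks cost
# logical but (ideally) little physical memory under hardware compression.
BLOCK_SLOTS = 8   # slots per block; assumed to match the compression granule

class BlockedChainedHashTable:
    def __init__(self, n_buckets=1024):
        # Each bucket starts with one block; a block is a fixed-length slot list.
        self.buckets = [[[None] * BLOCK_SLOTS] for _ in range(n_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        chain = self._bucket(key)
        for block in chain:
            for i, slot in enumerate(block):
                if slot is None or slot[0] == key:
                    block[i] = (key, value)
                    return
        # All blocks full: chain a fresh (mostly empty, hence compressible) block.
        new_block = [None] * BLOCK_SLOTS
        new_block[0] = (key, value)
        chain.append(new_block)

    def get(self, key):
        for block in self._bucket(key):
            for slot in block:
                if slot is not None and slot[0] == key:
                    return slot[1]
        raise KeyError(key)

t = BlockedChainedHashTable(n_buckets=4)
for k in range(40):
    t.insert(f"k{k}", k)
assert t.get("k17") == 17
```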
Citations: 0
LLMServingSim2.0: A Unified Simulator for Heterogeneous Hardware and Serving Techniques in LLM Infrastructure
IF 1.4 · CAS Tier 3, Computer Science · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-11-03 · DOI: 10.1109/LCA.2025.3628325
Jaehong Cho;Hyunmin Choi;Jongse Park
This letter introduces LLMServingSim2.0, a system simulator designed for exploring heterogeneous hardware in large-scale LLM serving systems. LLMServingSim2.0 addresses two key limitations of its predecessor: (1) integrating hardware models into system-level simulators is non-trivial due to the lack of a clear abstraction, and (2) existing simulators support only a narrow subset of serving techniques, leaving no infrastructure that captures the breadth of approaches in modern LLM serving. To overcome these issues, LLMServingSim2.0 adopts trace-driven performance modeling, accompanied by an operator-level latency profiler, enabling the integration of new accelerators with a single command. It further embeds up-to-date serving techniques while exposing flexible interfaces for request routing, cache management, and scheduling policies. In a TPU case study, our profiler requires 18.5× fewer LoC and outperforms the predecessor’s hardware-simulator integration, demonstrating LLMServingSim2.0’s low-effort hardware extensibility. Our experiments further show that LLMServingSim2.0 reproduces GPU-based LLM serving with 1.9% error, while maintaining practical simulation time, making it a comprehensive platform for both hardware developers and LLM service providers.
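Trace-driven performance modeling of the kind described here boils down to replaying an operator trace against a per-accelerator latency profile. The sketch below illustrates that mechanism only; the operator names, shapes, latencies, and trace are invented and do not reflect LLMServingSim2.0's actual interfaces.

```python
# Sketch of trace-driven performance modeling: a profiler measures per-operator
# latencies once per accelerator, and the simulator replays an operator trace
# against that table. All entries below are illustrative, not real profile data.
from collections import defaultdict

# Operator-level latency profile (microseconds) for a hypothetical accelerator.
profile = {
    ("gemm", 4096): 310.0,
    ("gemm", 11008): 820.0,
    ("attention", 2048): 95.0,
    ("allreduce", 4096): 40.0,
}

def simulate(trace, profile):
    """Replay a (op, shape, count) trace and accumulate modeled latency."""
    total, per_op = 0.0, defaultdict(float)
    for op, shape, count in trace:
        lat = profile[(op, shape)] * count
        per_op[op] += lat
        total += lat
    return total, dict(per_op)

decode_step = [("attention", 2048, 32), ("gemm", 4096, 64),
               ("gemm", 11008, 32), ("allreduce", 4096, 32)]
total_us, breakdown = simulate(decode_step, profile)
print(f"modeled decode-step latency: {total_us/1000:.2f} ms", breakdown)
```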
Citations: 0
StreamDQ: HBM-Integrated On-the-Fly DeQuantization via Memory Load for Large Language Models
IF 1.4 · CAS Tier 3, Computer Science · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-10-31 · DOI: 10.1109/LCA.2025.3626929
Minki Jeong;Daegun Yoon;Soohong Ahn;Seungyong Lee;Jooyoung Kim;Jinuk Jeon;Joonseop Sim;Youngpyo Joo;Hoshik Kim
As large language models (LLMs) scale, their memory and computation demands increase, making weight-only quantization a widely adopted technique to reduce memory footprint with minimal accuracy loss. However, CUDA core–based dequantization introduces significant instruction overhead, memory traffic, and pipeline stalls across all batch sizes, and critically remains a persistent bottleneck even in large-batch, cloud-scale LLM serving. To address these challenges, we propose StreamDQ, a lightweight architectural enhancement for cloud-scale LLM inference that enables on-the-fly dequantization within the memory subsystem by integrating compact DeQuantization Blocks (DQBs) into the base-die of high-bandwidth memory (HBM). StreamDQ leverages reserved address bits for control signaling, requiring no modifications to the ISA or compiler. Our design minimizes data movement, offloads computation from CUDA cores, and delivers dequantized weights directly to tensor cores for general matrix multiplication (GEMM) execution. StreamDQ achieves up to an 83.57% reduction in inference latency and a 5.15× improvement in tokens-per-second throughput, with only 0.013 mm² area and 0.17 W power overhead per DQB. The design is scalable, software-transparent, and well-suited for high-throughput LLM inference on modern HBM-enabled GPU platforms.
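The work a DeQuantization Block performs on the load path is the standard weight-only dequantization step: expand grouped low-bit integers back to FP16 using per-group scales and zero points. The NumPy sketch below emulates that step on the host purely for illustration; the 4-bit asymmetric scheme and group size are assumptions, not StreamDQ's specified format.

```python
# Host-side emulation of what a DeQuantization Block does on the load path:
# weights stored as 4-bit integers plus per-group scales/zero points are
# expanded to FP16 before the GEMM. Group size and scheme are assumptions.
import numpy as np

GROUP = 128  # weights sharing one scale/zero point (assumption)

def quantize_int4(w):
    """Per-group asymmetric 4-bit quantization of a 1-D weight vector."""
    w = w.reshape(-1, GROUP)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15.0
    zero = np.round(-w_min / scale)
    q = np.clip(np.round(w / scale + zero), 0, 15).astype(np.uint8)
    return q, scale.astype(np.float16), zero.astype(np.float16)

def dequantize_int4(q, scale, zero):
    """The DQB's job: (q - zero) * scale, streamed out as FP16."""
    return ((q.astype(np.float16) - zero) * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
q, s, z = quantize_int4(w)
w_hat = dequantize_int4(q, s, z)
print("max abs reconstruction error:", float(np.max(np.abs(w - w_hat.astype(np.float32)))))
```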
Citations: 0
MoSKA: Mixture of Shared KV Attention for Efficient Long-Sequence LLM Inference
IF 1.4 · CAS Tier 3, Computer Science · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-10-31 · DOI: 10.1109/LCA.2025.3627539
Myunghyun Rhee;Sookyung Choi;Euiseok Kim;Joonseop Sim;Youngpyo Joo;Hoshik Kim
The escalating context length in Large Language Models (LLMs) creates a severe performance bottleneck around the Key-Value (KV) cache, whose memory-bound nature leads to significant GPU under-utilization. This paper introduces Mixture of Shared KV Attention (MoSKA), an architecture that addresses this challenge by exploiting the heterogeneity of context data. It differentiates between per-request unique and massively reused shared sequences. The core of MoSKA is a novel Shared KV Attention mechanism that transforms the attention on shared data from a series of memory-bound GEMV operations into a single, compute-bound GEMM by batching concurrent requests. This is supported by an MoE-inspired sparse attention strategy that prunes the search space and a tailored Disaggregated Infrastructure that specializes hardware for unique and shared data. This comprehensive approach demonstrates a throughput increase of up to 538.7× over baselines in workloads with high context sharing, offering a clear architectural path toward scalable LLM inference.
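The GEMV-to-GEMM transformation at the heart of Shared KV Attention can be checked numerically: when many requests attend over the same shared context, stacking their query vectors turns per-request matrix-vector products into one matrix-matrix product with identical results. The NumPy sketch below demonstrates that equivalence with illustrative shapes; it is not the MoSKA implementation.

```python
# GEMV-to-GEMM batching over a shared KV cache, verified numerically in NumPy.
# Shapes (batch, context length, head dim) are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)
B, L, D = 32, 4096, 128                                # requests, shared context, head dim
K = rng.standard_normal((L, D)).astype(np.float32)     # shared keys
V = rng.standard_normal((L, D)).astype(np.float32)     # shared values
Q = rng.standard_normal((B, D)).astype(np.float32)     # one query vector per request

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Per-request path: B separate GEMVs over the same K/V (memory-bound).
out_gemv = np.stack([softmax(K @ Q[b] / np.sqrt(D)) @ V for b in range(B)])

# Shared path: one GEMM batches all B requests against the shared K/V.
out_gemm = softmax(Q @ K.T / np.sqrt(D), axis=-1) @ V

assert np.allclose(out_gemv, out_gemm, atol=1e-4)      # same result, different kernel shape
```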
Citations: 0
Characterizing the System Overhead of Discrete Noise Generation for Differential Privacy
IF 1.4 · CAS Tier 3, Computer Science · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-10-31 · DOI: 10.1109/LCA.2025.3627101
SeokHyeon Kong;Donghwan Kim;Euiseong Seo;Kiwan Maeng
Differential privacy (DP) has become a de facto standard in restricting data leakage when releasing statistical information. It is widely used in data-centric applications, ranging from the US Census Bureau's population data collection to modern deep learning training. Recent studies have shown that using floating-point in implementing DP can significantly degrade its mathematical guarantees and strongly advised to use an integer-based implementation instead. However, nearly all popular DP libraries currently use floating-point. In this paper, we characterize the performance of a recent integer-based DP library from Google. Our study reveals that the noise generation is significantly slower (by 187–296×) when using an integer-based implementation, and that noise sampling can become a non-negligible overhead in real applications. Our findings highlight an overlooked but important overhead in realizing high-privacy DP and call for greater focus from the community.
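To make the overhead concrete, the sketch below compares vectorized floating-point Laplace sampling against an integer-only discrete Laplace sampler built from exact Bernoulli(exp(-n/d)) draws, following the Canonne-Kamath-Steinke construction. It is a simplified re-implementation for illustration, not the code of Google's library, and the measured slowdown is machine-dependent and conflates Python interpreter overhead with the algorithmic cost.

```python
# Floating-point Laplace noise (one vectorized NumPy call) versus an integer-only
# discrete Laplace sampler in the style of Canonne-Kamath-Steinke. Illustrative only.
import random
import time
import numpy as np

def bernoulli(n, d, rng):
    """Exact Bernoulli(n/d) from a single uniform integer draw."""
    return rng.randrange(d) < n

def bernoulli_exp_frac(n, d, rng):
    """Bernoulli(exp(-n/d)) for 0 <= n/d <= 1, using only integer draws."""
    k = 1
    while bernoulli(n, d * k, rng):
        k += 1
    return k % 2 == 1

def bernoulli_exp(n, d, rng):
    """Bernoulli(exp(-n/d)) for arbitrary n/d >= 0: peel off exp(-1) factors."""
    while n > d:
        if not bernoulli_exp_frac(1, 1, rng):
            return False
        n -= d
    return bernoulli_exp_frac(n, d, rng)

def discrete_laplace(s, t, rng=random):
    """Integer-only sampler for the discrete Laplace with scale t/s."""
    while True:
        u = rng.randrange(t)
        if not bernoulli_exp(u, t, rng):
            continue
        v = 0
        while bernoulli_exp(1, 1, rng):
            v += 1
        y = (u + t * v) // s
        if rng.randrange(2) == 1:          # random sign
            if y == 0:
                continue                   # avoid double-counting zero
            return -y
        return y

N = 20_000
rng = random.Random(0)

t0 = time.perf_counter()
float_noise = np.random.default_rng(0).laplace(scale=2.0, size=N)
t1 = time.perf_counter()
int_noise = [discrete_laplace(s=1, t=2, rng=rng) for _ in range(N)]   # scale t/s = 2
t2 = time.perf_counter()

print(f"float Laplace: {t1 - t0:.4f} s   integer discrete Laplace: {t2 - t1:.4f} s")
```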
Citations: 0
In-Depth Characterization of Machine Learning on an Optimized Multi-Party Computing Library
IF 1.4 · CAS Tier 3, Computer Science · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-10-23 · DOI: 10.1109/LCA.2025.3624787
Jinyu Liu;Kiwan Maeng
Secure multi-party computation (MPC) allows multiple parties to collaboratively run machine learning (ML) training and inference without each party revealing its secret data or model weights. Prior works characterized popular MPC-based ML libraries, such as Meta’s CrypTen, to reveal their system overheads and built optimizations guided by the observations. However, we found potential concerns in this process. Through a careful inspection of the CrypTen library, we discovered several inefficient implementations that could overshadow fundamental MPC-related overheads. Furthermore, we observed that the characteristics can vary significantly depending on several factors, such as the model type, batch size, sequence length, and network conditions, many of which prior works do not vary during their evaluation. Our results indicate that focusing solely on a narrow experimental setup and/or relying on characterization without a deep understanding can misguide researchers and call for a more mature framework and standardized evaluation methodology.
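For readers unfamiliar with what such libraries compute, the sketch below shows the arithmetic secret-sharing primitive they are built on: fixed-point encode values into a 64-bit ring, split them into additive shares, and reconstruct. It is a minimal illustration of the primitive, not CrypTen's code path; the precision and ring size are common defaults chosen here as assumptions, not values taken from this letter.

```python
# Two-party additive secret sharing over Z_{2^64} with fixed-point encoding:
# each party alone sees only uniformly random-looking data, yet local addition
# of shares yields shares of the sum. Illustration only, not CrypTen's code.
import numpy as np

PREC = 16  # fixed-point fractional bits (assumption)

def encode(x):
    return np.round(np.asarray(x) * (1 << PREC)).astype(np.int64).view(np.uint64)

def decode(x):
    return x.view(np.int64).astype(np.float64) / (1 << PREC)

def share(x, rng):
    """Split an encoded tensor into two additive shares; uint64 wraps mod 2^64."""
    enc = encode(x)
    r = np.frombuffer(rng.bytes(enc.size * 8), dtype=np.uint64).reshape(enc.shape)
    return r, enc - r

def reconstruct(a, b):
    return decode(a + b)

rng = np.random.default_rng(0)
x, y = np.array([1.5, -2.25]), np.array([0.5, 4.0])
x0, x1 = share(x, rng)
y0, y1 = share(y, rng)
# Addition is "free": each party adds its local shares, no communication needed.
z0, z1 = x0 + y0, x1 + y1
print(reconstruct(z0, z1))   # [2.0, 1.75]
```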
Citations: 0
A Multiple-Aspect Optimal CNN Accelerator in Top1 Accuracy, Performance, and Power Efficiency
IF 1.4 · CAS Tier 3, Computer Science · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-10-22 · DOI: 10.1109/LCA.2025.3624004
Xianghong Hu;Yuanmiao Lin;Xueming Li;Ruidian Zhan;Jie Cao;Dayong Zhu;Shuting Cai;Xin Zheng;Xiaoming Xiong
The customization of accelerators for low-bit and mixed-bit convolutional neural networks (CNNs) has been a promising approach to enhancing the computing efficiency of CNNs. However, current low-bit and mixed-bit accelerators sacrifice some network accuracy to achieve higher performance and power efficiency, especially for lightweight CNNs such as MobileNets that contain depth-wise convolution (DW-CONV). These accelerators for low-bit or mixed-bit CNNs tend to achieve good results in only one aspect, such as performance, power efficiency, or Top1 accuracy, and it is difficult to achieve good experimental results in all aspects. In this work, we propose an accelerator that performs well across multiple aspects, including performance, power efficiency, and Top1 accuracy. First, an arbitrary-basis quantization (ABQ) method is used to enhance Top1 accuracy, and a dedicated ABQ-based processing element (PE) is proposed to improve performance. Then, an adaptive data flow is presented to support standard convolution (SD-CONV) and depth-wise convolution (DW-CONV) efficiently without increasing the hardware consumption of the ABQ-based PE. Implemented on the Zynq ZC706 platform and compared with other works, the proposed accelerator is the first to achieve good experimental results in all aspects, achieving 1.28×–5.76× higher power efficiency and 1.11×–5.81× higher performance while delivering the best Top1 accuracy.
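One reason DW-CONV support matters, independent of the quantization scheme, is its much lower data reuse than SD-CONV, which is what an adaptive dataflow has to absorb. The sketch below is a generic MAC/weight/reuse comparison with MobileNet-style layer shapes; it illustrates that gap only and says nothing about the ABQ method itself.

```python
# Why depth-wise convolution needs its own dataflow: far fewer MACs per input
# activation than standard convolution of the same output shape, so PE arrays
# tuned for SD-CONV reuse sit idle. Layer shapes are generic MobileNet-style values.

def conv_stats(h, w, c_in, c_out, k, depthwise=False):
    """MACs, weight count, and MACs per input activation (stride 1, 'same' padding)."""
    if depthwise:
        macs, weights = h * w * c_in * k * k, c_in * k * k
    else:
        macs, weights = h * w * c_in * c_out * k * k, c_in * c_out * k * k
    return macs, weights, macs / (h * w * c_in)

for name, dw in (("SD-CONV", False), ("DW-CONV", True)):
    macs, weights, reuse = conv_stats(56, 56, 128, 128, 3, depthwise=dw)
    print(f"{name}: {macs/1e6:8.1f} MMACs, {weights/1e3:7.1f} K weights, "
          f"{reuse:6.0f} MACs per input activation")
```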
Citations: 0