LLMServingSim2.0: A Unified Simulator for Heterogeneous Hardware and Serving Techniques in LLM Infrastructure
Pub Date: 2025-11-03 | DOI: 10.1109/LCA.2025.3628325 | IEEE Computer Architecture Letters, vol. 24, no. 2, pp. 361-364
Jaehong Cho;Hyunmin Choi;Jongse Park
This letter introduces LLMServingSim2.0, a system simulator designed for exploring heterogeneous hardware in large-scale LLM serving systems. LLMServingSim2.0 addresses two key limitations of its predecessor: (1) integrating hardware models into system-level simulators is non-trivial due to the lack of a clear abstraction, and (2) existing simulators support only a narrow subset of serving techniques, leaving no infrastructure that captures the breadth of approaches in modern LLM serving. To overcome these issues, LLMServingSim2.0 adopts trace-driven performance modeling, accompanied by an operator-level latency profiler, enabling the integration of new accelerators with a single command. It further embeds up-to-date serving techniques while exposing flexible interfaces for request routing, cache management, and scheduling policies. In a TPU case study, our profiler requires 18.5× fewer LoC and outperforms the predecessor's hardware-simulator integration, demonstrating LLMServingSim2.0's low-effort hardware extensibility. Our experiments further show that LLMServingSim2.0 reproduces GPU-based LLM serving with 1.9% error, while maintaining practical simulation time, making it a comprehensive platform for both hardware developers and LLM service providers.
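The central idea here is trace-driven performance modeling: instead of simulating an accelerator cycle by cycle, the simulator replays operator-level latencies that a profiler measured once on the target hardware. The sketch below illustrates that lookup-and-sum approach; the trace format, operator names, and the estimate_decode_step helper are assumptions for illustration, not the LLMServingSim2.0 API:

# Illustrative trace-driven latency model (not LLMServingSim2.0's actual interface).
# A profiler records each (operator, shape) pair once on the target accelerator;
# the simulator then sums trace entries instead of modeling the hardware directly.

op_latency_us = {
    # (operator, batch, seq_len) -> measured latency in microseconds (made-up numbers)
    ("qkv_proj",  8, 1):  42.0,
    ("attention", 8, 1): 118.0,
    ("mlp",       8, 1):  95.0,
}

def estimate_decode_step(batch, seq_len, layers):
    """Estimate one decode iteration by replaying profiled operator latencies."""
    per_layer = sum(op_latency_us[(op, batch, seq_len)]
                    for op in ("qkv_proj", "attention", "mlp"))
    return per_layer * layers  # microseconds

print(estimate_decode_step(batch=8, seq_len=1, layers=32))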
{"title":"LLMServingSim2.0: A Unified Simulator for Heterogeneous Hardware and Serving Techniques in LLM Infrastructure","authors":"Jaehong Cho;Hyunmin Choi;Jongse Park","doi":"10.1109/LCA.2025.3628325","DOIUrl":"https://doi.org/10.1109/LCA.2025.3628325","url":null,"abstract":"This letter introduces LLMServingSim2.0, a system simulator designed for exploring heterogeneous hardware in large-scale LLM serving systems. LLMServingSim2.0 addresses two key limitations of its predecessor: (1) integrating hardware models into system-level simulators is non-trivial due to the lack of a clear abstraction, and (2) existing simulators support only a narrow subset of serving techniques, leaving no infrastructure that captures the breadth of approaches in modern LLM serving. To overcome these issues, LLMServingSim2.0 adopts trace-driven performance modeling, accompanied by an operator-level latency profiler, enabling the integration of new accelerators with a single command. It further embeds up-to-date serving techniques while exposing flexible interfaces for request routing, cache management, and scheduling policies. In a TPU case study, our profiler requires 18.5 × fewer LoC and outperforms the predecessor’s hardware-simulator integration, demonstrating LLMServingSim2.0’s low-effort hardware extensibility. Our experiments further show that LLMServingSim2.0 reproduces GPU-based LLM serving with 1.9% error, while maintaining practical simulation time, making it a comprehensive platform for both hardware developers and LLM service providers.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"361-364"},"PeriodicalIF":1.4,"publicationDate":"2025-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145560767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
StreamDQ: HBM-Integrated On-the-Fly DeQuantization via Memory Load for Large Language Models
Pub Date: 2025-10-31 | DOI: 10.1109/LCA.2025.3626929 | IEEE Computer Architecture Letters, vol. 24, no. 2, pp. 373-376
Minki Jeong;Daegun Yoon;Soohong Ahn;Seungyong Lee;Jooyoung Kim;Jinuk Jeon;Joonseop Sim;Youngpyo Joo;Hoshik Kim
As large language models (LLMs) scale, their memory and computation demands increase, making weight-only quantization a widely adopted technique to reduce memory footprint with minimal accuracy loss. However, CUDA core–based dequantization introduces significant instruction overhead, memory traffic, and pipeline stalls across all batch sizes, and critically remains a persistent bottleneck even in large-batch, cloud-scale LLM serving. To address these challenges, we propose StreamDQ, a lightweight architectural enhancement for cloud-scale LLM inference that enables on-the-fly dequantization within the memory subsystem by integrating compact DeQuantization Blocks (DQBs) into the base-die of high-bandwidth memory (HBM). StreamDQ leverages reserved address bits for control signaling, requiring no modifications to the ISA or compiler. Our design minimizes data movement, offloads computation from CUDA cores, and delivers dequantized weights directly to tensor cores for general matrix multiplication (GEMM) execution. StreamDQ achieves up to 83.57% reduction in inference latency and 5.15× improvement in tokens-per-second throughput, with only 0.013 mm² area and 0.17 W power overhead per DQB. The design is scalable, software-transparent, and well-suited for high-throughput LLM inference on modern HBM-enabled GPU platforms.
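With weight-only quantization, every GEMM must first expand the packed low-bit weights back to the compute datatype; this expansion is the step StreamDQ moves off the CUDA cores and into the HBM base die. A minimal numpy sketch of that dequantize-then-multiply pattern, assuming a per-channel scale/zero-point INT4 scheme (an illustration, not StreamDQ's hardware datapath):

import numpy as np

# Assumed INT4 weight-only scheme: w ~= scale * (q - zero_point).
rng = np.random.default_rng(0)
out_features, in_features = 128, 256
q = rng.integers(0, 16, size=(out_features, in_features), dtype=np.uint8)  # 4-bit codes
scale = rng.random((out_features, 1), dtype=np.float32) * 0.02             # per-output-channel scale
zero_point = np.full((out_features, 1), 8, dtype=np.float32)

x = rng.standard_normal((in_features,), dtype=np.float32)                  # activation vector

# The step a GPU normally runs on CUDA cores before the tensor-core GEMM:
w = scale * (q.astype(np.float32) - zero_point)   # dequantization (extra instructions + traffic)
y = w @ x                                         # GEMM/GEMV on the dequantized weights

# StreamDQ's premise: if the memory subsystem returns `w` directly on load,
# this intermediate expansion never occupies the CUDA cores at all.
print(y.shape)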
{"title":"StreamDQ: HBM-Integrated On-the-Fly DeQuantization via Memory Load for Large Language Models","authors":"Minki Jeong;Daegun Yoon;Soohong Ahn;Seungyong Lee;Jooyoung Kim;Jinuk Jeon;Joonseop Sim;Youngpyo Joo;Hoshik Kim","doi":"10.1109/LCA.2025.3626929","DOIUrl":"https://doi.org/10.1109/LCA.2025.3626929","url":null,"abstract":"As large language models (LLMs) scale, their memory and computation demands increase, making weight-only quantization a widely adopted technique to reduce memory footprint with minimal accuracy loss. However, CUDA core–based dequantization introduces significant instruction overhead, memory traffic, and pipeline stalls across all batch sizes, and critically remains a persistent bottleneck even in large-batch, cloud-scale LLM serving. To address these challenges, we propose StreamDQ, a lightweight architectural enhancement for cloud-scale LLM inference that enables on-the-fly dequantization within the memory subsystem by integrating compact DeQuantization Blocks (DQBs) into the base-die of high-bandwidth memory (HBM). StreamDQ leverages reserved address bits for control signaling, requiring no modifications to the ISA or compiler. Our design minimizes data movement, offloads computation from CUDA cores, and delivers dequantized weights directly to tensor cores for general matrix multiplication (GEMM) execution. StreamDQ achieves up to 83.57% reduction in inference latency and 5.15 × improvement in tokens-per-second throughput, with only 0.013 <inline-formula><tex-math>${mathrm{mm}}^{2}$</tex-math></inline-formula> area and 0.17 W power overhead per DQB. The design is scalable, software-transparent, and well-suited for high-throughput LLM inference on modern HBM-enabled GPU platforms.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"373-376"},"PeriodicalIF":1.4,"publicationDate":"2025-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145612132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MoSKA: Mixture of Shared KV Attention for Efficient Long-Sequence LLM Inference
Pub Date: 2025-10-31 | DOI: 10.1109/LCA.2025.3627539 | IEEE Computer Architecture Letters, vol. 24, no. 2, pp. 365-368
Myunghyun Rhee;Sookyung Choi;Euiseok Kim;Joonseop Sim;Youngpyo Joo;Hoshik Kim
The escalating context length in Large Language Models (LLMs) creates a severe performance bottleneck around the Key-Value (KV) cache, whose memory-bound nature leads to significant GPU under-utilization. This paper introduces Mixture of Shared KV Attention (MoSKA), an architecture that addresses this challenge by exploiting the heterogeneity of context data. It differentiates between per-request unique and massively reused shared sequences. The core of MoSKA is a novel Shared KV Attention mechanism that transforms the attention on shared data from a series of memory-bound GEMV operations into a single, compute-bound GEMM by batching concurrent requests. This is supported by an MoE-inspired sparse attention strategy that prunes the search space and a tailored Disaggregated Infrastructure that specializes hardware for unique and shared data. This comprehensive approach demonstrates a throughput increase of up to 538.7× over baselines in workloads with high context sharing, offering a clear architectural path toward scalable LLM inference.
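The key transformation is that attention over a shared prefix, computed per request, is a memory-bound GEMV (one query vector against the shared K matrix), whereas stacking the concurrent queries turns it into a single compute-bound GEMM that streams the shared KV cache once. A small numpy sketch of that batching step (shapes and the softmax details are illustrative, not the paper's kernel):

import numpy as np

rng = np.random.default_rng(0)
d, shared_len, n_requests = 128, 4096, 32
K_shared = rng.standard_normal((shared_len, d)).astype(np.float32)  # KV of the reused shared context
V_shared = rng.standard_normal((shared_len, d)).astype(np.float32)
queries = rng.standard_normal((n_requests, d)).astype(np.float32)   # one decode query per request

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Baseline: one memory-bound GEMV per request; K_shared/V_shared are re-read n_requests times.
outs_gemv = np.stack([softmax(q @ K_shared.T / np.sqrt(d)) @ V_shared for q in queries])

# MoSKA-style batching: one GEMM over the stacked queries; the shared KV is read once.
scores = queries @ K_shared.T / np.sqrt(d)   # (n_requests, shared_len) GEMM
outs_gemm = softmax(scores) @ V_shared       # second GEMM

assert np.allclose(outs_gemv, outs_gemm, atol=1e-4)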
{"title":"MoSKA: Mixture of Shared KV Attention for Efficient Long-Sequence LLM Inference","authors":"Myunghyun Rhee;Sookyung Choi;Euiseok Kim;Joonseop Sim;Youngpyo Joo;Hoshik Kim","doi":"10.1109/LCA.2025.3627539","DOIUrl":"https://doi.org/10.1109/LCA.2025.3627539","url":null,"abstract":"The escalating context length in Large Language Models (LLMs) creates a severe performance bottleneck around the Key-Value (KV) cache, whose memory-bound nature leads to significant GPU under-utilization. This paper introduces <italic>Mixture of Shared KV Attention (MoSKA)</i>, an architecture that addresses this challenge by exploiting the heterogeneity of context data. It differentiates between per-request unique and massively reused shared sequences. The core of <italic>MoSKA</i> is a novel <italic>Shared KV Attention</i> mechanism that transforms the attention on shared data from a series of memory-bound GEMV operations into a single, compute-bound GEMM by batching concurrent requests. This is supported by an <italic>MoE-inspired sparse attention</i> strategy that prunes the search space and a tailored <italic>Disaggregated Infrastructure</i> that specializes hardware for unique and shared data. This comprehensive approach demonstrates a throughput increase of up to 538.7 × over baselines in workloads with high context sharing, offering a clear architectural path toward scalable LLM inference","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"365-368"},"PeriodicalIF":1.4,"publicationDate":"2025-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145560768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Characterizing the System Overhead of Discrete Noise Generation for Differential Privacy
Pub Date: 2025-10-31 | DOI: 10.1109/LCA.2025.3627101 | IEEE Computer Architecture Letters, vol. 24, no. 2, pp. 345-348
SeokHyeon Kong;Donghwan Kim;Euiseong Seo;Kiwan Maeng
Differential privacy (DP) has become a de facto standard for restricting data leakage when releasing statistical information. It is widely used in data-centric applications, ranging from the US Census Bureau's population data collection to modern deep learning training. Recent studies have shown that using floating-point in implementing DP can significantly degrade its mathematical guarantees and have strongly advised using an integer-based implementation instead. However, nearly all popular DP libraries currently use floating-point. In this paper, we characterize the performance of a recent integer-based DP library from Google. Our study reveals that noise generation is significantly slower (by 187–296×) with the integer-based implementation, and that noise sampling can become a non-negligible overhead in real applications. Our findings highlight an overlooked but important overhead in realizing high-privacy DP and call for greater focus from the community.
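The gap being measured comes from replacing vectorized floating-point samplers with exact, integer-only ones. The sketch below contrasts numpy's float Laplace noise with a pure-Python, integer-only discrete Laplace sampler in the style of Canonne, Kamath, and Steinke (a building block of discrete Gaussian sampling); it is an illustration of why exact rejection-style sampling is slow, not code from Google's library, and the timings will differ by machine:

import random, time
import numpy as np

def bernoulli_exp_minus(num, den, rng):
    """Return 1 with probability exp(-num/den), using integer arithmetic only."""
    while num > den:                        # reduce gamma > 1 via exp(-g) = exp(-1)^k * exp(-frac)
        if not bernoulli_exp_minus(den, den, rng):
            return 0
        num -= den
    k = 1                                   # series construction for gamma in [0, 1]
    while rng.randrange(den * k) < num:     # Bernoulli(gamma / k)
        k += 1
    return k % 2                            # 1 iff the first failure index is odd

def discrete_laplace(t, rng):
    """Exact sample with P(x) proportional to exp(-|x|/t), for a positive integer scale t."""
    while True:
        u = rng.randrange(t)
        if not bernoulli_exp_minus(u, t, rng):
            continue
        v = 0
        while bernoulli_exp_minus(1, 1, rng):   # geometric count of exp(-1) successes
            v += 1
        mag = u + t * v
        if rng.randrange(2) == 0:
            return mag
        if mag != 0:
            return -mag                         # resample instead of double-counting zero

rng, n = random.Random(0), 10_000
t0 = time.perf_counter(); _ = np.random.default_rng(0).laplace(scale=10.0, size=n)
t1 = time.perf_counter(); _ = [discrete_laplace(10, rng) for _ in range(n)]
t2 = time.perf_counter()
print(f"float laplace: {t1 - t0:.4f}s   integer discrete laplace: {t2 - t1:.4f}s")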
{"title":"Characterizing the System Overhead of Discrete Noise Generation for Differential Privacy","authors":"SeokHyeon Kong;Donghwan Kim;Euiseong Seo;Kiwan Maeng","doi":"10.1109/LCA.2025.3627101","DOIUrl":"https://doi.org/10.1109/LCA.2025.3627101","url":null,"abstract":"Differential privacy (DP) has become a de facto standard in restricting data leakage when releasing statistical information. It is widely used in data-centric applications, ranging from the US Census Bureau's population data collection to modern deep learning training. Recent studies have shown that using floating-point in implementing DP can significantly degrade its mathematical guarantees and strongly advised to use an integer-based implementation instead. However, nearly all popular DP libraries currently use floating-point. In this paper, we characterize the performance of a recent integer-based DP library from Google. Our study reveals that the noise generation is significantly slower (by 187–296×) when using an integer-based implementation, and that noise sampling can become a non-negligible overhead in real applications. Our findings highlight an overlooked but important overhead in realizing high-privacy DP and call for greater focus from the community.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"345-348"},"PeriodicalIF":1.4,"publicationDate":"2025-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145560766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In-Depth Characterization of Machine Learning on an Optimized Multi-Party Computing Library
Pub Date: 2025-10-23 | DOI: 10.1109/LCA.2025.3624787 | IEEE Computer Architecture Letters, vol. 24, no. 2, pp. 341-344
Jinyu Liu;Kiwan Maeng
Secure multi-party computation (MPC) allows multiple parties to collaboratively run machine learning (ML) training and inference without each party revealing its secret data or model weights. Prior works characterized popular MPC-based ML libraries, such as Meta’s CrypTen, to reveal their system overheads and built optimizations guided by the observations. However, we found potential concerns in this process. Through a careful inspection of the CrypTen library, we discovered several inefficient implementations that could overshadow fundamental MPC-related overheads. Furthermore, we observed that the characteristics can vary significantly depending on several factors, such as the model type, batch size, sequence length, and network conditions, many of which prior works do not vary during their evaluation. Our results indicate that focusing solely on a narrow experimental setup and/or relying on characterization without a deep understanding can misguide researchers and call for a more mature framework and standardized evaluation methodology.
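For readers unfamiliar with where the "fundamental MPC-related overheads" come from, the sketch below shows 2-party additive secret sharing with a dealer-supplied Beaver triple for a matrix product: the only values the parties exchange are the masked openings of E and F, and it is these open rounds (plus fixed-point truncation, omitted here) that grow with model size, batch size, and sequence length. This is a textbook construction over a 2^32 ring, not CrypTen's implementation:

import numpy as np

MOD = 1 << 32                                   # additive sharing over the ring Z_{2^32}
rng = np.random.default_rng(0)

def share(x):
    r = rng.integers(0, MOD, size=x.shape, dtype=np.uint64)
    return r, (x - r) % MOD                     # party 0 holds r, party 1 holds x - r

def reconstruct(a, b):
    return (a + b) % MOD

def beaver_matmul(x0, x1, y0, y1):
    """Secret-shared X @ Y using a dealer-generated triple (A, B, C = A @ B)."""
    A = rng.integers(0, MOD, size=x0.shape, dtype=np.uint64)
    B = rng.integers(0, MOD, size=y0.shape, dtype=np.uint64)
    C = (A @ B) % MOD
    A0, A1 = share(A); B0, B1 = share(B); C0, C1 = share(C)
    # Communication happens here: both parties open E = X - A and F = Y - B.
    E = reconstruct((x0 - A0) % MOD, (x1 - A1) % MOD)
    F = reconstruct((y0 - B0) % MOD, (y1 - B1) % MOD)
    z0 = (C0 + E @ B0 + A0 @ F + E @ F) % MOD   # local arithmetic only
    z1 = (C1 + E @ B1 + A1 @ F) % MOD
    return z0, z1

X = rng.integers(0, 100, size=(4, 8), dtype=np.uint64)   # small integer stand-ins for weights/activations
Y = rng.integers(0, 100, size=(8, 3), dtype=np.uint64)
z0, z1 = beaver_matmul(*share(X), *share(Y))
assert np.array_equal(reconstruct(z0, z1), (X @ Y) % MOD)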
{"title":"In-Depth Characterization of Machine Learning on an Optimized Multi-Party Computing Library","authors":"Jinyu Liu;Kiwan Maeng","doi":"10.1109/LCA.2025.3624787","DOIUrl":"https://doi.org/10.1109/LCA.2025.3624787","url":null,"abstract":"Secure multi-party computation (MPC) allows multiple parties to collaboratively run machine learning (ML) training and inference without each party revealing its secret data or model weights. Prior works characterized popular MPC-based ML libraries, such as Meta’s CrypTen, to reveal their system overheads and built optimizations guided by the observations. However, we found potential concerns in this process. Through a careful inspection of the CrypTen library, we discovered several inefficient implementations that could overshadow fundamental MPC-related overheads. Furthermore, we observed that the characteristics can vary significantly depending on several factors, such as the model type, batch size, sequence length, and network conditions, many of which prior works do not vary during their evaluation. Our results indicate that focusing solely on a narrow experimental setup and/or relying on characterization without a deep understanding can misguide researchers and call for a more mature framework and standardized evaluation methodology.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"341-344"},"PeriodicalIF":1.4,"publicationDate":"2025-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145560749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Multiple-Aspect Optimal CNN Accelerator in Top1 Accuracy, Performance, and Power Efficiency
Pub Date: 2025-10-22 | DOI: 10.1109/LCA.2025.3624004 | IEEE Computer Architecture Letters, vol. 24, no. 2, pp. 349-352
Xianghong Hu;Yuanmiao Lin;Xueming Li;Ruidian Zhan;Jie Cao;Dayong Zhu;Shuting Cai;Xin Zheng;Xiaoming Xiong
Customizing accelerators for low-bit and mixed-bit convolutional neural networks (CNNs) has been a promising approach to enhancing their computing efficiency. However, current low-bit and mixed-bit accelerators sacrifice some network accuracy to achieve higher performance and power efficiency, especially for lightweight CNNs such as MobileNets that contain depth-wise convolution (DW-CONV). These accelerators tend to achieve good results in only one aspect, such as performance, power efficiency, or Top1 accuracy, and struggle to deliver good experimental results in all of them at once. In this work, we propose an accelerator that performs well across all of these aspects: performance, power efficiency, and Top1 accuracy. First, an arbitrary-basis quantization (ABQ) method is used to enhance Top1 accuracy, and a dedicated ABQ-based processing element (PE) is proposed to improve performance. Then, an adaptive dataflow is presented that supports standard convolution (SD-CONV) and depth-wise convolution (DW-CONV) efficiently without increasing the hardware consumption of the ABQ-based PE. Implemented on the Zynq ZC706 platform and compared with other works, the proposed accelerator is the first to achieve good experimental results in all aspects, reaching 1.28×–5.76× the power efficiency and 1.11×–5.81× the performance of prior designs while delivering the best Top1 accuracy.
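Depth-wise convolution is singled out because of its low data reuse: each input activation feeds only K*K multiply-accumulates instead of K*K*C_out, so a PE array sized for channel reduction in standard convolution sits mostly idle. A back-of-the-envelope comparison (the layer shape is assumed for illustration, not taken from the paper):

# MAC counts and per-activation reuse for a 3x3 layer at 56x56 resolution.
H = W = 56
K = 3
C_in = C_out = 128

std_macs = H * W * K * K * C_in * C_out   # standard convolution
dw_macs  = H * W * K * K * C_in           # depth-wise convolution (one filter per channel)

std_reuse_per_activation = K * K * C_out  # times each input activation is consumed
dw_reuse_per_activation  = K * K

print(f"standard conv  : {std_macs/1e6:8.1f} MMACs, reuse per activation = {std_reuse_per_activation}")
print(f"depth-wise conv: {dw_macs/1e6:8.1f} MMACs, reuse per activation = {dw_reuse_per_activation}")
# Depth-wise convolution performs C_out times fewer MACs per byte of activation traffic,
# which is why a dataflow tuned only for standard convolution loses utilization on MobileNet-style layers.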
{"title":"A Multiple-Aspect Optimal CNN Accelerator in Top1 Accuracy, Performance, and Power Efficiency","authors":"Xianghong Hu;Yuanmiao Lin;Xueming Li;Ruidian Zhan;Jie Cao;Dayong Zhu;Shuting Cai;Xin Zheng;Xiaoming Xiong","doi":"10.1109/LCA.2025.3624004","DOIUrl":"https://doi.org/10.1109/LCA.2025.3624004","url":null,"abstract":"The customization of accelerators for low-bit and mixed-bit convolutional neural works (CNNs) has been a promising approach to enhance computing efficiency of CNNs. However, current low-bit and mix-bit accelerator sacrifice some network accuracy to achieve higher performance and power efficiency, especially for lightweight CNNs like MobileNets containing depth-wise convolution (DW-CONV). These accelerators for low-bit or mixed-bit CNNS tend to achieve good results in only one aspect, such as performance, power efficiency, or Top1 accuracy, and it is difficult to achieve good experimental results in all aspects. In this work, we propose an accelerator that perform well in multifaceted aspects including performance, power efficiency, and Top1 accuracy. First, arbitrary-basis quantization (ABQ) method is used to enhance Top1 accuracy and a dedicated ABQ-based processing element (PE) is proposed to improve performance. Then, an adaptive data flow is presented to support standard convolution (SD-CONV) and depth-wise convolution (DW-CONV) efficiently in the primise of without increasing hardware consumption of the ABQ-based PE. Implemented on Zynp ZC706 platform, compared with other works, the proposed accelerator first achieve good experimental results in all aspects, achieving 1.28 × –5.76 × power efficiency and 1.11 × –5.81 × performance in the premise of the best Top1 accuracy.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"349-352"},"PeriodicalIF":1.4,"publicationDate":"2025-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145560751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PNM Meets Sparse Attention: Enabling Multi-Million Tokens Inference at Scale
Pub Date: 2025-10-22 | DOI: 10.1109/LCA.2025.3624272 | IEEE Computer Architecture Letters, vol. 24, no. 2, pp. 353-356
Sookyung Choi;Myunghyun Rhee;Euiseok Kim;Kwangsik Shin;Youngpyo Joo;Hoshik Kim
Processing multi-million tokens for advanced Large Language Models (LLMs) poses a significant memory bottleneck for existing AI systems. This bottleneck stems from a fundamental resource imbalance, where enormous memory capacity and bandwidth are required, yet the computational load is minimal. We propose NELSSA (Processing Near Memory for Extremely Long Sequences with Sparse Attention), an architectural platform that synergistically combines the high-capacity Processing Near Memory (PNM) with the principles of dynamic sparse attention to address this issue. This approach enables capacity scaling without performance degradation, and our evaluation shows that NELSSA can process up to 20M-token sequences on a single node (Llama-2-70B), achieving an 11× to 40× speedup over a representative DIMM-based PNM system. The proposed architecture radically resolves existing inefficiencies, enabling previously impractical multi-million-token processing and thus laying the foundation for next-generation AI applications.
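Dynamic sparse attention of the kind NELSSA builds on typically scores coarse blocks of the KV cache with a cheap proxy (for example, the query against a per-block mean key) and then runs exact attention only over the selected blocks, which keeps the compute near memory modest even for multi-million-token contexts. A small numpy sketch of that select-then-attend pattern (block size, top-k, and the proxy score are assumptions, not the paper's exact policy):

import numpy as np

rng = np.random.default_rng(0)
d, block, n_blocks, topk = 64, 256, 64, 4
seq_len = block * n_blocks                                 # 16K-token toy context

K = rng.standard_normal((seq_len, d)).astype(np.float32)
V = rng.standard_normal((seq_len, d)).astype(np.float32)
q = rng.standard_normal((d,)).astype(np.float32)

# Cheap proxy: score each block by the query against the block's mean key.
block_means = K.reshape(n_blocks, block, d).mean(axis=1)   # (n_blocks, d)
selected = np.argsort(block_means @ q)[-topk:]             # indices of the top-k blocks

# Exact attention restricted to the selected blocks only.
idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in selected])
scores = (K[idx] @ q) / np.sqrt(d)
weights = np.exp(scores - scores.max()); weights /= weights.sum()
out = weights @ V[idx]                                     # (d,) attention output

print(f"attended to {len(idx)} of {seq_len} cached tokens")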
{"title":"PNM Meets Sparse Attention: Enabling Multi-Million Tokens Inference at Scale","authors":"Sookyung Choi;Myunghyun Rhee;Euiseok Kim;Kwangsik Shin;Youngpyo Joo;Hoshik Kim","doi":"10.1109/LCA.2025.3624272","DOIUrl":"https://doi.org/10.1109/LCA.2025.3624272","url":null,"abstract":"Processing multi-million tokens for advanced Large Language Models (LLMs) poses a significant memory bottleneck for existing AI systems. This bottleneck stems from a fundamental resource imbalance, where enormous memory capacity and bandwidth are required, yet the computational load is minimal. We propose <monospace>NELSSA</monospace> (Processing <underline>N</u>ear Memory for <underline>E</u>xtremely <underline>L</u>ong <underline>S</u>equences with <underline>S</u>parse <underline>A</u>ttention), an architectural platform that synergistically combines the high-capacity Processing Near Memory (PNM) with the principles of dynamic sparse attention to address this issue. This approach enables capacity scaling without performance degradation, and our evaluation shows that <monospace>NELSSA</monospace> can process up to 20M-token sequences on a single node (Llama-2-70B), achieving an 11× to 40× speedup over a representative DIMM-based PNM system. The proposed architecture radically resolves existing inefficiencies, enabling previously impractical multi-million-token processing and thus laying the foundation for next-generation AI applications.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"353-356"},"PeriodicalIF":1.4,"publicationDate":"2025-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145560750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reimagining RDMA Through the Lens of ML
Pub Date: 2025-10-22 | DOI: 10.1109/LCA.2025.3624158 | IEEE Computer Architecture Letters, vol. 24, no. 2, pp. 393-396
Ertza Warraich;Ali Imran;Annus Zulfiqar;Shay Vargaftik;Sonia Fahmy;Muhammad Shahbaz
As distributed machine learning (ML) workloads scale to thousands of GPUs connected by ultra-high-speed interconnects, tail latency in collective communication has emerged as a primary bottleneck. Prior RDMA designs, like RoCE, IRN, and SRNIC, enforce strict reliability and in-order delivery, relying on retransmissions and packet sequencing to ensure correctness. While effective for general-purpose workloads, these mechanisms introduce complexity and latency that scale poorly, where even rare packet losses or delays can consistently degrade system performance. We introduce Celeris, a domain-specific RDMA transport that revisits traditional reliability guarantees based on ML's tolerance for lost or partial data. Celeris removes retransmissions and in-order delivery from the RDMA NIC, enabling best-effort transport that exploits the robustness of ML workloads. It retains congestion control (e.g., DCQCN) and manages communication with software-level mechanisms such as adaptive timeouts and data prioritization, while shifting loss recovery to the ML pipeline (e.g., using the Hadamard Transform). Early results show that Celeris reduces 99th-percentile latency by up to 2.3×, cuts BRAM usage by 67%, and nearly doubles NIC resilience to faults, delivering a resilient, scalable transport tailored for ML at cluster scale.
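The reason a best-effort transport can hand loss recovery to the ML pipeline is that a Hadamard transform applied before transmission spreads each gradient value across all transmitted coefficients, so dropping a few packets perturbs every coordinate slightly instead of erasing some of them entirely. A minimal numpy demonstration of that effect (the chunking and loss model are illustrative assumptions, not the Celeris wire format):

import numpy as np

def hadamard(n):
    """Dense n x n Hadamard matrix for n a power of two (Sylvester construction)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

rng = np.random.default_rng(0)
n = 1024
g = rng.standard_normal(n)                           # a gradient chunk
H = hadamard(n) / np.sqrt(n)                         # orthonormal, so H @ H.T = I

coded = H @ g                                        # transform before sending
lost = rng.choice(n, size=n // 10, replace=False)    # ~10% of coefficients never arrive
coded[lost] = 0.0                                    # receiver fills missing packets with zeros
recovered = H.T @ coded                              # inverse transform (H is orthogonal)

# Without the transform, 10% of gradient entries would simply be zeroed out;
# with it, the same loss becomes small, spread-out noise on every entry.
print("max per-entry error :", np.abs(recovered - g).max())
print("relative L2 error   :", np.linalg.norm(recovered - g) / np.linalg.norm(g))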
{"title":"Reimagining RDMA Through the Lens of ML","authors":"Ertza Warraich;Ali Imran;Annus Zulfiqar;Shay Vargaftik;Sonia Fahmy;Muhammad Shahbaz","doi":"10.1109/LCA.2025.3624158","DOIUrl":"https://doi.org/10.1109/LCA.2025.3624158","url":null,"abstract":"As distributed machine learning (ML) workloads scale to thousands of GPUs connected by ultra-high-speed interconnects, tail latency in collective communication has emerged as a primary bottleneck. Prior RDMA designs, like RoCE, IRN, and SRNIC, enforce strict reliability and in-order delivery, relying on retransmissions and packet sequencing to ensure correctness. While effective for general-purpose workloads, these mechanisms introduce complexity and latency that scale poorly, where even rare packet losses or delays can consistently degrade system performance. We introduce Celeris, a domain-specific RDMA transport that revisits traditional reliability guarantees based on ML’s tolerance for lost or partial data. Celeris removes retransmissions and in-order delivery from the RDMA NIC, enabling best-effort transport that exploits the robustness of ML workloads. It retains congestion control (e.g., DCQCN) and manages communication with software-level mechanisms such as adaptive timeouts and data prioritization, while shifting loss recovery to the ML pipeline (e.g., using the Hadamard Transform). Early results show that Celeris reduces 99th-percentile latency by up to 2.3×, cuts BRAM usage by 67%, and nearly doubles NIC resilience to faults—delivering a resilient, scalable transport tailored for ML at cluster scale.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"393-396"},"PeriodicalIF":1.4,"publicationDate":"2025-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145778198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Partial Tag–Data Decoupled Architecture for Last-Level Cache Optimization
Pub Date: 2025-10-20 | DOI: 10.1109/LCA.2025.3623137 | IEEE Computer Architecture Letters, vol. 24, no. 2, pp. 333-336
Honghui Liu;Xian Lin;Xin Zheng;Qiancheng Liu;Huaien Gao;Shuting Cai;Xiaoming Xiong
Modern processors rely on the last-level cache to bridge the growing latency gap between the CPU core and main memory. However, the memory access patterns of contemporary applications exhibit increasing complexity, characterized by significant temporal locality, irregular reuse, and high conflict rates. We propose a partial tag-data decoupling architecture that leverages temporal locality without modifying the main cache structure or replacement policy. A lightweight auxiliary tag path is introduced, where data is allocated only upon reuse confirmation, thus minimizing resource waste caused by low-reuse blocks. The experimental results show that the proposed design achieves an average IPC improvement of 1.55% and a 5.33% reduction in MPKI without prefetching. With prefetching enabled, IPC improves by 1.96% and MPKI is further reduced by 10.91%, while overall storage overhead is decreased by approximately 2.59%.
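A behavioral sketch of the policy described above: a miss first allocates only a tag in the lightweight auxiliary array, and a data entry in the unmodified main cache is allocated only when that tag is re-referenced, so single-use blocks never consume data storage. The structure sizes and the simple LRU/FIFO victim choices here are assumptions for illustration, not the paper's exact microarchitecture:

from collections import OrderedDict

class DecoupledLLC:
    """Toy model: main cache (tag + data) plus an auxiliary tag-only path."""
    def __init__(self, main_entries=4, aux_entries=8):
        self.main = OrderedDict()   # block -> data present (LRU order)
        self.aux = OrderedDict()    # block -> tag only, no data allocated yet
        self.main_entries, self.aux_entries = main_entries, aux_entries

    def access(self, block):
        if block in self.main:      # ordinary hit in the unmodified main cache
            self.main.move_to_end(block)
            return "hit"
        if block in self.aux:       # reuse confirmed: promote and allocate data now
            del self.aux[block]
            self._insert_main(block)
            return "promote"
        # First-touch miss: remember the tag only; low-reuse blocks stop here.
        self.aux[block] = True
        if len(self.aux) > self.aux_entries:
            self.aux.popitem(last=False)
        return "miss"

    def _insert_main(self, block):
        self.main[block] = True
        if len(self.main) > self.main_entries:
            self.main.popitem(last=False)   # evict LRU victim

llc = DecoupledLLC()
for b in [1, 2, 3, 1, 4, 1, 2, 5, 6, 7]:    # only re-referenced blocks (1 and 2) earn data entries
    print(b, llc.access(b))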
{"title":"A Partial Tag–Data Decoupled Architecture for Last-Level Cache Optimization","authors":"Honghui Liu;Xian Lin;Xin Zheng;Qiancheng Liu;Huaien Gao;Shuting Cai;Xiaoming Xiong","doi":"10.1109/LCA.2025.3623137","DOIUrl":"https://doi.org/10.1109/LCA.2025.3623137","url":null,"abstract":"Modern processors rely on the last-level cache to bridge the growing latency gap between the CPU core and main memory. However, the memory access patterns of contemporary applications exhibit increasing complexity, characterized by significant temporal locality, irregular reuse, and high conflict rates. We propose a partial tag-data decoupling architecture that leverages temporal locality without modifying the main cache structure or replacement policy. A lightweight auxiliary tag path is introduced, where data is allocated only upon reuse confirmation, thus minimizing resource waste caused by low-reuse blocks. The experimental results show that the proposed design achieves an average IPC improvement of 1.55% and a 5.33% reduction in MPKI without prefetching. With prefetching enabled, IPC improves by 1.96% and MPKI is further reduced by 10.91%, while overall storage overhead is decreased by approximately 2.59%.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"333-336"},"PeriodicalIF":1.4,"publicationDate":"2025-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System
Pub Date: 2025-10-17 | DOI: 10.1109/LCA.2025.3622724 | IEEE Computer Architecture Letters, vol. 24, no. 2, pp. 337-340
Yunhua Fang;Rui Xie;Asad Ul Haq;Linsen Ma;Kaoutar El Maghraoui;Naigang Wang;Meng Wang;Liu Liu;Tong Zhang
Large Language Model (LLM) inference is increasingly constrained by memory bandwidth, with frequent access to the key-value (KV) cache dominating data movement. While attention sparsity reduces some memory traffic, the relevance of past tokens varies over time, requiring the full KV cache to remain accessible and sustaining pressure on both bandwidth and capacity. With advances in interconnects such as NVLink and LPDDR5X, modern AI hardware now integrates high-bandwidth memory (HBM) with high-speed off-package DRAM, making heterogeneous memory systems a practical solution. This work investigates dynamic KV cache placement across such systems to maximize aggregated bandwidth utilization under capacity constraints. Rather than proposing a specific scheduling policy, we formulate the placement problem mathematically and derive a theoretical upper bound, revealing substantial headroom for runtime optimization. To our knowledge, this is the first formal treatment of dynamic KV cache scheduling in heterogeneous memory systems for LLM inference.
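Under the simplest model consistent with this abstract, HBM and off-package DRAM can be read concurrently, so a KV cache of S bytes split with fraction x in HBM drains in max(x*S/B_hbm, (1-x)*S/B_dram); the bound is reached at x = B_hbm/(B_hbm + B_dram) unless HBM capacity caps x first. The numbers and the two-tier model below are illustrative assumptions used to show the shape of that upper bound, not results from the letter:

def best_placement(S_gb, hbm_bw, dram_bw, hbm_cap_gb):
    """Fraction of KV bytes kept in HBM that maximizes aggregate read bandwidth."""
    x_balanced = hbm_bw / (hbm_bw + dram_bw)                 # equalizes the two drain times
    x = min(x_balanced, hbm_cap_gb / S_gb)                   # the capacity constraint may bind
    t = max(x * S_gb / hbm_bw, (1 - x) * S_gb / dram_bw)     # seconds to stream the whole cache
    return x, S_gb / t                                       # placement and effective bandwidth (GB/s)

# Example: 3 TB/s HBM with 80 GB usable for KV, plus 0.5 TB/s off-package DRAM.
for S in (40, 200, 800):                                     # KV cache footprint in GB
    x, eff_bw = best_placement(S, hbm_bw=3000, dram_bw=500, hbm_cap_gb=80)
    print(f"KV={S:4d} GB  HBM fraction={x:.2f}  effective bandwidth={eff_bw:6.0f} GB/s")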
{"title":"Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System","authors":"Yunhua Fang;Rui Xie;Asad Ul Haq;Linsen Ma;Kaoutar El Maghraoui;Naigang Wang;Meng Wang;Liu Liu;Tong Zhang","doi":"10.1109/LCA.2025.3622724","DOIUrl":"https://doi.org/10.1109/LCA.2025.3622724","url":null,"abstract":"Large Language Model (LLM) inference is increasingly constrained by memory bandwidth, with frequent access to the key-value (KV) cache dominating data movement. While attention sparsity reduces some memory traffic, the relevance of past tokens varies over time, requiring the full KV cache to remain accessible and sustaining pressure on both bandwidth and capacity. With advances in interconnects such as NVLink and LPDDR5X, modern AI hardware now integrates high-bandwidth memory (HBM) with high-speed off-package DRAM, making heterogeneous memory systems a practical solution. This work investigates dynamic KV cache placement across such systems to maximize aggregated bandwidth utilization under capacity constraints. Rather than proposing a specific scheduling policy, we formulate the placement problem mathematically and derive a theoretical upper bound, revealing substantial headroom for runtime optimization. To our knowledge, this is the first formal treatment of dynamic KV cache scheduling in heterogeneous memory systems for LLM inference.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"337-340"},"PeriodicalIF":1.4,"publicationDate":"2025-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455932","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}