
Latest publications in IEEE Computer Architecture Letters

A Quantitative Analysis of Mamba-2-Based Large Language Model: Study of State Space Duality
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-09-11 · DOI: 10.1109/LCA.2025.3609283
Gyeongrok Yang;Jaeha Min;In-Jun Jung;Joo-Young Kim
Mamba is based on a state space model (SSM) to address limitations of attention-based large language models (LLMs) associated with long-context processing. While Mamba achieves accuracy comparable to attention-based LLMs, it introduces recurrent computation that limits efficiency during the prefill phase of inference. To mitigate this, Mamba-2 introduces the state space duality (SSD), which increases parallelism during multi-token processing. However, its workload characteristics remain unexamined from a systems and architectural perspective. This work presents a system-level analysis of SSD in Mamba-2, characterizing its compute and memory behavior on modern hardware. Our findings reveal the computational characteristics of SSD and provide the first architectural insight into its execution. In addition, we identify performance bottlenecks and propose directions for addressing them in future work.
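As a hedged illustration of the duality the paper studies, the sketch below reduces the SSM recurrence to scalars (the chunk size, names, and scalar state are illustrative assumptions, not Mamba-2's actual kernels): chunked evaluation keeps intra-chunk work independent of the carried-in state, which is the structure SSD turns into parallel matmul-like work during prefill.

```python
# Scalar sketch of the recurrence behind state space duality (SSD).
# Illustrative simplification: real Mamba-2 uses matrix-valued states and
# computes the intra-chunk part as attention-like matmuls.

def ssm_sequential(a, b, c, x):
    """Reference scan: h_t = a_t*h_{t-1} + b_t*x_t, y_t = c_t*h_t."""
    h, ys = 0.0, []
    for a_t, b_t, c_t, x_t in zip(a, b, c, x):
        h = a_t * h + b_t * x_t
        ys.append(c_t * h)
    return ys

def ssm_chunked(a, b, c, x, chunk=4):
    """Chunked evaluation: within a chunk the input contributions are
    independent of the carried-in state (parallelizable); only a single
    state h is handed across chunk boundaries."""
    ys, h = [], 0.0
    for s in range(0, len(x), chunk):
        decay, g = 1.0, 0.0
        for t in range(s, min(s + chunk, len(x))):
            decay *= a[t]               # cumulative decay of carried-in state
            g = a[t] * g + b[t] * x[t]  # intra-chunk term (a matmul in SSD)
            ys.append(c[t] * (decay * h + g))
        h = decay * h + g               # one state handoff per chunk
    return ys
```

By linearity, both functions produce identical outputs; the chunked form simply exposes more parallelism per chunk, at the cost of the recurrent handoff the paper characterizes.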
IEEE Computer Architecture Letters, vol. 24, no. 2, pp. 309–312.
Citations: 0
PIMsynth: A Unified Compiler Framework for Bit-Serial Processing-in-Memory Architectures
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-08-19 · DOI: 10.1109/LCA.2025.3600588
Deyuan Guo;Mohammadhosein Gholamrezaei;Matthew Hofmann;Ashish Venkat;Zhiru Zhang;Kevin Skadron
Bit-serial processing-in-memory (PIM) architectures have been extensively studied, yet a standardized tool for generating efficient bit-serial code is lacking, hindering fair comparisons. We present a fully automated compiler framework, PIMsynth, for bit-serial PIM architectures, targeting both digital and analog substrates. The compiler takes Verilog as input and generates optimized micro-operation code for programmable bit-serial PIM backends. Our flow integrates logic synthesis, optimization steps, instruction scheduling, and backend code generation into a unified toolchain. With the compiler, we provide a bit-serial compilation benchmark suite designed for efficient bit-serial code generation. To enable correctness and performance validation, we extend an existing PIM simulator to support compiler-generated micro-op-level workloads. Preliminary results demonstrate that the compiler generates competitive bit-serial code within 1.08× and 1.54× of hand-optimized digital and analog PIM baselines.
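For readers unfamiliar with the execution model such a compiler targets, here is a hedged, pure-Python sketch of bit-serial addition over bit-transposed data; each outer iteration models one micro-op step applied to every lane at once (the layout and helper names are illustrative, not PIMsynth's actual output format).

```python
# Bit-serial execution sketch: operands are stored bit-transposed (one bit
# position per memory row), and an n-bit add is a sequence of per-bit
# micro-ops applied to all lanes in parallel.

def bit_serial_add(a_bits, b_bits):
    """a_bits[i] / b_bits[i] hold bit i of every lane (LSB first).
    One loop iteration = one bit position = one micro-op step."""
    lanes = len(a_bits[0])
    carry = [0] * lanes
    out = []
    for a_row, b_row in zip(a_bits, b_bits):
        out.append([x ^ y ^ c for x, y, c in zip(a_row, b_row, carry)])
        carry = [(x & y) | (c & (x ^ y))
                 for x, y, c in zip(a_row, b_row, carry)]
    return out

def transpose(vals, nbits):
    """Pack integers into the bit-transposed layout."""
    return [[(v >> i) & 1 for v in vals] for i in range(nbits)]

def untranspose(bits):
    """Recover integers from the bit-transposed layout."""
    return [sum(row[lane] << i for i, row in enumerate(bits))
            for lane in range(len(bits[0]))]
```

The per-bit latency is why hand-tuned micro-op schedules matter and why the 1.08×/1.54× gap to hand-optimized baselines is the figure of merit.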
IEEE Computer Architecture Letters, vol. 24, no. 2, pp. 277–280.
Citations: 0
AiDE: Attention-FFN Disaggregated Execution for Cost-Effective LLM Decoding on CXL-PNM
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-08-11 · DOI: 10.1109/LCA.2025.3597323
KyungSoo Kim;Omin Kwon;Yeonhong Park;Jae W. Lee
Disaggregating the prefill and decode phases has recently emerged as a promising strategy in large language model (LLM) serving systems, driven by the distinct resource demands of each phase. Inspired by this coarse-grained disaggregation, we identify a similar opportunity within the decode phase itself: the feedforward network (FFN) is compute-intensive, whereas attention is constrained by memory bandwidth and capacity due to its key-value (KV) cache. To exploit this heterogeneity, we introduce AiDE, a heterogeneous decoding cluster that executes FFN operations on GPUs while offloading attention computations to Compute Express Link-based Processing Near Memory (CXL-PNM) devices. CXL-PNM provides scalable memory bandwidth and capacity, making it well-suited for attention-heavy workloads. In addition, we propose a batch-level pipelining approach enhanced with request scheduling to optimize the utilization of heterogeneous resources.
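The resource split motivating AiDE shows up in a back-of-envelope arithmetic-intensity model (the dimensions, fp16 storage, and simplified FLOP counts below are illustrative assumptions, not the paper's measurements): FFN intensity grows with batch size because weights are shared across requests, while decode attention stays near one FLOP per byte because each request streams its own KV cache.

```python
# Hedged back-of-envelope model of decode-phase arithmetic intensity.
# Dimensions, fp16 storage (2 bytes), and the simplified FLOP counts are
# illustrative assumptions.

def ffn_intensity(d_model, d_ff, batch, bytes_per=2):
    """FLOPs per byte for the two FFN projections over a decode batch:
    weights are read once and reused by every request in the batch."""
    flops = 2 * (2 * d_model * d_ff) * batch       # up/down proj, MAC = 2 FLOPs
    bytes_moved = 2 * (d_model * d_ff) * bytes_per  # weights fetched once
    return flops / bytes_moved

def attn_intensity(d_model, ctx, batch, bytes_per=2):
    """FLOPs per byte for decode attention: every request streams its own
    KV cache, so batching does not improve intensity."""
    flops = 2 * (2 * d_model * ctx) * batch         # QK^T and PV per request
    bytes_moved = 2 * (d_model * ctx) * bytes_per * batch
    return flops / bytes_moved
```

Under this model FFN intensity equals the batch size while attention intensity is pinned at one, which is exactly the compute-bound versus bandwidth-bound split AiDE maps onto GPUs versus CXL-PNM.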
IEEE Computer Architecture Letters, vol. 24, no. 2, pp. 285–288.
Citations: 0
CABANA: Cluster-Aware Query Batching for Accelerating Billion-Scale ANNS With Intel AMX
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-08-08 · DOI: 10.1109/LCA.2025.3596970
Minho Kim;Houxiang Ji;Jaeyoung Kang;Hwanjun Lee;Daehoon Kim;Nam Sung Kim
Retrieval-augmented generation (RAG) systems increasingly rely on Approximate Nearest Neighbor Search (ANNS) to efficiently retrieve relevant context from billion-scale vector databases. While IVF-based ANNS frameworks scale well overall, the fine search stage remains a bottleneck due to its compute-intensive GEMV operations, particularly under large query volumes. To address this, we propose CABANA, a cluster-aware query batching for ANNS acceleration mechanism using Intel Advanced Matrix Extensions (AMX) that reformulates these GEMV computations into high-throughput GEMM operations. By aggregating queries targeting the same clusters, CABANA enables batched computation during fine search, significantly improving compute intensity and memory access regularity. Evaluations on billion-scale datasets show that CABANA outperforms traditional SIMD-based implementations, achieving up to 32.6× higher query throughput with minimal overhead, while maintaining high recall rates.
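A minimal sketch of the batching idea, assuming an IVF-style layout (pure-Python stand-in with illustrative names; the real fine-search kernel runs as Intel AMX tile GEMMs): queries assigned to the same cluster are grouped so scoring becomes one matrix-shaped kernel per cluster instead of one GEMV per query.

```python
# Hedged sketch of CABANA-style cluster-aware query batching.
from collections import defaultdict

def dot_scores(Q, D):
    """All query-vs-database dot products: a GEMM-shaped kernel (Q @ D^T)."""
    return [[sum(q * d for q, d in zip(qv, dv)) for dv in D] for qv in Q]

def batched_fine_search(queries, assignments, clusters):
    """Group queries by their probed cluster, then score each group as one
    batch instead of issuing one GEMV per query.
    assignments[i] -> cluster id for queries[i]; clusters[cid] -> vectors."""
    groups = defaultdict(list)
    for qi, cid in enumerate(assignments):
        groups[cid].append(qi)
    scores = {}
    for cid, qids in groups.items():
        batch = [queries[qi] for qi in qids]   # co-clustered queries
        for row, qi in zip(dot_scores(batch, clusters[cid]), qids):
            scores[qi] = row
    return scores
```

Grouping also makes the cluster's vectors stream through the cache once per batch rather than once per query, which is the memory-access-regularity benefit the abstract mentions.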
IEEE Computer Architecture Letters, vol. 24, no. 2, pp. 289–292. Open access.
Citations: 0
Checkflow: Low-Overhead Checkpointing for Deep Learning Training
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-08-07 · DOI: 10.1109/LCA.2025.3596616
Hangyu Liu;Shouxi Luo;Ke Li;Huanlai Xing;Bo Peng
During the time-consuming training of deep neural network (DNN) models, the worker has to periodically create checkpoints for tensors like the model parameters and optimizer state to support fast failover. However, due to the high overhead of checkpointing, existing schemes generally create checkpoints at a very low frequency, making recovery inefficient since the unsaved training progress would get lost. In this paper, we propose Checkflow, a low-overhead checkpointing scheme, which enables per-iteration checkpointing for DNN training with minimal or even zero training slowdown. The power of Checkflow stems from the design of (i) decoupling a tensor's checkpoint operation into snapshot-then-offload, and (ii) scheduling these operations appropriately, following the results of the math models.
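The snapshot-then-offload decoupling can be sketched with a toy timeline model (the timings and the single-snapshot-buffer constraint are illustrative assumptions, not Checkflow's actual scheduler): the brief in-GPU snapshot serializes with training, while the GPU-to-CPU offload overlaps the next iteration's compute.

```python
# Hedged toy timeline for snapshot-then-offload checkpoint scheduling.

def schedule(iters, t_compute, t_snapshot, t_offload):
    """Per iteration: a brief in-GPU snapshot stalls training, then the
    GPU->CPU offload runs asynchronously, overlapping the next iteration's
    compute. Returns total wall-clock time."""
    t, offload_done = 0.0, 0.0
    for _ in range(iters):
        t = max(t, offload_done)      # buffer reuse: wait if offload lags
        t += t_snapshot               # snapshot: fast copy in GPU memory
        offload_done = t + t_offload  # async transfer to host memory
        t += t_compute                # training proceeds immediately
    return max(t, offload_done)
```

When the offload time fits inside one compute step, the per-iteration checkpoint cost collapses to the snapshot alone, matching the claim that per-iteration checkpointing can be nearly free given sufficient GPU-CPU bandwidth.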
IEEE Computer Architecture Letters, vol. 24, no. 2, pp. 281–284.
Citations: 0
RAESC: A Reconfigurable AES Countermeasure Architecture for RISC-V With Enhanced Power Side-Channel Resilience
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-08-01 · DOI: 10.1109/LCA.2025.3595003
Nayana Rajeev;Cathrene Biju;Titu Mary Ignatius;Roy Paily Palathinkal;Rekha K James
This paper presents RAESC, a reconfigurable Advanced Encryption Standard (AES) countermeasure hardware design that supports AES-128, AES-192, and AES-256 types, enhancing flexibility and resource efficiency in IoT applications. The design incorporates a countermeasure to protect against Power-based Side Channel Attacks (PSCA) by randomizing the AES type based on input plaintext, ensuring improved security. The RAESC is integrated with an RV32IM RISC-V processor, offering streamlined operation and enhanced system security. Performance analysis shows that RAESC’s adaptive encryption strength achieves a balanced trade-off in area, power, and throughput, making it ideal for resource-constrained, security-sensitive IoT applications. Power traces for CPA attacks are generated on Application Specific Integrated Circuit (ASIC) and the design achieves a notable reduction in the Signal to Noise Ratio (SNR) and an increase in the Measurements to Disclose (MTD), demonstrating strong resilience against cryptographic attacks.
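Only the variant-selection policy lends itself to a short software sketch (the hash choice and the mapping below are illustrative assumptions; RAESC realizes the selection in hardware alongside the actual AES datapaths): deriving the AES key size from the input plaintext makes consecutive power traces a data-dependent mix of AES-128/192/256 executions, which is what frustrates correlation power analysis.

```python
# Hedged sketch of the randomized-variant selection idea only.
import hashlib

def select_aes_variant(plaintext: bytes) -> int:
    """Derive the AES key size from the input block so the executed
    variant, and hence the power signature, varies with the data."""
    digest = hashlib.sha256(plaintext).digest()
    return (128, 192, 256)[digest[0] % 3]
```

An attacker aligning traces by round now averages over three different round counts and datapath activities, which is the SNR-reduction mechanism the abstract reports.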
IEEE Computer Architecture Letters, vol. 24, no. 2, pp. 273–276.
Citations: 0
RoSR: A Novel Selective Retransmission FPGA Architecture for RDMA NICs
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-07-31 · DOI: 10.1109/LCA.2025.3594110
Mengting Zhang;Zhichuan Guo;Shining Sun
Remote Direct Memory Access (RDMA) enables low-latency datacenter networks but suffers from inefficient loss recovery using Go-Back-N (GBN). GBN retransmits entire packet windows, degrading Flow Completion Time (FCT) under congestion. We introduce RoSR, a novel selective retransmission architecture for Field-Programmable Gate Array (FPGA)-based RDMA NICs that supports hardware-accelerated direct writes of out-of-order (OoO) packets. RoSR supports efficient OoO packet reception and enables fine-grained retransmission using a dynamic shared bitmap for packet tracking. By extending the RDMA over Converged Ethernet version 2 (RoCEv2) packet format, RoSR facilitates selective retransmission. It triggers retransmissions via timeouts using bitmap blocks and introduces new Nack-bitmap and rd-req-bitmap messages for loss reporting. Under 1% packet loss, RoSR achieves up to 13.5× (RDMA Write) and 15.6× (RDMA Read) higher throughput than Xilinx ERNIC. In NS-3 simulations using the HPCC RDMA stack, RoSR reduces FCT slowdown by 3× to 6× compared to GBN across various packet loss rates, congestion control algorithms (DCQCN, HPCC, Timely), and traffic patterns, while maintaining robustness under high round-trip time (RTT) conditions.
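A hedged sketch of bitmap-tracked loss recovery versus Go-Back-N (the window size and message shapes are illustrative, not the RoCEv2 extension itself): the receiver accepts out-of-order packets and reports only the holes, so the sender retransmits far less than GBN, which must resend everything from the first loss onward.

```python
# Hedged sketch of bitmap-based selective retransmission vs Go-Back-N.

def receive(window_size, arrived_seqs):
    """Receiver side: out-of-order packets are written directly and marked
    in a shared bitmap instead of being dropped."""
    bitmap = [0] * window_size
    for seq in arrived_seqs:
        bitmap[seq] = 1
    return bitmap

def nack_bitmap_holes(bitmap):
    """Sequence numbers the sender must retransmit (the NACK-bitmap)."""
    return [i for i, bit in enumerate(bitmap) if bit == 0]

def gbn_retransmit_count(window_size, arrived_seqs):
    """Go-Back-N resends everything from the first hole onward."""
    holes = set(range(window_size)) - set(arrived_seqs)
    return 0 if not holes else window_size - min(holes)
```

With two losses in an eight-packet window, the selective scheme resends two packets where GBN resends six; that gap widens with window size and loss rate, which is where the reported FCT improvements come from.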
IEEE Computer Architecture Letters, vol. 24, no. 2, pp. 269–272.
Citations: 0
SSD Offloading for LLM Mixture-of-Experts Weights Considered Harmful in Energy Efficiency
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-07-24 · DOI: 10.1109/LCA.2025.3592563
Kwanhee Kyung;Sungmin Yun;Jung Ho Ahn
Large Language Models (LLMs) applying Mixture-of-Experts (MoE) scale to trillions of parameters but require vast memory, motivating a line of research to offload expert weights from fast-but-small DRAM (HBM) to denser Flash SSDs. While SSDs provide cost-effective capacity, their read energy per bit is substantially higher than that of DRAM. This paper quantitatively analyzes the energy implications of offloading MoE expert weights to SSDs during the critical decode stage of LLM inference. Our analysis, comparing SSD, CPU memory (DDR), and HBM storage scenarios for models like DeepSeek-R1, reveals that offloading MoE weights to current SSDs drastically increases per-token-generation energy consumption (e.g., by up to ~12× compared to the HBM baseline), dominating the total inference energy budget. Although techniques like prefetching effectively hide access latency, they cannot mitigate this fundamental energy penalty.
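The core energy argument is simple arithmetic, sketched below with assumed, order-of-magnitude pJ/bit figures (placeholders chosen only to show the shape of the comparison, not the paper's measured values): per-token expert-read energy scales linearly with per-bit read energy, so a large Flash-versus-DRAM per-bit gap dominates regardless of how well latency is hidden.

```python
# Hedged energy arithmetic; the pJ/bit constants are assumed placeholders.

HBM_PJ_PER_BIT = 7.0     # assumed HBM read energy per bit
SSD_PJ_PER_BIT = 250.0   # assumed Flash read path (NAND + controller + link)

def expert_read_energy_joules(active_params, bytes_per_param, pj_per_bit):
    """Energy to fetch the activated expert weights for one token:
    bits moved times read energy per bit."""
    bits = active_params * bytes_per_param * 8
    return bits * pj_per_bit * 1e-12
```

Because prefetching moves the same bits, it leaves this term untouched; only a lower per-bit read energy, roughly the order-of-magnitude Flash improvement the authors posit, changes the conclusion.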
IEEE Computer Architecture Letters, vol. 24, no. 2, pp. 265–268.
Citations: 0
Correct Wrong Path
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-07-23 · DOI: 10.1109/LCA.2025.3542809
Bhargav Reddy Godala;Sankara Prasad Ramesh;Krishnam Tibrewala;Chrysanthos Pepi;Gino Chacon;Svilen Kanev;Gilles A. Pokam;Alberto Ros;Daniel A. Jiménez;Paul V. Gratz;David I. August
Modern OOO CPUs have very deep pipelines with large branch misprediction recovery penalties. Speculatively executed instructions on the wrong path can significantly change cache state, depending on speculation levels. Architects often employ trace-driven simulation models in the design exploration stage, which sacrifice precision for speed. Trace-driven simulators are orders of magnitude faster than execution-driven models, reducing the often hundreds of thousands of simulation hours needed to explore new micro-architectural ideas. Despite the strong benefits of trace-driven simulation, it often fails to adequately model the consequences of wrong-path execution because obtaining such traces from real systems is nontrivial. Prior works exclusively consider either pollution or prefetching in the instruction stream/L1-I cache and often ignore the impact on the data stream. Here, we examine wrong-path execution in simulation results and design a set of infrastructure for enabling wrong-path execution in a trace-driven simulator. Our analysis shows the wrong path affects structures on both the instruction and data sides extensively, resulting in performance variations ranging from −3.05% to 20.9% versus ignoring the wrong path. To benefit the research community and enhance the accuracy of simulators, we have opened our traces and tracing utility in the hope that industry can provide wrong-path traces generated by their internal simulators, enabling academic simulation without exposing industry IP.
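A toy trace-driven model (illustrative cache size and trace format, far simpler than a real simulator) shows the effect being measured: dropping wrong-path accesses misses both the prefetching and the pollution they impose on cache state, so measured hit rates diverge from reality in either direction.

```python
# Toy trace-driven cache model contrasting simulation with and without
# wrong-path accesses. Real simulators model far more state (MSHRs,
# prefetchers, TLBs); this only illustrates the state perturbation.
from collections import OrderedDict

class LRUCache:
    def __init__(self, nlines):
        self.nlines, self.lines = nlines, OrderedDict()

    def access(self, addr):
        """Return True on hit; insert with LRU eviction on miss."""
        if addr in self.lines:
            self.lines.move_to_end(addr)
            return True
        self.lines[addr] = True
        if len(self.lines) > self.nlines:
            self.lines.popitem(last=False)
        return False

def simulate(trace, model_wrong_path):
    """trace entries: ('commit', addr) or ('wrong', addr). Hit rate is
    measured over committed accesses only; wrong-path accesses still
    perturb cache state (prefetching or pollution) when modeled."""
    cache, hits, total = LRUCache(4), 0, 0
    for kind, addr in trace:
        if kind == "wrong" and not model_wrong_path:
            continue  # classic trace-driven simulation drops the wrong path
        hit = cache.access(addr)
        if kind == "commit":
            hits += int(hit)
            total += 1
    return hits / total
```

In the test below, a wrong-path load prefetches a line a committed load later reuses; a simulator that skips the wrong path reports a miss where real hardware would hit, the kind of divergence behind the −3.05% to 20.9% range the letter reports.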
IEEE Computer Architecture Letters, vol. 24, no. 2, pp. 221–224.
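The core mechanism the abstract describes, letting wrong-path memory accesses update the cache model while squashing their architectural effects, can be sketched minimally as follows. This is an assumed illustration, not the authors' released tracing infrastructure; the trace format and cache model are invented for the example.

```python
# Minimal sketch of wrong-path modeling in a trace-driven simulator:
# wrong-path accesses mutate cache state (pollution or prefetching),
# but only correct-path records retire architecturally.

class DirectMappedCache:
    def __init__(self, n_sets: int, line_bytes: int = 64):
        self.n_sets = n_sets
        self.line_bytes = line_bytes
        self.tags = [None] * n_sets
        self.hits = self.misses = 0

    def access(self, addr: int) -> bool:
        line = addr // self.line_bytes
        idx, tag = line % self.n_sets, line // self.n_sets
        hit = self.tags[idx] == tag
        self.hits += hit
        self.misses += not hit
        self.tags[idx] = tag          # fill on miss, refresh on hit
        return hit

def replay(trace, cache):
    """trace: list of (addr, on_wrong_path) records.

    Correct-path records retire normally; wrong-path records only
    touch the cache (their architectural effects are discarded at
    the squash, so nothing else is updated for them)."""
    retired = 0
    for addr, on_wrong_path in trace:
        cache.access(addr)            # cache state changes either way
        if not on_wrong_path:
            retired += 1              # only the correct path retires
    return retired

cache = DirectMappedCache(n_sets=64)
trace = [(0x1000, True), (0x1000, False)]   # wrong-path touch, then reuse
replay(trace, cache)
print(cache.hits, cache.misses)  # → 1 1: the wrong path prefetched the line
```

Flipping the example so the wrong-path access maps to the same set as a later correct-path line shows the opposite effect, pollution, which is why ignoring the wrong path can bias results in either direction.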
Per-Row Activation Counting on Real Hardware: Demystifying Performance Overheads 真实硬件上的每行激活计数:揭开性能开销的神秘面纱
IF 1.4 CAS Tier 3 (Computer Science) Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2025-07-10 DOI: 10.1109/LCA.2025.3587293
Jumin Kim;Seungmin Baek;Minbok Wi;Hwayong Nam;Michael Jaemin Kim;Sukhan Lee;Kyomin Sohn;Jung Ho Ahn
Per-Row Activation Counting (PRAC), a DRAM read disturbance mitigation method, modifies key DRAM timing parameters, reportedly causing significant performance overheads in simulator-based studies. However, given known discrepancies between simulators and real hardware, real-machine experiments are vital for accurate PRAC performance estimation. We present the first real-machine performance analysis of PRAC. After verifying timing modifications on the latest CPUs using microbenchmarks, our analysis shows that PRAC’s average and maximum overheads are just 1.06% and 3.28% for the SPEC CPU2017 workloads, up to 9.15× lower than simulator-based reports. Further, we show that the close-page policy minimizes this overhead by effectively hiding the DRAM row precharge operations elongated by PRAC off the critical path.
IEEE Computer Architecture Letters, vol. 24, no. 2, pp. 217–220.
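The counting scheme PRAC implements can be illustrated with a toy model. The sketch below is an assumed simplification: real PRAC keeps the counter inside the DRAM row itself and updates it during precharge (which is what lengthens the timing parameters the letter measures), signaling the host through an alert/back-off handshake. The threshold and reset policy here are placeholders, not standardized values.

```python
# Toy model of per-row activation counting: every activation bumps the
# target row's counter; crossing a threshold raises a back-off alert so
# the device can mitigate, after which the counter resets.

class PracBank:
    def __init__(self, n_rows: int, threshold: int):
        self.counters = [0] * n_rows   # one activation counter per row
        self.threshold = threshold     # placeholder mitigation threshold
        self.alerts = 0

    def activate(self, row: int) -> bool:
        """Returns True if this activation triggers a back-off alert."""
        self.counters[row] += 1
        if self.counters[row] >= self.threshold:
            self.alerts += 1
            self.counters[row] = 0     # row mitigated; counter resets
            return True
        return False

bank = PracBank(n_rows=8, threshold=4)
# Hammer one row repeatedly, as a read-disturbance attack would.
hammer_alerts = sum(bank.activate(row=3) for _ in range(10))
print(hammer_alerts)  # → 2: ten activations against a threshold of four
```

The performance question the letter studies is orthogonal to this logic: the counter read-modify-write happens on every precharge, benign or not, so the overhead shows up as lengthened row-precharge timing rather than as alert handling.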