
Latest Publications in IEEE Computer Architecture Letters

PNM Meets Sparse Attention: Enabling Multi-Million Tokens Inference at Scale
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-10-22 · DOI: 10.1109/LCA.2025.3624272
Sookyung Choi;Myunghyun Rhee;Euiseok Kim;Kwangsik Shin;Youngpyo Joo;Hoshik Kim
Processing multi-million tokens for advanced Large Language Models (LLMs) poses a significant memory bottleneck for existing AI systems. This bottleneck stems from a fundamental resource imbalance: enormous memory capacity and bandwidth are required, yet the computational load is minimal. We propose NELSSA (Processing Near Memory for Extremely Long Sequences with Sparse Attention), an architectural platform that combines high-capacity Processing Near Memory (PNM) with dynamic sparse attention to address this imbalance. The approach enables capacity scaling without performance degradation, and our evaluation shows that NELSSA can process sequences of up to 20M tokens on a single node (Llama-2-70B), achieving an 11× to 40× speedup over a representative DIMM-based PNM system. The proposed architecture resolves existing inefficiencies, enabling previously impractical multi-million-token processing and laying a foundation for next-generation AI applications.
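For a concrete picture of the sparse-attention half of the design (the PNM hardware itself is abstracted away here), the sketch below shows dynamic block-sparse attention for a single query: a cheap per-block proxy score picks a handful of KV blocks, and exact attention runs only over those. The block size, the mean-key proxy, and the top-k budget are illustrative assumptions, not NELSSA's actual selection logic.

```python
import numpy as np

def sparse_attention(q, K, V, block_size=512, top_k=8):
    """Dynamic block-sparse attention sketch: score KV blocks by a cheap
    proxy (query vs. per-block mean key), then run exact attention over
    only the top-k blocks. Shapes and parameters are illustrative."""
    n, d = K.shape
    n_blocks = (n + block_size - 1) // block_size
    # Cheap per-block proxy: dot product with the block's mean key.
    block_means = np.array([K[i*block_size:(i+1)*block_size].mean(axis=0)
                            for i in range(n_blocks)])
    proxy = block_means @ q
    chosen = np.argsort(proxy)[-top_k:]               # top-k candidate blocks
    idx = np.concatenate([np.arange(b*block_size, min((b+1)*block_size, n))
                          for b in chosen])
    scores = (K[idx] @ q) / np.sqrt(d)                # exact scores, selected blocks only
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V[idx]

q = np.random.randn(64)
K = np.random.randn(16384, 64)
V = np.random.randn(16384, 64)
out = sparse_attention(q, K, V)                       # attends to 8*512 of 16384 tokens
```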
Citations: 0
Reimagining RDMA Through the Lens of ML
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-10-22 · DOI: 10.1109/LCA.2025.3624158
Ertza Warraich;Ali Imran;Annus Zulfiqar;Shay Vargaftik;Sonia Fahmy;Muhammad Shahbaz
As distributed machine learning (ML) workloads scale to thousands of GPUs connected by ultra-high-speed interconnects, tail latency in collective communication has emerged as a primary bottleneck. Prior RDMA designs, such as RoCE, IRN, and SRNIC, enforce strict reliability and in-order delivery, relying on retransmissions and packet sequencing to ensure correctness. While effective for general-purpose workloads, these mechanisms introduce complexity and latency that scale poorly: even rare packet losses or delays can consistently degrade system performance. We introduce Celeris, a domain-specific RDMA transport that revisits traditional reliability guarantees in light of ML's tolerance for lost or partial data. Celeris removes retransmissions and in-order delivery from the RDMA NIC, enabling best-effort transport that exploits the robustness of ML workloads. It retains congestion control (e.g., DCQCN) and manages communication with software-level mechanisms such as adaptive timeouts and data prioritization, while shifting loss recovery to the ML pipeline (e.g., using the Hadamard Transform). Early results show that Celeris reduces 99th-percentile latency by up to 2.3×, cuts BRAM usage by 67%, and nearly doubles NIC resilience to faults, delivering a resilient, scalable transport tailored for ML at cluster scale.
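The Hadamard-based loss recovery the abstract alludes to can be illustrated on its own: an orthonormal Walsh–Hadamard transform spreads a gradient's energy evenly across coefficients, so coefficients carried by lost packets can be zero-filled at the receiver and the inverse transform still yields a low-error approximation. This is a minimal numerical sketch; packetization and Celeris's NIC-level integration are abstracted away.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform; length must be a power of two."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), h * 2):
            a = x[i:i+h].copy()
            b = x[i+h:i+2*h].copy()
            x[i:i+h] = a + b
            x[i+h:i+2*h] = a - b
        h *= 2
    return x

n = 1024
grad = np.random.randn(n)
coded = fwht(grad) / np.sqrt(n)          # orthonormal transform spreads energy

# Simulate best-effort transport: 5% of coefficients are lost in flight.
lost = np.random.rand(n) < 0.05
received = np.where(lost, 0.0, coded)    # receiver zero-fills missing data

# Hadamard is self-inverse up to scaling; the result approximates the
# original gradient with relative error ~sqrt(loss fraction), not a bitwise copy.
approx = fwht(received) / np.sqrt(n)
rel_err = np.linalg.norm(approx - grad) / np.linalg.norm(grad)
```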
Citations: 0
A Partial Tag–Data Decoupled Architecture for Last-Level Cache Optimization
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-10-20 · DOI: 10.1109/LCA.2025.3623137
Honghui Liu;Xian Lin;Xin Zheng;Qiancheng Liu;Huaien Gao;Shuting Cai;Xiaoming Xiong
Modern processors rely on the last-level cache to bridge the growing latency gap between the CPU core and main memory. However, the memory access patterns of contemporary applications exhibit increasing complexity, characterized by significant temporal locality, irregular reuse, and high conflict rates. We propose a partial tag-data decoupling architecture that leverages temporal locality without modifying the main cache structure or replacement policy. A lightweight auxiliary tag path is introduced, where data is allocated only upon reuse confirmation, thus minimizing resource waste caused by low-reuse blocks. The experimental results show that the proposed design achieves an average IPC improvement of 1.55% and a 5.33% reduction in MPKI without prefetching. With prefetching enabled, IPC improves by 1.96% and MPKI is further reduced by 10.91%, while overall storage overhead is decreased by approximately 2.59%.
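A toy software model of the decoupled allocation flow may help, with illustrative sizes and plain LRU in both structures (the letter keeps the main cache's replacement policy unchanged): a block first earns a tag-only entry, and data is allocated only when the tag is re-referenced.

```python
from collections import OrderedDict

class DecoupledLLC:
    """Toy model of partial tag-data decoupling: a block's first touch
    allocates only a tag in an auxiliary path; a second touch (reuse
    confirmed) promotes it to a full tag+data entry in the main cache.
    Sizes and the LRU policy are illustrative assumptions."""
    def __init__(self, data_entries=4, tag_entries=8):
        self.data = OrderedDict()   # address -> block (tag + data)
        self.tags = OrderedDict()   # address -> None  (tag only, no data)
        self.data_entries, self.tag_entries = data_entries, tag_entries

    def access(self, addr):
        if addr in self.data:                  # hit in main cache
            self.data.move_to_end(addr)
            return "hit"
        if addr in self.tags:                  # reuse confirmed: promote
            del self.tags[addr]
            if len(self.data) >= self.data_entries:
                self.data.popitem(last=False)  # evict LRU data block
            self.data[addr] = "block"
            return "miss-promote"
        if len(self.tags) >= self.tag_entries:
            self.tags.popitem(last=False)      # evict LRU tag
        self.tags[addr] = None                 # first touch: tag only
        return "miss-track"

llc = DecoupledLLC()
trace = [0x10, 0x20, 0x10, 0x30, 0x10, 0x20]
print([llc.access(a) for a in trace])
# ['miss-track', 'miss-track', 'miss-promote', 'miss-track', 'hit', 'miss-promote']
```

The single-use block 0x30 never consumes a data entry, which is the resource saving the letter targets.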
Citations: 0
Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-10-17 · DOI: 10.1109/LCA.2025.3622724
Yunhua Fang;Rui Xie;Asad Ul Haq;Linsen Ma;Kaoutar El Maghraoui;Naigang Wang;Meng Wang;Liu Liu;Tong Zhang
Large Language Model (LLM) inference is increasingly constrained by memory bandwidth, with frequent access to the key-value (KV) cache dominating data movement. While attention sparsity reduces some memory traffic, the relevance of past tokens varies over time, requiring the full KV cache to remain accessible and sustaining pressure on both bandwidth and capacity. With advances in interconnects such as NVLink and LPDDR5X, modern AI hardware now integrates high-bandwidth memory (HBM) with high-speed off-package DRAM, making heterogeneous memory systems a practical solution. This work investigates dynamic KV cache placement across such systems to maximize aggregated bandwidth utilization under capacity constraints. Rather than proposing a specific scheduling policy, we formulate the placement problem mathematically and derive a theoretical upper bound, revealing substantial headroom for runtime optimization. To our knowledge, this is the first formal treatment of dynamic KV cache scheduling in heterogeneous memory systems for LLM inference.
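The letter's exact formulation is not reproduced here, but a rough two-tier model of the kind involved looks as follows, with S the KV bytes read per decode step, B_H and B_D the HBM and off-package DRAM bandwidths, C_H the HBM capacity, and x the fraction of the KV cache placed in HBM (all notation assumed for illustration):

```latex
% Illustrative two-tier placement model (not the paper's exact derivation).
% Both tiers are read in parallel, so the step time is the slower tier:
\[
  t(x) \;=\; \max\!\left(\frac{xS}{B_H},\; \frac{(1-x)S}{B_D}\right),
  \qquad 0 \le x \le \min\!\left(1,\; \frac{C_H}{S}\right)
\]
% Balancing the two arms gives the bandwidth-proportional split, capped
% by HBM capacity:
\[
  x^{\star} \;=\; \min\!\left(\frac{B_H}{B_H + B_D},\; \frac{C_H}{S}\right),
  \qquad
  t(x^{\star}) \;=\; \frac{S}{B_H + B_D}
  \quad \text{whenever } \frac{B_H}{B_H + B_D}\,S \le C_H .
\]
% The aggregated-bandwidth upper bound B_H + B_D is therefore attainable
% only while HBM can hold its bandwidth-proportional share of the KV cache.
```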
Citations: 0
Thread-Adaptive: High-Throughput Parallel Architectures of SLH-DSA on GPUs
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-10-16 · DOI: 10.1109/LCA.2025.3622588
Jiahao Xiang;Lang Li
The emergence of quantum computing threatens classical cryptographic systems, necessitating efficient architectural designs for post-quantum algorithms. This paper presents a novel architectural approach for implementing the FIPS 205 Stateless Hash-based Digital Signature Algorithm (SLH-DSA) on GPUs through execution model optimizations that maximize hardware utilization. We introduce a two-tier architectural framework: first, an Adaptive Thread Allocation mechanism that dynamically configures thread-level parallelism based on empirical performance modeling, optimizing the mapping between cryptographic workloads and GPU execution resources. Second, our Function-Level Parallelism design decomposes cryptographic components into fine-grained computational units with optimized memory access patterns and execution flows that better utilize the SIMT architecture of modern GPUs. Performance evaluation on an NVIDIA RTX 4090 demonstrates that our architectural design achieves 62,239 signatures per second for the SHA2-128f parameter set, representing a 1.16× improvement over prior implementations. Architectural analysis reveals that this throughput enhancement stems primarily from optimized thread-memory interactions and reduced resource contention in the GPU’s execution units.
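The adaptive thread allocation mechanism can be pictured as a configuration search over an empirical throughput model. The sketch below is a stand-in with invented constants, not the paper's measured model or RTX 4090 data; it only shows the shape of the trade-off between per-signature parallelism and occupancy.

```python
def throughput_model(tpb, sigs_per_block, sm_count=128, max_warps_per_sm=48,
                     cycles_per_sig=2.0e6, clock_hz=2.2e9):
    """Toy throughput model: signatures/sec for a (threads-per-block,
    signatures-per-block) configuration. All constants are illustrative
    placeholders, not measurements."""
    warps_per_block = max(1, tpb // 32)
    blocks_per_sm = max(1, max_warps_per_sm // warps_per_block)   # occupancy limit
    concurrent_sigs = sm_count * blocks_per_sm * sigs_per_block
    # More threads per signature shortens its latency, with diminishing returns.
    threads_per_sig = tpb / sigs_per_block
    sig_latency_s = cycles_per_sig / (clock_hz * min(threads_per_sig, 32) ** 0.8)
    return concurrent_sigs / sig_latency_s

# Adaptive allocation = pick the configuration the model (or a benchmark) favors.
configs = [(tpb, spb) for tpb in (128, 256, 512) for spb in (1, 2, 4, 8)]
best = max(configs, key=lambda c: throughput_model(*c))
```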
Citations: 0
Low-Latency PIM Accelerator for Edge LLM Inference
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-10-07 · DOI: 10.1109/LCA.2025.3618104
Xinyu Wang;Xiaotian Sun;Wanqian Li;Feng Min;Xiaoyu Zhang;Xinjiang Zhang;Yinhe Han;Xiaoming Chen
Deploying large language models (LLMs) on edge devices has the potential for low-latency inference and privacy protection. However, meeting the substantial bandwidth demands of latency-oriented edge devices is challenging due to their strict power constraints. Resistive random-access memory (RRAM)-based processing-in-memory (PIM) is an ideal solution to this challenge, thanks to its low read power and high internal bandwidth. Moreover, applying quantization methods, which require different precisions for weights and activations, is common practice in edge inference. However, existing accelerators cannot fully leverage the benefits of quantization, as they lack multiply-accumulate (MAC) units optimized for mixed-precision operands. To achieve low-latency edge inference, we design an RRAM-based PIM die that integrates dedicated energy-efficient MAC units, providing both computation and storage capabilities. Coupled with a dynamic random-access memory (DRAM) die for storing the key-value (KV) cache, we propose Lyla, an accelerator for low-latency edge LLM inference. Experimental results show that Lyla achieves 3.8×, 2.4×, and 1.2× latency improvements over a GPU and two DRAM-based PIM accelerators, respectively.
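One way to picture the mixed-precision MAC path in software, assuming symmetric per-tensor quantization with 4-bit weights and 8-bit activations (a common pairing; the letter's exact formats are not specified here): narrow integer operands multiply, accumulate in int32, and are dequantized once at the end, which is what a dedicated mixed-precision MAC unit supports natively.

```python
import numpy as np

def mac_w4a8(weights_q, acts_q, w_scale, a_scale):
    """Mixed-precision MAC sketch: 4-bit weights x 8-bit activations,
    accumulated in int32, rescaled once. The symmetric per-tensor
    quantization scheme is an illustrative assumption."""
    acc = np.dot(weights_q.astype(np.int32), acts_q.astype(np.int32))
    return acc * (w_scale * a_scale)       # dequantize the int32 accumulator

rng = np.random.default_rng(0)
w = rng.integers(-8, 8, size=4096, dtype=np.int8)       # int4 range [-8, 7]
a = rng.integers(-128, 128, size=4096, dtype=np.int8)   # int8 activations
y = mac_w4a8(w, a, w_scale=0.02, a_scale=0.05)
```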
Citations: 0
Context-Aware Set Dueling for Dynamic Policy Arbitration
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-10-06 · DOI: 10.1109/LCA.2025.3617159
Diamantis Patsidis;Georgios Vavouliotis
Set Dueling (SD) is an effective and widely adopted arbitration mechanism, but it is limited by its single-counter decision logic, which relies solely on aggregate hit and miss counts of Leader Sets and ignores richer contextual information (e.g., control flow, the sequence of memory accesses). This observation motivates our proposal, Context-Aware Set Dueling (CASD), which extends the original context-oblivious SD by incorporating contextual information into the decision logic, yielding a framework for designing runtime arbitration mechanisms that select between competing policies. As a prototype of the CASD framework, we design DuelCeptron, a microarchitectural prediction scheme that replaces SD's single-counter decision logic with hashed perceptrons to make more informed and accurate arbitration decisions. To showcase the benefits of DuelCeptron, we apply it in a case study, showing that it significantly outperforms SD across a diverse set of 145 workloads. DuelCeptron is one instantiation of CASD; the broader objective of this work is to advance SD into a general-purpose, context-aware arbitration mechanism applicable across different microarchitectural domains.
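A hashed-perceptron decision scheme of the kind DuelCeptron substitutes for SD's single counter can be sketched as follows; the feature choice, table geometry, and training threshold are assumptions rather than the paper's configuration.

```python
class HashedPerceptronArbiter:
    """Sketch of perceptron-based policy arbitration: each context
    feature (e.g., PC, memory region) hashes into its own small weight
    table; the sign of the summed weights selects policy A or B."""
    def __init__(self, n_tables=3, table_size=256, threshold=4):
        self.tables = [[0] * table_size for _ in range(n_tables)]
        self.table_size = table_size
        self.threshold = threshold

    def _indices(self, features):
        return [hash((i, f)) % self.table_size for i, f in enumerate(features)]

    def predict(self, features):
        s = sum(t[i] for t, i in zip(self.tables, self._indices(features)))
        return ("policy_A" if s >= 0 else "policy_B"), s

    def train(self, features, policy_a_was_better, s):
        # Perceptron rule: update only on mispredictions or weak confidence.
        target = 1 if policy_a_was_better else -1
        if (s >= 0) != policy_a_was_better or abs(s) <= self.threshold:
            for t, i in zip(self.tables, self._indices(features)):
                t[i] = max(-32, min(31, t[i] + target))  # saturating weights

arb = HashedPerceptronArbiter()
feats = (0x401a2c, 0x7f00)          # e.g., instruction PC, memory region tag
choice, s = arb.predict(feats)
arb.train(feats, policy_a_was_better=True, s=s)
```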
Citations: 0
Efficient Deadlock Avoidance by Considering Stalling, Message Dependencies, and Topology
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-10-06 · DOI: 10.1109/LCA.2025.3618627
Sanya Srivastava;Fletch Rydell;Andrés Goens;Vijay Nagarajan;Daniel J. Sorin
Traditional schemes for avoiding deadlocks compose techniques for both protocol deadlocks (virtual networks) and network deadlocks (virtual channels). Recent work has shown how to use fewer virtual networks by analyzing protocol stalls instead of just considering the longest chain of causally dependent messages. We identify a shortcoming in this work, which can lead to deadlocks, and show that combining stall analysis with analyses of message dependencies and topology can avoid deadlocks while using fewer buffers than the conventional approach.
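A toy rendering of the counting argument (not the letter's algorithm): size the virtual-network requirement by the longest chain of message classes whose handlers can actually stall, rather than by the longest causal chain outright. The dependency graph and stall flags below are invented for illustration.

```python
def min_virtual_networks(deps, can_stall):
    """Count VNs needed under stall-aware analysis. `deps` maps each
    message class to the classes its handler can generate; `can_stall`
    marks classes whose handlers may block. Non-stalling classes are
    always sink-able and need no network of their own."""
    def chain_len(m, seen):
        assert m not in seen, "cyclic dependency: no acyclic VN assignment"
        here = 1 if can_stall[m] else 0
        return here + max((chain_len(n, seen | {m}) for n in deps[m]), default=0)
    roots = set(deps) - {n for ns in deps.values() for n in ns}
    return max(chain_len(r, frozenset()) for r in roots)

# Three-hop protocol Request -> Forward -> Response, where responses are
# always sunk without stalling: 2 virtual networks suffice instead of 3.
deps = {"Req": ["Fwd"], "Fwd": ["Resp"], "Resp": []}
can_stall = {"Req": True, "Fwd": True, "Resp": False}
print(min_virtual_networks(deps, can_stall))  # 2
```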
Citations: 0
Breaking the HBM Bit Cost Barrier: Domain-Specific ECC for AI Inference Infrastructure
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-10-01 · DOI: 10.1109/LCA.2025.3616810
Rui Xie;Asad Ul Haq;Yunhua Fang;Linsen Ma;Sanchari Sen;Swagath Venkataramani;Liu Liu;Tong Zhang
High-Bandwidth Memory (HBM) delivers exceptional bandwidth and energy efficiency for AI workloads, but its high cost per bit, driven in part by stringent on-die reliability requirements, poses a growing barrier to scalable deployment. This work explores a system-level approach to cost reduction by eliminating on-die ECC and shifting all fault management to the memory controller. We introduce a domain-specific ECC framework combining large-codeword Reed–Solomon (RS) correction with lightweight fine-grained CRC detection, differential parity updates to mitigate write amplification, and tunable protection based on data importance. Our evaluation using LLM inference workloads shows that, even under raw HBM bit error rates up to $10^{-3}$, the system retains 78% of throughput while maintaining at least 97% PIQA accuracy and 94% MMLU accuracy relative to error-free HBM. By treating reliability as a tunable system parameter rather than a fixed hardware constraint, our design opens a new path toward low-cost, high-performance HBM deployment in AI infrastructure.
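The differential parity update is the easiest piece to make concrete: for any linear code, the new parity follows from the old parity plus the delta of the single symbol being written, so a small store avoids re-reading the whole large codeword. The sketch below uses plain XOR parity for brevity; RS parity applies a per-position Galois-field coefficient to the same delta.

```python
import numpy as np

def write_with_differential_parity(data, parity, idx, new_symbol):
    """Differential parity update: parity_new = parity_old XOR old XOR new.
    Only the written symbol and the parity are touched, not the whole
    codeword. XOR parity stands in for the RS parity of the letter."""
    delta = data[idx] ^ new_symbol
    data[idx] = new_symbol
    return parity ^ delta

codeword = np.random.randint(0, 256, size=64, dtype=np.uint8)  # 64 data symbols
parity = np.bitwise_xor.reduce(codeword)
parity = write_with_differential_parity(codeword, parity, idx=7,
                                         new_symbol=np.uint8(0xA5))
assert parity == np.bitwise_xor.reduce(codeword)  # parity stays consistent
```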
Citations: 0
Revisiting Virtual Memory Support for Confidential Computing Environments
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-09-22 · DOI: 10.1109/LCA.2025.3612852
Haoyu Wang;Noa Zilberman;Ahmad Atamli;Amro Awad
Confidential computing is increasingly becoming a cornerstone for securely utilizing remote services and building trustworthy cloud infrastructure. It builds on a hardware-anchored root of trust that can attest the identity and authenticity of the remote machine, its configuration, and the running software stack in an unforgeable way. In addition to the hardware-rooted verifiable attestation mechanism, confidential computing depends on strict run-time isolation of confidential tasks' data and code from each other and from all other tasks, including privileged ones. Such isolation is enforced via access control on-chip and cryptographically once data moves off-chip. Despite the wide support for confidential computing in most modern processors, e.g., AMD SEV-SNP and Arm CCA, there has been minimal discussion of the effect of such support on the performance of conventional on-chip access control. Thus, in this paper we highlight the key changes in virtual memory support required for access control in confidential computing environments and quantify their overheads. We propose an optimized design that improves performance by caching confidential computing access control metadata effectively. Two design options are proposed to balance hardware overhead and performance. We evaluate two configurations with different TLB entry coverage, which mirror Arm CCA GPC and AMD RMP, respectively. Our design improves performance by 12% over the baseline access control design and 6% over the state of the art.
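The metadata-caching idea can be sketched as a small LRU structure in front of an in-memory ownership table, in the spirit of Arm CCA's granule protection table and AMD's RMP; the geometry and the ownership check below are illustrative assumptions.

```python
from collections import OrderedDict

class MetadataCache:
    """Sketch of caching per-page ownership metadata: every physical
    access must be checked against a security attribute (owning world
    or VM); a small dedicated cache avoids a memory walk per check."""
    def __init__(self, entries=64):
        self.cache = OrderedDict()           # page frame number -> attribute
        self.entries = entries
        self.hits = self.misses = 0

    def check(self, pfn, requester, table):
        attr = self.cache.get(pfn)
        if attr is None:
            self.misses += 1
            attr = table[pfn]                # walk the in-memory metadata table
            if len(self.cache) >= self.entries:
                self.cache.popitem(last=False)   # evict LRU entry
            self.cache[pfn] = attr
        else:
            self.hits += 1
            self.cache.move_to_end(pfn)
        return attr == requester             # allow access only to the owner

table = {pfn: ("realm" if pfn % 3 else "normal") for pfn in range(1024)}
mc = MetadataCache()
allowed = [mc.check(pfn, "realm", table) for pfn in [5, 5, 6, 5, 9]]
```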
Citations: 0