Pub Date: 2025-10-01 | DOI: 10.1109/LCA.2025.3616810
Rui Xie;Asad Ul Haq;Yunhua Fang;Linsen Ma;Sanchari Sen;Swagath Venkataramani;Liu Liu;Tong Zhang
High-Bandwidth Memory (HBM) delivers exceptional bandwidth and energy efficiency for AI workloads, but its high cost per bit, driven in part by stringent on-die reliability requirements, poses a growing barrier to scalable deployment. This work explores a system-level approach to cost reduction by eliminating on-die ECC and shifting all fault management to the memory controller. We introduce a domain-specific ECC framework combining large-codeword Reed–Solomon (RS) correction with lightweight fine-grained CRC detection, differential parity updates to mitigate write amplification, and tunable protection based on data importance. Our evaluation using LLM inference workloads shows that, even under raw HBM bit error rates up to $10^{-3}$, the system retains 78% of throughput while maintaining at least 97% PIQA accuracy and 94% MMLU accuracy relative to error-free HBM. By treating reliability as a tunable system parameter rather than a fixed hardware constraint, our design opens a new path toward low-cost, high-performance HBM deployment in AI infrastructure.
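As a rough illustration of the differential parity update (a sketch under the assumption that the code is linear, which holds for Reed–Solomon over GF(2^m)): parity can be refreshed from the XOR delta between the old and new data block, so a write does not need to re-read the rest of the large codeword. Plain XOR parity stands in for the RS encoder here, and a toy CRC-8 stands in for the fine-grained detection code; block sizes and names are illustrative, not the paper's parameters.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Toy codeword: 4 data blocks of 32 bytes plus one XOR parity block.
// XOR parity stands in for a linear RS encoder: the parity delta depends
// only on (old_block XOR new_block), so a write touches just the modified
// block and the parity instead of re-reading the whole codeword.
constexpr std::size_t kBlockBytes = 32;
constexpr std::size_t kDataBlocks = 4;
using Block = std::array<std::uint8_t, kBlockBytes>;

// Fine-grained per-block detection code (toy CRC-8, polynomial 0x07).
std::uint8_t crc8(const Block& b) {
  std::uint8_t crc = 0;
  for (std::uint8_t byte : b) {
    crc ^= byte;
    for (int i = 0; i < 8; ++i)
      crc = (crc & 0x80) ? static_cast<std::uint8_t>((crc << 1) ^ 0x07)
                         : static_cast<std::uint8_t>(crc << 1);
  }
  return crc;
}

// Differential parity update: fold the data delta into the stored parity.
void differential_parity_update(Block& parity, const Block& old_blk,
                                const Block& new_blk) {
  for (std::size_t i = 0; i < kBlockBytes; ++i)
    parity[i] ^= static_cast<std::uint8_t>(old_blk[i] ^ new_blk[i]);
}

int main() {
  std::array<Block, kDataBlocks> data{};  // all-zero data
  Block parity{};                         // matching all-zero parity

  Block updated = data[2];
  updated[5] = 0xA5;                      // small write to one block
  differential_parity_update(parity, data[2], updated);
  data[2] = updated;

  // Check: parity recomputed from scratch matches the incremental result.
  Block scratch{};
  for (const Block& b : data)
    for (std::size_t i = 0; i < kBlockBytes; ++i) scratch[i] ^= b[i];
  std::printf("parity consistent: %d, CRC(block 2) = 0x%02X\n",
              scratch == parity ? 1 : 0, crc8(data[2]));
  return 0;
}
```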
{"title":"Breaking the HBM Bit Cost Barrier: Domain-Specific ECC for AI Inference Infrastructure","authors":"Rui Xie;Asad Ul Haq;Yunhua Fang;Linsen Ma;Sanchari Sen;Swagath Venkataramani;Liu Liu;Tong Zhang","doi":"10.1109/LCA.2025.3616810","DOIUrl":"https://doi.org/10.1109/LCA.2025.3616810","url":null,"abstract":"High-Bandwidth Memory (HBM) delivers exceptional bandwidth and energy efficiency for AI workloads, but its high cost per bit, driven in part by stringent on-die reliability requirements, poses a growing barrier to scalable deployment. This work explores a system-level approach to cost reduction by eliminating on-die ECC and shifting all fault management to the memory controller. We introduce a domain-specific ECC framework combining large-codeword Reed–Solomon (RS) correction with lightweight fine-grained CRC detection, differential parity updates to mitigate write amplification, and tunable protection based on data importance. Our evaluation using LLM inference workloads shows that, even under raw HBM bit error rates up to <inline-formula><tex-math>$10^{-3}$</tex-math></inline-formula>, the system retains 78% of throughput while maintaining at least 97% PIQA accuracy and 94% MMLU accuracy relative to error-free HBM. By treating reliability as a tunable system parameter rather than a fixed hardware constraint, our design opens a new path toward low-cost, high-performance HBM deployment in AI infrastructure.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"313-316"},"PeriodicalIF":1.4,"publicationDate":"2025-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352136","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-22 | DOI: 10.1109/LCA.2025.3612852
Haoyu Wang;Noa Zilberman;Ahmad Atamli;Amro Awad
Confidential computing is increasingly becoming a cornerstone for securely using remote services and building trustworthy cloud infrastructure. It builds on a hardware-anchored root of trust that can attest, in an unforgeable way, the identity and authenticity of the remote machine, its configuration, and the running software stack. In addition to this hardware-rooted attestation mechanism, confidential computing depends on strict run-time isolation of confidential tasks' data and code from one another and from all other tasks, including privileged ones. Such isolation is enforced via on-chip access control and cryptographically once data moves off-chip. Despite the wide support for confidential computing in most modern processors, e.g., AMD SEV-SNP and Arm CCA, there is minimal discussion of the effect of such support on the performance of conventional on-chip access control. In this paper we therefore highlight the key changes in virtual memory support required for access control in confidential computing environments and quantify their overheads. We propose an optimized design that improves performance by effectively caching confidential computing access control metadata. Two design options are proposed to balance hardware overhead and performance. We evaluate two configurations with different TLB entry coverage, mirroring Arm CCA GPC and AMD RMP, respectively. Our design improves performance by 12% over the baseline access control design and by 6% over the state of the art.
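To make the access-control cost concrete, here is a behavioral sketch, loosely in the spirit of an ownership table walked on translation (as with Arm's Granule Protection Table or AMD's RMP, though not their actual formats): every access must be checked against per-page ownership metadata, and caching that metadata alongside the TLB is what saves the extra walk. All names, the table layout, and the policy below are hypothetical.

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

enum class Owner : std::uint8_t { Hypervisor, SecureVM, NonSecure };

// Hypothetical flat ownership table indexed by physical frame number (PFN).
// A real GPT/RMP holds richer state; one owner tag is enough for the sketch.
std::vector<Owner> g_ownership(1u << 20, Owner::NonSecure);

// Cache of per-page ownership metadata filled alongside the TLB, so a hit
// avoids the extra table walk the access-control check would otherwise add.
std::unordered_map<std::uint64_t, Owner> g_meta_cache;
std::uint64_t g_extra_walks = 0;

Owner lookup_owner(std::uint64_t pfn) {
  auto it = g_meta_cache.find(pfn);
  if (it != g_meta_cache.end()) return it->second;  // metadata cache hit
  ++g_extra_walks;                                  // miss: walk the table
  Owner owner = g_ownership[pfn];
  g_meta_cache.emplace(pfn, owner);
  return owner;
}

// Toy policy: a page is usable by its owner, or by anyone if non-secure.
bool access_allowed(std::uint64_t pfn, Owner requester) {
  Owner owner = lookup_owner(pfn);
  return owner == requester || owner == Owner::NonSecure;
}

int main() {
  g_ownership[42] = Owner::SecureVM;                // assign page 42 to a VM
  std::printf("VM ok: %d, hypervisor ok: %d\n",
              access_allowed(42, Owner::SecureVM) ? 1 : 0,
              access_allowed(42, Owner::Hypervisor) ? 1 : 0);
  std::printf("extra table walks: %llu\n",
              static_cast<unsigned long long>(g_extra_walks));
  return 0;
}
```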
{"title":"Revisiting Virtual Memory Support for Confidential Computing Environments","authors":"Haoyu Wang;Noa Zilberman;Ahmad Atamli;Amro Awad","doi":"10.1109/LCA.2025.3612852","DOIUrl":"https://doi.org/10.1109/LCA.2025.3612852","url":null,"abstract":"Confidential computing is increasingly becoming a cornerstone for securely utilizing remote services and building trustworthy cloud infrastructure. Confidential computing builds on hardware-anchored root-of-trust that can attest the identity and authenticity of the remote machine, the configuration, and the running software stack, in an unforgeable way. In addition to the hardware-rooted verifiable attestation mechanism, confidential computing depends on strict run-time isolation of confidential computing tasks’ data and code from each other and the other tasks, including privileged ones. Such isolation is achieved via on-chip access control and cryptographically once off-chip. Despite the wide support of confidential computing in most modern processors, e.g., AMD SEV-SNP and ARM CCA, there is minimal discussion of the effect of such support on the performance of conventional on-chip access control. Thus, in this paper we highlight the key changes in virtual memory support required for access control in confidential computing environments, and quantify their overheads. We propose an optimized design that enables improved performance by caching confidential computing access control metadata effectively. Two design options are proposed to balance hardware overhead and performance. We evaluate two configurations with different TLB entry coverage, which mirror Arm CCA GPC and AMD RMP, respectively. Our design improves performance by 12% over the baseline access control design and 6% over the state-of-the-art.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"317-320"},"PeriodicalIF":1.4,"publicationDate":"2025-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352179","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-17 | DOI: 10.1109/LCA.2025.3611326
Allen Aboytes;Pankaj Mehra
Memory-intensive application working sets continue to grow and demand more memory. Far memory technologies such as CXL potentially solve the memory capacity bottleneck. However, efficient use of far memory requires careful data placement among memory tiers. Recent work uses page-based memory tiering systems to expand the memory available to applications. Unfortunately, most state-of-the-art memory tiering systems largely ignore memory allocation and prioritize placing pages in the fast tier while space remains available. Relying on transparent methods for memory allocation can lead to suboptimal data placement, resulting in more data migration. To address these issues, we propose to place data using application semantics to increase the locality of reference within pages. We present M2T, a system that optimizes the layout of application memory allocations by grouping semantically related memory objects with a custom memory allocator and migrates pages between local and far memory. Our evaluation demonstrates that semantic data placement achieves 3.39–4.69× higher throughput than a key-value store that uses a standard memory allocator on top of various state-of-the-art memory tiering systems.
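A minimal sketch of the grouping idea, assuming the application passes a semantic tag at allocation time (the letter's allocator API is not reproduced here): allocations sharing a tag are bump-allocated from the same arena pages, so related objects stay co-located and can be migrated between tiers as a unit. The class and tag names are illustrative.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// One bump-pointer arena per semantic tag: objects the application says belong
// together land on the same pages, so a page migrates as a coherent unit.
class SemanticArena {
 public:
  void* allocate(const std::string& tag, std::size_t bytes) {
    bytes = (bytes + 15) & ~static_cast<std::size_t>(15);  // 16 B alignment
    Arena& a = arenas_[tag];
    if (a.pages.empty() || a.offset + bytes > kPageSize) {
      a.pages.emplace_back(kPageSize);                     // start a new page
      a.offset = 0;
    }
    void* p = a.pages.back().data() + a.offset;
    a.offset += bytes;
    return p;
  }

  std::size_t pages_for(const std::string& tag) const {
    auto it = arenas_.find(tag);
    return it == arenas_.end() ? 0 : it->second.pages.size();
  }

 private:
  static constexpr std::size_t kPageSize = 4096;
  struct Arena {
    std::vector<std::vector<std::uint8_t>> pages;
    std::size_t offset = 0;
  };
  std::map<std::string, Arena> arenas_;
};

int main() {
  SemanticArena arena;
  for (int i = 0; i < 100; ++i) arena.allocate("kv_index", 64);    // hot metadata
  for (int i = 0; i < 100; ++i) arena.allocate("kv_values", 512);  // colder payload
  std::printf("index pages: %zu, value pages: %zu\n",
              arena.pages_for("kv_index"), arena.pages_for("kv_values"));
  return 0;
}
```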
{"title":"Improving Performance on Tiered Memory With Semantic Data Placement","authors":"Allen Aboytes;Pankaj Mehra","doi":"10.1109/LCA.2025.3611326","DOIUrl":"https://doi.org/10.1109/LCA.2025.3611326","url":null,"abstract":"Memory-intensive application working sets continue to grow and demand more memory. Far memory technologies such as CXL potentially solve the memory capacity bottleneck. However, efficient use of far memory requires careful data placement among memory tiers. Recent work uses page-based memory tiering systems to expand the memory available to applications. Unfortunately, most state-of-the-art memory tiering systems largely ignore memory allocation and prioritize placing pages in the fast tier while space remains available. Relying on transparent methods for memory allocation can lead to suboptimal data placement, resulting in more data migration. To address these issues, we propose to place data using application semantics to increase the locality of reference within pages. We present M2T, a system that optimizes the layout of application memory allocations by grouping semantically related memory objects with a custom memory allocator and migrates pages between local and far memory. Our evaluation demonstrates that semantic data placement achieves 3.39–4.69× higher throughput than a key-value store that uses a standard memory allocator on top of various state-of-the-art memory tiering systems.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"297-300"},"PeriodicalIF":1.4,"publicationDate":"2025-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352150","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-11 | DOI: 10.1109/LCA.2025.3609283
Gyeongrok Yang;Jaeha Min;In-Jun Jung;Joo-Young Kim
Mamba is based on a state space model (SSM) to address limitations of attention-based large language models (LLMs) associated with long-context processing. While Mamba achieves accuracy comparable to attention-based LLMs, it introduces recurrent computation that limits efficiency during the prefill phase of inference. To mitigate this, Mamba-2 introduces the state space duality (SSD), which increases parallelism during multi-token processing. However, its workload characteristics remain unexamined from a systems and architectural perspective. This work presents a system-level analysis of SSD in Mamba-2, characterizing its compute and memory behavior on modern hardware. Our findings reveal the computational characteristics of SSD and provide the first architectural insight into its execution. In addition, we identify performance bottlenecks and propose directions for addressing them in future work.
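For readers unfamiliar with the duality being analyzed, the generic structured-SSM recurrence and its unrolled matrix form are shown below; SSD's insight is that the unrolled form is a (semiseparable) lower-triangular matrix acting on a whole chunk of tokens, which is what admits matrix-multiply-style parallelism during multi-token prefill. This is the standard textbook formulation, not the paper's exact notation:

$$h_t = A_t\, h_{t-1} + B_t\, x_t, \qquad y_t = C_t^{\top} h_t$$

$$y_t = \sum_{s=1}^{t} C_t^{\top}\Bigl(\textstyle\prod_{k=s+1}^{t} A_k\Bigr) B_s\, x_s \;\;\Longleftrightarrow\;\; y = M x, \quad M_{ts} = C_t^{\top}\Bigl(\textstyle\prod_{k=s+1}^{t} A_k\Bigr) B_s \ \text{ for } t \ge s.$$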
{"title":"A Quantitative Analysis of Mamba-2-Based Large Language Model: Study of State Space Duality","authors":"Gyeongrok Yang;Jaeha Min;In-Jun Jung;Joo-Young Kim","doi":"10.1109/LCA.2025.3609283","DOIUrl":"https://doi.org/10.1109/LCA.2025.3609283","url":null,"abstract":"Mamba is based on a <italic>state space model (SSM)</i> to address limitations of attention-based large language models (LLMs) associated with long-context processing. While Mamba achieves accuracy comparable to attention-based LLMs, it introduces recurrent computation that limits efficiency during the prefill phase of inference. To mitigate this, Mamba-2 introduces the <italic>state space duality (SSD)</i>, which increases parallelism during multi-token processing. However, its workload characteristics remain unexamined from a systems and architectural perspective. This work presents a system-level analysis of SSD in Mamba-2, characterizing its compute and memory behavior on modern hardware. Our findings reveal the computational characteristics of SSD and provide the first architectural insight into its execution. In addition, we identify performance bottlenecks and propose directions for addressing them in future work.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"309-312"},"PeriodicalIF":1.4,"publicationDate":"2025-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-19 | DOI: 10.1109/LCA.2025.3600588
Deyuan Guo;Mohammadhosein Gholamrezaei;Matthew Hofmann;Ashish Venkat;Zhiru Zhang;Kevin Skadron
Bit-serial processing-in-memory (PIM) architectures have been extensively studied, yet a standardized tool for generating efficient bit-serial code is lacking, hindering fair comparisons. We present a fully automated compiler framework, PIMsynth, for bit-serial PIM architectures, targeting both digital and analog substrates. The compiler takes Verilog as input and generates optimized micro-operation code for programmable bit-serial PIM backends. Our flow integrates logic synthesis, optimization steps, instruction scheduling, and backend code generation into a unified toolchain. With the compiler, we provide a bit-serial compilation benchmark suite designed for efficient bit-serial code generation. To enable correctness and performance validation, we extend an existing PIM simulator to support compiler-generated micro-op-level workloads. Preliminary results demonstrate that the compiler generates competitive bit-serial code within $1.08\times$ and $1.54\times$ of hand-optimized digital and analog PIM baselines.
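As a concrete picture of what bit-serial code looks like, the sketch below adds 4-bit elements stored bit-sliced, where each bitwise operation touches one bit-plane of all elements in a row at once (a 64-bit word models a 64-element row). It illustrates the general bit-serial execution model only; PIMsynth's actual micro-op ISA and scheduling are not reproduced here.

```cpp
#include <cstdint>
#include <cstdio>

// 64 four-bit elements stored bit-sliced: bits[i] holds bit i of all 64 lanes.
// A bit-serial PIM device applies each bitwise micro-op to an entire row at
// once, so the ripple-carry add below costs O(bit-width) row operations no
// matter how many elements share the row.
constexpr int kBits = 4;
struct BitSliced { std::uint64_t bits[kBits] = {}; };

BitSliced add(const BitSliced& a, const BitSliced& b) {
  BitSliced sum;
  std::uint64_t carry = 0;
  for (int i = 0; i < kBits; ++i) {                    // one full adder per plane
    std::uint64_t axb = a.bits[i] ^ b.bits[i];
    sum.bits[i] = axb ^ carry;                         // micro-op: XOR
    carry = (a.bits[i] & b.bits[i]) | (axb & carry);   // micro-op: majority
  }
  return sum;                                          // result is mod 2^kBits
}

// Helpers to move one scalar in and out of the bit-sliced layout.
void store(BitSliced& v, int lane, unsigned x) {
  for (int i = 0; i < kBits; ++i)
    v.bits[i] = (v.bits[i] & ~(1ULL << lane)) |
                (static_cast<std::uint64_t>((x >> i) & 1u) << lane);
}
unsigned load(const BitSliced& v, int lane) {
  unsigned x = 0;
  for (int i = 0; i < kBits; ++i)
    x |= static_cast<unsigned>((v.bits[i] >> lane) & 1u) << i;
  return x;
}

int main() {
  BitSliced a, b;
  store(a, 7, 5);
  store(b, 7, 9);
  std::printf("lane 7: %u\n", load(add(a, b), 7));     // prints 14
  return 0;
}
```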
{"title":"PIMsynth: A Unified Compiler Framework for Bit-Serial Processing-in-Memory Architectures","authors":"Deyuan Guo;Mohammadhosein Gholamrezaei;Matthew Hofmann;Ashish Venkat;Zhiru Zhang;Kevin Skadron","doi":"10.1109/LCA.2025.3600588","DOIUrl":"https://doi.org/10.1109/LCA.2025.3600588","url":null,"abstract":"Bit-serial processing-in-memory (PIM) architectures have been extensively studied, yet a standardized tool for generating efficient bit-serial code is lacking, hindering fair comparisons. We present a fully automated compiler framework, PIMsynth, for bit-serial PIM architectures, targeting both digital and analog substrates. The compiler takes Verilog as input and generates optimized micro-operation code for programmable bit-serial PIM backends. Our flow integrates logic synthesis, optimization steps, instruction scheduling, and backend code generation into a unified toolchain. With the compiler, we provide a bit-serial compilation benchmark suite designed for efficient bit-serial code generation. To enable correctness and performance validation, we extend an existing PIM simulator to support compiler-generated micro-op-level workloads. Preliminary results demonstrate that the compiler generates competitive bit-serial code within <inline-formula><tex-math>$1.08times$</tex-math></inline-formula> and <inline-formula><tex-math>$1.54times$</tex-math></inline-formula> of hand-optimized digital and analog PIM baselines.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"277-280"},"PeriodicalIF":1.4,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144926886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-11 | DOI: 10.1109/LCA.2025.3597323
KyungSoo Kim;Omin Kwon;Yeonhong Park;Jae W. Lee
Disaggregating the prefill and decode phases has recently emerged as a promising strategy in large language model (LLM) serving systems, driven by the distinct resource demands of each phase. Inspired by this coarse-grained disaggregation, we identify a similar opportunity within the decode phase itself: the feedforward network (FFN) is compute-intensive, whereas attention is constrained by memory bandwidth and capacity due to its key-value (KV) cache. To exploit this heterogeneity, we introduce AiDE, a heterogeneous decoding cluster that executes FFN operations on GPUs while offloading attention computations to Compute Express Link-based Processing Near Memory (CXL-PNM) devices. CXL-PNM provides scalable memory bandwidth and capacity, making it well-suited for attention-heavy workloads. In addition, we propose a batch-level pipelining approach enhanced with request scheduling to optimize the utilization of heterogeneous resources. Our AiDE architecture achieves up to 3.87× higher throughput, 2.72× lower p90 time per output token (TPOT), and a 2.31× reduction in decode latency compared to a GPU-only baseline, at comparable cost, demonstrating significant potential of fine-grained disaggregation for cost-effective LLM inference.
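A toy view of the batch-level pipelining, assuming just two stages and a handful of micro-batches: while the CXL-PNM side computes attention for micro-batch t, the GPU runs the FFN for micro-batch t-1, so both resources stay busy in steady state. The device roles follow the abstract; the step granularity and names are placeholders.

```cpp
#include <cstdio>
#include <string>

// Steady-state schedule for two-stage, batch-level pipelining: attention
// (KV-cache bound) runs on the CXL-PNM stage, the FFN (compute bound) on the
// GPU stage, and micro-batch t overlaps with micro-batch t-1.
int main() {
  const int kMicroBatches = 4;
  for (int t = 0; t <= kMicroBatches; ++t) {
    std::string pnm = (t < kMicroBatches)
                          ? "attention(mb" + std::to_string(t) + ")"
                          : "idle";
    std::string gpu = (t > 0)
                          ? "ffn(mb" + std::to_string(t - 1) + ")"
                          : "idle";
    std::printf("step %d: PNM=%-16s GPU=%s\n", t, pnm.c_str(), gpu.c_str());
  }
  return 0;
}
```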
{"title":"AiDE: Attention-FFN Disaggregated Execution for Cost-Effective LLM Decoding on CXL-PNM","authors":"KyungSoo Kim;Omin Kwon;Yeonhong Park;Jae W. Lee","doi":"10.1109/LCA.2025.3597323","DOIUrl":"https://doi.org/10.1109/LCA.2025.3597323","url":null,"abstract":"Disaggregating the prefill and decode phases has recently emerged as a promising strategy in the large language model (LLM) serving systems, driven by the distinct resource demands of each phase. Inspired by this coarse-grained disaggregation, we identify a similar opportunity within the decode phase itself: the feedforward network (FFN) is compute-intensive, whereas attention is constrained by memory bandwidth and capacity due to its key-value (KV) cache. To exploit this heterogeneity, we introduce AiDE, a heterogeneous decoding cluster that executes FFN operations on GPUs while offloading attention computations to Compute Express Link-based Processing Near Memory (CXL-PNM) devices. CXL-PNM provides scalable memory bandwidth and capacity, making it well-suited for attention-heavy workloads. In addition, we propose a batch-level pipelining approach enhanced with request scheduling to optimize the utilization of heterogeneous resources. Our AiDE architecture achieves up to 3.87× higher throughput, 2.72× lower p90 time per output token (TPOT), and a 2.31× reduction in decode latency compared to a GPU-only baseline, at comparable cost, demonstrating significant potential of fine-grained disaggregation for cost-effective LLM inference.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"285-288"},"PeriodicalIF":1.4,"publicationDate":"2025-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144998058","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-08 | DOI: 10.1109/LCA.2025.3596970
Minho Kim;Houxiang Ji;Jaeyoung Kang;Hwanjun Lee;Daehoon Kim;Nam Sung Kim
Retrieval-augmented generation (RAG) systems increasingly rely on Approximate Nearest Neighbor Search (ANNS) to efficiently retrieve relevant context from billion-scale vector databases. While IVF-based ANNS frameworks scale well overall, the fine search stage remains a bottleneck due to its compute-intensive GEMV operations, particularly under large query volumes. To address this, we propose CABANA, a cluster-aware query batching for ANNS acceleration mechanism using Intel Advanced Matrix Extensions (AMX) that reformulates these GEMV computations into high-throughput GEMM operations. By aggregating queries targeting the same clusters, CABANA enables batched computation during fine search, significantly improving compute intensity and memory access regularity. Evaluations on billion-scale datasets show that CABANA outperforms traditional SIMD-based implementations, achieving up to $32.6\times$ higher query throughput with minimal overhead, while maintaining high recall rates.
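A small sketch of the cluster-aware batching, assuming the coarse search has already routed each query to its IVF cluster: queries bound for the same cluster are stacked into one matrix and scored with a single GEMM against that cluster's vectors, instead of one GEMV per query. Plain nested loops stand in for the AMX tile kernels, and the data is made up.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <map>
#include <utility>
#include <vector>

using Vec = std::vector<float>;

// score[q][v] = dot(query q, database vector v): one GEMM per cluster instead
// of one GEMV per query; plain loops stand in for AMX tile kernels.
std::vector<Vec> gemm_scores(const std::vector<Vec>& queries,
                             const std::vector<Vec>& cluster) {
  std::vector<Vec> scores(queries.size(), Vec(cluster.size(), 0.0f));
  for (std::size_t q = 0; q < queries.size(); ++q)
    for (std::size_t v = 0; v < cluster.size(); ++v)
      for (std::size_t d = 0; d < queries[q].size(); ++d)
        scores[q][v] += queries[q][d] * cluster[v][d];
  return scores;
}

int main() {
  // Database vectors per IVF cluster (dimension 2 to keep the example tiny).
  std::map<int, std::vector<Vec>> clusters = {{0, {{1, 0}, {0, 1}}},
                                              {1, {{1, 1}, {2, 0}}}};
  // (cluster id, query) pairs as the coarse search would emit them.
  std::vector<std::pair<int, Vec>> routed = {
      {0, {3, 4}}, {1, {1, 2}}, {0, {5, 6}}};

  // Cluster-aware batching: group queries headed to the same cluster.
  std::map<int, std::vector<Vec>> batches;
  for (const auto& [cid, q] : routed) batches[cid].push_back(q);

  // Fine search: one batched GEMM per cluster.
  for (const auto& [cid, qs] : batches) {
    std::vector<Vec> s = gemm_scores(qs, clusters[cid]);
    for (std::size_t q = 0; q < s.size(); ++q)
      std::printf("cluster %d, query %zu -> best score %.1f\n", cid, q,
                  *std::max_element(s[q].begin(), s[q].end()));
  }
  return 0;
}
```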
{"title":"CABANA : Cluster-Aware Query Batching for Accelerating Billion-Scale ANNS With Intel AMX","authors":"Minho Kim;Houxiang Ji;Jaeyoung Kang;Hwanjun Lee;Daehoon Kim;Nam Sung Kim","doi":"10.1109/LCA.2025.3596970","DOIUrl":"https://doi.org/10.1109/LCA.2025.3596970","url":null,"abstract":"Retrieval-augmented generation (RAG) systems increasingly rely on Approximate Nearest Neighbor Search (ANNS) to efficiently retrieve relevant context from billion-scale vector databases. While IVF-based ANNS frameworks scale well overall, the fine search stage remains a bottleneck due to its compute-intensive GEMV operations, particularly under large query volumes. To address this, we propose <monospace>CABANA</monospace>, a <u>c</u>luster-<u>a</u>ware query <u>b</u>atching for <u>AN</u>NS <u>a</u>cceleration mechanism using Intel Advanced Matrix Extensions (AMX) that reformulates these GEMV computations into high-throughput GEMM operations. By aggregating queries targeting the same clusters, <monospace>CABANA</monospace> enables batched computation during fine search, significantly improving compute intensity and memory access regularity. Evaluations on billion-scale datasets show that <monospace>CABANA</monospace> outperforms traditional SIMD-based implementations, achieving up to <inline-formula><tex-math>$32.6times$</tex-math></inline-formula> higher query throughput with minimal overhead, while maintaining high recall rates.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"289-292"},"PeriodicalIF":1.4,"publicationDate":"2025-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11120372","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145027944","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-07 | DOI: 10.1109/LCA.2025.3596616
Hangyu Liu;Shouxi Luo;Ke Li;Huanlai Xing;Bo Peng
During the time-consuming training of deep neural network (DNN) models, the worker has to periodically create checkpoints for tensors such as the model parameters and optimizer state to support fast failover. However, due to the high overhead of checkpointing, existing schemes generally create checkpoints at a very low frequency, making recovery inefficient since unsaved training progress is lost. In this paper, we propose Checkflow, a low-overhead checkpointing scheme that enables per-iteration checkpointing for DNN training with minimal or even zero training slowdown. The power of Checkflow stems from (i) decoupling a tensor's checkpoint operation into snapshot-then-offload, and (ii) scheduling these operations appropriately, following the results of its analytical models. Our experimental results imply that, when the GPU-CPU connection has sufficient bandwidth for the training workload, Checkflow can theoretically overlap all the checkpoint operations of each training iteration with the training computation, with trivial or no overhead in peak GPU memory occupancy.
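A minimal sketch of the snapshot-then-offload decoupling, using plain host buffers rather than GPU tensors: the snapshot is a cheap copy taken at the iteration boundary, while the offload (the slow persistent write) runs on a background thread and overlaps the next iteration's compute. File names, sizes, and the sleep standing in for compute are all illustrative; Checkflow's actual scheduling model is not reproduced.

```cpp
#include <chrono>
#include <cstdio>
#include <fstream>
#include <future>
#include <string>
#include <thread>
#include <utility>
#include <vector>

using Tensor = std::vector<float>;

// Fast step: snapshot the live tensor at the iteration boundary (in-memory copy).
Tensor snapshot(const Tensor& live) { return live; }

// Slow step: offload a snapshot to persistent storage; runs on a background
// thread so it overlaps the next iteration's compute instead of stalling it.
void offload(Tensor snap, int step) {
  std::ofstream out("ckpt_step" + std::to_string(step) + ".bin",
                    std::ios::binary);
  out.write(reinterpret_cast<const char*>(snap.data()),
            static_cast<std::streamsize>(snap.size() * sizeof(float)));
}

int main() {
  Tensor params(1 << 20, 1.0f);
  std::future<void> pending;               // at most one offload in flight
  for (int step = 0; step < 3; ++step) {
    // Stand-in for this iteration's training compute.
    for (float& p : params) p += 0.001f;
    std::this_thread::sleep_for(std::chrono::milliseconds(20));

    if (pending.valid()) pending.wait();   // previous offload must be done
    Tensor snap = snapshot(params);        // per-iteration checkpoint
    pending = std::async(std::launch::async, offload, std::move(snap), step);
  }
  if (pending.valid()) pending.wait();
  std::printf("wrote 3 per-iteration checkpoints\n");
  return 0;
}
```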
{"title":"Checkflow: Low-Overhead Checkpointing for Deep Learning Training","authors":"Hangyu Liu;Shouxi Luo;Ke Li;Huanlai Xing;Bo Peng","doi":"10.1109/LCA.2025.3596616","DOIUrl":"https://doi.org/10.1109/LCA.2025.3596616","url":null,"abstract":"During the time-consuming training of deep neural network (DNN) models, the worker has to periodically create checkpoints for tensors like the model parameters and optimizer state to support fast failover. However, due to the high overhead of checkpointing, existing schemes generally create checkpoints at a very low frequency, making recovery inefficient since the unsaved training progress would get lost. In this paper, we propose Checkflow, a low-overhead checkpointing scheme, which enables per-iteration checkpointing for DNN training with minimal or even zero cost of training slowdown. The power of Checkflow stems from the design of <inline-formula><tex-math>$i)$</tex-math></inline-formula> decoupling a tensor’s checkpoint operation into snapshot-then-offload, and <inline-formula><tex-math>$ii)$</tex-math></inline-formula> scheduling these operations appropriately, following the results of the math models. Our experimental results imply that, when the GPU-CPU connection has sufficient bandwidth for the training workload, Checkflow can theoretically overlap all the checkpoint operations for each round of training with the training computation, with trivial or no overhead in peak GPU memory occupancy.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"281-284"},"PeriodicalIF":1.4,"publicationDate":"2025-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144926885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-01 | DOI: 10.1109/LCA.2025.3595003
Nayana Rajeev;Cathrene Biju;Titu Mary Ignatius;Roy Paily Palathinkal;Rekha K James
This paper presents RAESC, a reconfigurable Advanced Encryption Standard (AES) countermeasure hardware design that supports AES-128, AES-192, and AES-256, enhancing flexibility and resource efficiency in IoT applications. The design incorporates a countermeasure against Power-based Side Channel Attacks (PSCA) by randomizing the AES type based on the input plaintext, improving security. RAESC is integrated with an RV32IM RISC-V processor, offering streamlined operation and enhanced system security. Performance analysis shows that RAESC's adaptive encryption strength achieves a balanced trade-off among area, power, and throughput, making it well suited to resource-constrained, security-sensitive IoT applications. Power traces for CPA attacks are generated on an Application-Specific Integrated Circuit (ASIC), and the design achieves a notable reduction in Signal-to-Noise Ratio (SNR) and an increase in Measurements to Disclose (MTD), demonstrating strong resilience against power side-channel attacks.
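The abstract does not state how the plaintext drives the selection, so the sketch below simply folds the plaintext block into one byte and maps it to one of the three key lengths; it is meant only to show the type-randomization idea, not RAESC's selection logic or an AES datapath.

```cpp
#include <array>
#include <cstdint>
#include <cstdio>

// Hypothetical selector: fold the 16-byte plaintext block into one byte and
// pick AES-128/192/256 from it, so the key schedule an attacker must model
// changes with the data being encrypted.
enum class AesType { Aes128 = 128, Aes192 = 192, Aes256 = 256 };

AesType select_type(const std::array<std::uint8_t, 16>& plaintext) {
  std::uint8_t fold = 0;
  for (std::uint8_t b : plaintext) fold ^= b;
  switch (fold % 3) {
    case 0:  return AesType::Aes128;
    case 1:  return AesType::Aes192;
    default: return AesType::Aes256;
  }
}

int main() {
  std::array<std::uint8_t, 16> block{};
  for (int i = 0; i < 16; ++i) block[i] = static_cast<std::uint8_t>(i * 17);
  std::printf("selected AES-%d for this block\n",
              static_cast<int>(select_type(block)));
  return 0;
}
```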
{"title":"RAESC: A Reconfigurable AES Countermeasure Architecture for RISC-V With Enhanced Power Side-Channel Resilience","authors":"Nayana Rajeev;Cathrene Biju;Titu Mary Ignatius;Roy Paily Palathinkal;Rekha K James","doi":"10.1109/LCA.2025.3595003","DOIUrl":"https://doi.org/10.1109/LCA.2025.3595003","url":null,"abstract":"This paper presents RAESC, a reconfigurable Advanced Encryption Standard (AES) countermeasure hardware design that supports AES-128, AES-192, and AES-256 types, enhancing flexibility and resource efficiency in IoT applications. The design incorporates a countermeasure to protect against Power-based Side Channel Attacks (PSCA) by randomizing the AES type based on input plaintext, ensuring improved security. The RAESC is integrated with an RV32IM RISC-V processor, offering streamlined operation and enhanced system security. Performance analysis shows that RAESC’s adaptive encryption strength achieves a balanced trade-off in area, power, and throughput, making it ideal for resource-constrained, security-sensitive IoT applications. Power traces for CPA attacks are generated on Application Specific Integrated Circuit (ASIC) and the design achieves a notable reduction in the Signal to Noise Ratio (SNR) and an increase in the Measurements to Disclose (MTD), demonstrating strong resilience against cryptographic attacks.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"273-276"},"PeriodicalIF":1.4,"publicationDate":"2025-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144896825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-31 | DOI: 10.1109/LCA.2025.3594110
Mengting Zhang;Zhichuan Guo;Shining Sun
Remote Direct Memory Access (RDMA) enables low-latency datacenter networks but suffers from inefficient loss recovery using Go-Back-N (GBN). GBN retransmits entire packet windows, degrading Flow Completion Time (FCT) under congestion. We introduce RoSR, a novel selective retransmission architecture for Field-Programmable Gate Array (FPGA)-based RDMA NICs that supports hardware-accelerated direct writes of out-of-order (OoO) packets. RoSR supports efficient OoO packet reception and enables fine-grained retransmission using a dynamic shared bitmap for packet tracking. By extending the RDMA over Converged Ethernet version 2 (RoCEv2) packet format, RoSR facilitates selective retransmission. It triggers retransmissions via timeouts using bitmap blocks and introduces new Nack-bitmap and rd-req-bitmap messages for loss reporting. Under 1% packet loss, RoSR achieves up to 13.5× (RDMA Write) and 15.6× (RDMA Read) higher throughput than Xilinx ERNIC. In NS-3 simulations using the HPCC RDMA stack, RoSR reduces FCT slowdown by 3× to 6× compared to GBN across various packet loss rates, congestion control algorithms (DCQCN, HPCC, Timely), and traffic patterns, while maintaining robustness under high round-trip time (RTT) conditions.
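A behavioral sketch of the bitmap bookkeeping, assuming 64-PSN bitmap blocks: the receiver marks each arriving packet sequence number, and the holes below the highest PSN seen form the selective retransmission list, in contrast to Go-Back-N replaying the whole window. The structure below is illustrative and does not follow the letter's RoCEv2 header extension.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Receiver-side tracking: one bit per packet sequence number (PSN),
// grouped into 64-PSN bitmap blocks.
class RxBitmap {
 public:
  explicit RxBitmap(std::uint32_t max_psn) : blocks_((max_psn + 63) / 64, 0) {}

  void mark_received(std::uint32_t psn) {
    blocks_[psn / 64] |= 1ULL << (psn % 64);
    if (psn > highest_) highest_ = psn;
  }

  // Holes below the highest delivered PSN: these are the only PSNs the
  // sender has to retransmit (versus Go-Back-N replaying the whole window).
  std::vector<std::uint32_t> nack_list() const {
    std::vector<std::uint32_t> missing;
    for (std::uint32_t psn = 0; psn < highest_; ++psn)
      if (!(blocks_[psn / 64] & (1ULL << (psn % 64)))) missing.push_back(psn);
    return missing;
  }

 private:
  std::vector<std::uint64_t> blocks_;
  std::uint32_t highest_ = 0;
};

int main() {
  RxBitmap rx(256);
  for (std::uint32_t psn = 0; psn < 10; ++psn)
    if (psn != 3 && psn != 7) rx.mark_received(psn);   // PSNs 3 and 7 lost
  for (std::uint32_t psn : rx.nack_list()) std::printf("retransmit PSN %u\n", psn);
  return 0;
}
```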
{"title":"RoSR: A Novel Selective Retransmission FPGA Architecture for RDMA NICs","authors":"Mengting Zhang;Zhichuan Guo;Shining Sun","doi":"10.1109/LCA.2025.3594110","DOIUrl":"https://doi.org/10.1109/LCA.2025.3594110","url":null,"abstract":"Remote Direct Memory Access (RDMA) enables low-latency datacenter networks but suffers from inefficient loss recovery using Go-Back-N (GBN). GBN retransmits entire packet windows, degrading Flow Completion Time (FCT) under congestion. We introduce RoSR, a novel selective retransmission architecture for Field-Programmable Gate Array (FPGA)-based RDMA NICs that supports hardware-accelerated direct writes of out-of-order (OoO) packets. RoSR supports efficient OoO packet reception and enables fine-grained retransmission using a dynamic shared bitmap for packet tracking. By extending the RDMA over Converged Ethernet version 2 (RoCEv2) packet format, RoSR facilitates selective retransmission. It triggers retransmissions via timeouts using bitmap blocks and introduces new Nack-bitmap and rd-req-bitmap messages for loss reporting. Under 1% packet loss, RoSR achieves up to 13.5× (RDMA Write) and 15.6× (RDMA Read) higher throughput than Xilinx ERNIC. In NS-3 simulations using the HPCC RDMA stack, RoSR reduces FCT slowdown by 3× to 6× compared to GBN across various packet loss rates, congestion control algorithms (DCQCN, HPCC, Timely), and traffic patterns, while maintaining robustness under high round-trip time (RTT) conditions.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"269-272"},"PeriodicalIF":1.4,"publicationDate":"2025-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144896824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}