
Latest Publications in IEEE Computer Architecture Letters

RAESC: A Reconfigurable AES Countermeasure Architecture for RISC-V With Enhanced Power Side-Channel Resilience
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-08-01 | DOI: 10.1109/LCA.2025.3595003
Nayana Rajeev;Cathrene Biju;Titu Mary Ignatius;Roy Paily Palathinkal;Rekha K James
This paper presents RAESC, a reconfigurable Advanced Encryption Standard (AES) countermeasure hardware design that supports the AES-128, AES-192, and AES-256 variants, enhancing flexibility and resource efficiency in IoT applications. The design incorporates a countermeasure against Power-based Side-Channel Attacks (PSCA) that randomizes the AES variant based on the input plaintext, improving security. RAESC is integrated with an RV32IM RISC-V processor, offering streamlined operation and enhanced system security. Performance analysis shows that RAESC’s adaptive encryption strength achieves a balanced trade-off among area, power, and throughput, making it ideal for resource-constrained, security-sensitive IoT applications. Power traces for Correlation Power Analysis (CPA) attacks are generated on an Application-Specific Integrated Circuit (ASIC), and the design achieves a notable reduction in the Signal-to-Noise Ratio (SNR) and an increase in the Measurements to Disclose (MTD), demonstrating strong resilience against cryptographic attacks.
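As a rough illustration of the randomization idea, the Python sketch below picks one of the three supported AES variants from a hash of the input plaintext block. The hash-based selection rule and all names here are assumptions for illustration, not the paper's actual hardware logic.

```python
import hashlib

# Hypothetical plaintext-driven variant selection; the real design makes
# this choice in hardware, and its exact rule is not specified here.
AES_VARIANTS = (128, 192, 256)

def select_aes_variant(plaintext_block: bytes) -> int:
    """Map a plaintext block to one of the supported AES key sizes."""
    digest = hashlib.sha256(plaintext_block).digest()
    return AES_VARIANTS[digest[0] % len(AES_VARIANTS)]

if __name__ == "__main__":
    for msg in (b"block-0_16bytes!", b"block-1_16bytes!"):
        print(msg, "-> AES-%d" % select_aes_variant(msg))
```

Because the variant changes with the data itself, power traces collected against a fixed key no longer align across encryptions, which is the kind of misalignment that lowers SNR and raises MTD.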
Citations: 0
RoSR: A Novel Selective Retransmission FPGA Architecture for RDMA NICs
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-07-31 | DOI: 10.1109/LCA.2025.3594110
Mengting Zhang;Zhichuan Guo;Shining Sun
Remote Direct Memory Access (RDMA) enables low-latency datacenter networks but suffers from inefficient loss recovery using Go-Back-N (GBN). GBN retransmits entire packet windows, degrading Flow Completion Time (FCT) under congestion. We introduce RoSR, a novel selective retransmission architecture for Field-Programmable Gate Array (FPGA)-based RDMA NICs that supports hardware-accelerated direct writes of out-of-order (OoO) packets. RoSR supports efficient OoO packet reception and enables fine-grained retransmission using a dynamic shared bitmap for packet tracking. By extending the RDMA over Converged Ethernet version 2 (RoCEv2) packet format, RoSR facilitates selective retransmission. It triggers retransmissions via timeouts using bitmap blocks and introduces new Nack-bitmap and rd-req-bitmap messages for loss reporting. Under 1% packet loss, RoSR achieves up to 13.5× (RDMA Write) and 15.6× (RDMA Read) higher throughput than Xilinx ERNIC. In NS-3 simulations using the HPCC RDMA stack, RoSR reduces FCT slowdown by 3× to 6× compared to GBN across various packet loss rates, congestion control algorithms (DCQCN, HPCC, Timely), and traffic patterns, while maintaining robustness under high round-trip time (RTT) conditions.
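A toy model of the bitmap-based out-of-order tracking that enables selective retransmission might look like the sketch below; the window size, API, and sliding rule are illustrative assumptions, not the RTL.

```python
class ReceiveBitmap:
    """Toy model of bitmap-based out-of-order reception tracking.

    The Nack-bitmap concept follows the abstract; sizing and API here
    are illustrative assumptions, not RoSR's actual hardware design.
    """
    def __init__(self, base_psn: int, window: int = 64):
        self.base = base_psn          # first unacknowledged PSN
        self.window = window
        self.bits = 0                 # bit i set => PSN base+i received

    def receive(self, psn: int) -> None:
        off = psn - self.base
        if 0 <= off < self.window:
            self.bits |= 1 << off
        # slide the window past a contiguous received prefix
        while self.bits & 1:
            self.bits >>= 1
            self.base += 1

    def nack_bitmap(self) -> list:
        """PSNs to request in a selective-retransmission NACK."""
        highest = self.bits.bit_length()
        return [self.base + i for i in range(highest) if not (self.bits >> i) & 1]

rx = ReceiveBitmap(base_psn=100)
for psn in (100, 101, 103, 105):   # PSNs 102 and 104 are lost
    rx.receive(psn)
print(rx.nack_bitmap())            # -> [102, 104]
```

Reporting only the holes ([102, 104] above) instead of rolling back to PSN 102 is exactly the difference between selective retransmission and Go-Back-N.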
Citations: 0
SSD Offloading for LLM Mixture-of-Experts Weights Considered Harmful in Energy Efficiency
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-07-24 | DOI: 10.1109/LCA.2025.3592563
Kwanhee Kyung;Sungmin Yun;Jung Ho Ahn
Large Language Models (LLMs) applying Mixture-of-Experts (MoE) scale to trillions of parameters but require vast memory, motivating a line of research to offload expert weights from fast-but-small DRAM (HBM) to denser Flash SSDs. While SSDs provide cost-effective capacity, their read energy per bit is substantially higher than that of DRAM. This paper quantitatively analyzes the energy implications of offloading MoE expert weights to SSDs during the critical decode stage of LLM inference. Our analysis, comparing SSD, CPU memory (DDR), and HBM storage scenarios for models like DeepSeek-R1, reveals that offloading MoE weights to current SSDs drastically increases per-token-generation energy consumption (e.g., by up to $\sim 12\times$ compared to the HBM baseline), dominating the total inference energy budget. Although techniques like prefetching effectively hide access latency, they cannot mitigate this fundamental energy penalty. We further explore future technological scaling, finding that the inherent sparsity of MoE models could potentially make SSDs energy-viable if Flash read energy improves significantly, roughly by an order of magnitude.
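The core of the argument is simple energy arithmetic: per-token energy scales with the bytes of expert weights read, multiplied by the per-bit read energy of the storage medium. A back-of-envelope sketch follows, with placeholder energy-per-bit and footprint numbers chosen only to mirror the reported $\sim 12\times$ ratio; they are not the paper's measured values.

```python
# Back-of-envelope per-token energy for fetching MoE expert weights.
# All numbers below are illustrative assumptions (order-of-magnitude
# placeholders), not the paper's measurements.
PJ_PER_BIT = {"HBM": 4.0, "DDR": 15.0, "SSD": 50.0}   # assumed read energy

def energy_per_token_mj(active_expert_bytes: float, medium: str) -> float:
    """Energy (millijoules) to read the active experts' weights once per token."""
    bits = active_expert_bytes * 8
    return bits * PJ_PER_BIT[medium] * 1e-12 * 1e3    # pJ -> mJ

active_bytes = 6e9   # assumed: ~6 GB of expert weights touched per token
baseline = energy_per_token_mj(active_bytes, "HBM")
for medium in ("HBM", "DDR", "SSD"):
    e = energy_per_token_mj(active_bytes, medium)
    print(f"{medium}: {e:,.0f} mJ/token ({e / baseline:.1f}x HBM)")
```

With these assumed constants the SSD case lands at 12.5x the HBM baseline, illustrating why latency-hiding prefetch cannot help: the bits still have to be read from Flash, so the energy is paid regardless.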
Citations: 0
Correct Wrong Path
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-07-23 | DOI: 10.1109/LCA.2025.3542809
Bhargav Reddy Godala;Sankara Prasad Ramesh;Krishnam Tibrewala;Chrysanthos Pepi;Gino Chacon;Svilen Kanev;Gilles A. Pokam;Alberto Ros;Daniel A. Jiménez;Paul V. Gratz;David I. August
Modern OOO CPUs have very deep pipelines with large branch-misprediction recovery penalties. Speculatively executed instructions on the wrong path can significantly change cache state, depending on the level of speculation. Architects often employ trace-driven simulation models in the design-exploration stage, which sacrifice precision for speed. Trace-driven simulators are orders of magnitude faster than execution-driven models, reducing the often hundreds of thousands of simulation hours needed to explore new micro-architectural ideas. Despite these strong benefits, trace-driven simulation often fails to adequately model the consequences of wrong-path execution because obtaining such traces from real systems is nontrivial. Prior works exclusively consider either pollution or prefetching in the instruction stream/L1-I cache and often ignore the impact on the data stream. Here, we examine wrong-path execution in simulation results and design infrastructure for enabling wrong-path execution in a trace-driven simulator. Our analysis shows the wrong path extensively affects structures on both the instruction and data sides, resulting in performance variations ranging from −3.05% to 20.9% versus ignoring the wrong path. To benefit the research community and enhance the accuracy of simulators, we have opened our traces and tracing utility in the hope that industry can provide wrong-path traces generated by their internal simulators, enabling academic simulation without exposing industry IP.
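To make the modeling gap concrete, here is a minimal sketch of a trace format that flags squashed (wrong-path) entries, plus a replay loop that can include or drop them. The record layout and field names are assumptions for illustration, not the authors' actual trace format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TraceEntry:
    pc: int
    mem_addr: Optional[int]   # None for non-memory instructions
    wrong_path: bool          # True if the instruction was later squashed

def feed_cache(trace, cache: set, include_wrong_path: bool) -> None:
    """Replay memory references into a toy cache model (a set of 64 B block addresses)."""
    for e in trace:
        if e.mem_addr is None or (e.wrong_path and not include_wrong_path):
            continue
        cache.add(e.mem_addr >> 6)

trace = [TraceEntry(0x400000, 0x1000, False),
         TraceEntry(0x400004, 0x2000, True),   # squashed load: pollution or prefetch?
         TraceEntry(0x400008, 0x3000, False)]
with_wp, without_wp = set(), set()
feed_cache(trace, with_wp, True)
feed_cache(trace, without_wp, False)
print(sorted(with_wp), sorted(without_wp))   # cache state differs by the squashed block
```

Even in this toy, the two replays diverge in cache contents; whether that divergence helps (prefetching) or hurts (pollution) is workload-dependent, which matches the wide −3.05% to 20.9% range reported above.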
Citations: 0
Per-Row Activation Counting on Real Hardware: Demystifying Performance Overheads
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-07-10 | DOI: 10.1109/LCA.2025.3587293
Jumin Kim;Seungmin Baek;Minbok Wi;Hwayong Nam;Michael Jaemin Kim;Sukhan Lee;Kyomin Sohn;Jung Ho Ahn
Per-Row Activation Counting (PRAC), a DRAM read disturbance mitigation method, modifies key DRAM timing parameters, reportedly causing significant performance overheads in simulator-based studies. However, given known discrepancies between simulators and real hardware, real-machine experiments are vital for accurate PRAC performance estimation. We present the first real-machine performance analysis of PRAC. After verifying timing modifications on the latest CPUs using microbenchmarks, our analysis shows that PRAC’s average and maximum overheads are just 1.06% and 3.28% for the SPEC CPU2017 workloads—up to 9.15× lower than simulator-based reports. Further, we show that the close page policy minimizes this overhead by effectively hiding the elongated DRAM row precharge operations due to PRAC from the critical path.
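For readers unfamiliar with the mechanism, a functional model of per-row activation counting is tiny: every ACT bumps the target row's counter, and crossing a threshold triggers a mitigation and resets the count. The threshold below is an arbitrary illustration, not a JEDEC or measured figure.

```python
from collections import defaultdict

THRESHOLD = 1024   # assumed mitigation threshold, for illustration only

class PracBank:
    """Toy functional model of per-row activation counting in one bank."""
    def __init__(self):
        self.act_count = defaultdict(int)

    def activate(self, row: int) -> bool:
        """Count one ACT; returns True when the row triggers a mitigation."""
        self.act_count[row] += 1
        if self.act_count[row] >= THRESHOLD:
            self.act_count[row] = 0   # mitigated (e.g., neighbors refreshed)
            return True
        return False

bank = PracBank()
triggers = sum(bank.activate(7) for _ in range(4096))
print(triggers)   # -> 4 mitigations for 4096 activations of one row
```

The performance cost comes not from this counting logic itself but from the elongated precharge timing it requires, which is why a close-page policy, by issuing precharges early and off the critical path, hides most of the overhead.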
Citations: 0
Old is Gold: Optimizing Single-Threaded Applications With ExGen-Malloc
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-07-10 | DOI: 10.1109/LCA.2025.3587582
Ruihao Li;Lizy K. John;Neeraja J. Yadwadkar
Memory allocators, though constituting a small portion of the entire program code, can significantly impact application performance by affecting global factors such as cache behaviors. Moreover, memory allocators are often regarded as a “datacenter tax” inherent to all programs. Even a 1% improvement in performance can lead to significant cost and energy savings when scaled across an entire datacenter fleet. Modern memory allocators are designed to optimize allocation speed and memory fragmentation in multi-threaded environments, relying on complex metadata and control logic to achieve high performance. However, the overhead introduced by this complexity prompts a reevaluation of allocator design. Notably, such overhead can be avoided in single-threaded scenarios, which continue to be widely used across diverse application domains. In this paper, we present ExGen-Malloc, a memory allocator specifically optimized for single-threaded applications. We prototyped ExGen-Malloc on a real system and demonstrated that it achieves a geometric mean speedup of $1.19\times$ over dlmalloc and $1.03\times$ over mimalloc, a modern multi-threaded allocator developed by Microsoft, on the SPEC CPU2017 benchmark suite.
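To see what single-threaded operation buys, consider a toy first-fit free-list allocator with no locks and no per-thread caches — exactly the metadata and control logic a multi-threaded allocator must carry. This sketch is illustrative only and does not reproduce ExGen-Malloc's actual design.

```python
class FirstFitAllocator:
    """Toy single-threaded allocator: a sorted free list with first-fit
    allocation and coalescing on free. No locks, no per-thread caches --
    the simplification a single-threaded setting permits."""
    def __init__(self, heap_size: int):
        self.free = [(0, heap_size)]           # (offset, size), kept sorted

    def malloc(self, size: int):
        for i, (off, sz) in enumerate(self.free):
            if sz >= size:
                rest = (off + size, sz - size)
                self.free[i:i + 1] = [rest] if rest[1] else []
                return off
        return None                            # out of memory

    def free_block(self, off: int, size: int):
        self.free.append((off, size))
        self.free.sort()
        merged = []
        for o, s in self.free:                 # coalesce adjacent runs
            if merged and merged[-1][0] + merged[-1][1] == o:
                merged[-1] = (merged[-1][0], merged[-1][1] + s)
            else:
                merged.append((o, s))
        self.free = merged

heap = FirstFitAllocator(1024)
a = heap.malloc(128); b = heap.malloc(64)
heap.free_block(a, 128)
print(heap.malloc(100))   # -> 0: reuses the freed 128 B hole
```

Every path above is branch-light and touches one contiguous structure, which is the kind of cache-friendly behavior the letter argues single-threaded allocators can exploit.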
Citations: 0
Contention-Aware GPU Thread Block Scheduler for Efficient GPU-SSD
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-07-07 | DOI: 10.1109/LCA.2025.3586312
Xueyang Liu;Seonjin Na;Euijun Chung;Jiashen Cao;Jing Yang;Hyesoon Kim
Growing dataset sizes in LLMs have made low-cost SSDs a popular solution for extending GPU memory in mobile devices. In this paper, we introduce CA-Scheduler, a contention-aware scheduling scheme for GPU-initiated SSD access. The key insight behind CA-Scheduler is to leverage the BSP GPU programming model, which allows work to be reordered at the thread-block level to optimize SSD throughput. By capitalizing on the predictable memory-access patterns of GPU thread blocks, CA-Scheduler anticipates SSD locations to minimize contention and improve performance.
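The flavor of the idea can be sketched in a few lines: predict each thread block's SSD channel from its first accessed page, then interleave launches across channels so no channel is hit in bursts. The page-to-channel striping rule and scheduling policy here are assumptions for illustration, not CA-Scheduler's actual algorithm.

```python
from collections import defaultdict

NUM_CHANNELS = 4

def predicted_channel(first_page: int) -> int:
    return first_page % NUM_CHANNELS         # assumed page-to-channel striping

def schedule(blocks):
    """blocks: [(block_id, first_page)] -> launch order spread across channels."""
    per_channel = defaultdict(list)
    for bid, page in blocks:
        per_channel[predicted_channel(page)].append(bid)
    # round-robin across channels => adjacent launches hit different channels
    lanes = list(per_channel.values())
    order = []
    while any(lanes):
        for lane in lanes:
            if lane:
                order.append(lane.pop(0))
    return order

blocks = [(0, 0), (1, 4), (2, 8), (3, 1), (4, 5), (5, 2)]
print(schedule(blocks))   # -> [0, 3, 5, 1, 4, 2]: channel bursts broken up
```

The BSP model makes this reordering legal: thread blocks within a kernel carry no ordering dependence, so the scheduler is free to permute them for I/O throughput.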
Citations: 0
HPN-SpGEMM: Hybrid PIM-NMP for SpGEMM
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-06-27 | DOI: 10.1109/LCA.2025.3583758
Kwangrae Kim;Ki-Seok Chung
Sparse matrix-matrix multiplication (SpGEMM) is widely used in various scientific computing applications. However, the performance of SpGEMM is typically bound by memory performance due to irregular access patterns. Prior accelerators leveraging high-bandwidth memory (HBM) with optimized data flows still face limitations in handling sparse matrices with varying sizes and sparsity levels. We propose HPN-SpGEMM, a hybrid architecture that employs both processing-in-memory (PIM) cores inside bank groups and near-memory-processing (NMP) cores in the logic die of an HBM memory. To the best of our knowledge, this is the first hybrid architecture for SpGEMM that leverages both PIM cores and NMP cores. Evaluation results demonstrate significant performance gains, effectively overcoming memory-bound constraints.
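For context, the irregular inner loop that makes SpGEMM memory-bound is visible even in a dictionary-based Gustavson (row-wise) formulation; the scattered lookups into B and the hash-style accumulation below are the access pattern the hybrid PIM/NMP split targets.

```python
# Row-wise (Gustavson) SpGEMM on dict-of-dicts sparse matrices.
def spgemm(A, B):
    """A, B: {row: {col: value}} sparse matrices; returns C = A @ B."""
    C = {}
    for i, a_row in A.items():
        acc = {}
        for k, a_ik in a_row.items():              # for each nonzero A[i,k]...
            for j, b_kj in B.get(k, {}).items():   # ...walk row k of B
                acc[j] = acc.get(j, 0.0) + a_ik * b_kj
        if acc:
            C[i] = acc
    return C

A = {0: {0: 1.0, 2: 2.0}, 1: {1: 3.0}}
B = {0: {1: 4.0}, 1: {0: 5.0}, 2: {1: 6.0}}
print(spgemm(A, B))   # -> {0: {1: 16.0}, 1: {0: 15.0}}
```

Which rows of B get touched depends entirely on A's sparsity pattern, so the accesses are data-dependent and cache-hostile — the motivation for moving the accumulation close to memory.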
Citations: 0
SAFE: Sharing-Aware Prefetching for Efficient GPU Memory Management With Unified Virtual Memory
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-06-24 | DOI: 10.1109/LCA.2025.3553143
Hyunkyun Shin;Seongtae Bang;Hyungwon Park;Daehoon Kim
As the demand for GPU memory from applications such as machine learning continues to grow exponentially, maximizing GPU memory capacity has become increasingly important. Unified Virtual Memory (UVM), which combines host and GPU memory into a unified address space, allows GPUs to utilize more memory than their physical capacity. However, this advantage comes at the cost of significant overheads when accessing host memory. Although existing prefetching techniques help alleviate these overheads, they still encounter challenges when dealing with irregular workloads and dynamic mixed workloads. In this paper, we demonstrate that the regularity of workloads is strongly correlated with the sharing status of UVM memory blocks among the Streaming Multiprocessors (SMs) of GPUs, which in turn impacts the effectiveness of prefetching. In addition, we propose the Sharing Aware preFEtching technique, SAFE, which dynamically adjusts prefetching strategies based on the sharing status of the accessed memory blocks. SAFE efficiently tracks the sharing status of the memory blocks by leveraging unified TLBs (uTLBs) and enforces tailored prefetching configurations for each block. This approach requires no hardware modifications and incurs negligible performance overhead. Our evaluation shows that SAFE achieves up to a 6.5× performance improvement over UVM default prefetcher for workloads with predominantly irregular memory access patterns, with an average improvement of 3.6×.
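A minimal sketch of sharing-aware policy selection: track which SMs have touched each UVM block and pick a prefetch degree from the sharer count. The direction of the mapping, the threshold, and the degrees below are assumptions for illustration only; choosing the right per-block configuration is the letter's contribution.

```python
class SharingAwarePrefetcher:
    """Toy sharer tracking per UVM block, standing in for SAFE's uTLB-based
    mechanism. Policy values here are illustrative assumptions."""
    def __init__(self, shared_threshold: int = 4):
        self.sharers = {}                     # block -> set of SM ids
        self.threshold = shared_threshold

    def on_access(self, block: int, sm_id: int) -> int:
        """Record an access and return a prefetch degree (blocks ahead)."""
        self.sharers.setdefault(block, set()).add(sm_id)
        widely_shared = len(self.sharers[block]) >= self.threshold
        return 1 if widely_shared else 8      # conservative vs aggressive

pf = SharingAwarePrefetcher()
degree = 0
for sm in range(6):
    degree = pf.on_access(block=42, sm_id=sm)
print(degree)   # -> 1: block 42 is widely shared, so prefetch conservatively
```

Piggybacking the sharer sets on existing uTLB state, as the abstract describes, is what lets the real design avoid extra hardware and keep the tracking overhead negligible.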
Citations: 0
HINT: A Hardware Platform for Intra-Host NIC Traffic and SmartNIC Emulation
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-06-23 | DOI: 10.1109/LCA.2025.3582481
Jiaqi Lou;Yu Li;Srikar Vanavasam;Nam Sung Kim
Recent performance advancements in inter-host networking demand innovations in intra-host communication and SmartNIC-accelerated in-network processing. However, developing novel SmartNIC features remains difficult due to the absence of hardware observability and of low-cost, deterministic testing environments in existing software-based or commercial development platforms. While FPGA-based SmartNICs offer high flexibility and performance for packet-processing acceleration, existing solutions support only a limited subset of the network technologies widely used in commercial datacenters. To address these challenges, we introduce HINT, an FPGA-based development and emulation platform that transparently mimics a commercial SmartNIC in the system, featuring controlled network-traffic generation with a high-performance traffic engine and kernel-bypass network technologies. It also supports configurable workload patterns, nanosecond-level latency measurement, and a reconfigurable Receive Side Scaling (RSS) engine for load balancing. Our evaluation shows that HINT achieves 91% of PCIe’s theoretical efficiency, providing a highly effective and scalable platform to emulate an end-to-end system with support for diverse network stacks. HINT thus establishes an accessible, high-fidelity platform for SmartNIC development and emulation, along with architectural exploration of intra-host communication.
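As one concrete piece, an RSS engine is essentially a hash over the flow 5-tuple indexing a reconfigurable indirection table of RX queues. The stand-in below illustrates that structure; real NICs typically use a Toeplitz hash, and CRC32 is used here only for brevity, so table size and queue count are assumptions.

```python
import zlib

# Reconfigurable indirection table: entry = RX queue for that hash bucket.
INDIRECTION_TABLE = [q % 4 for q in range(128)]   # 4 RX queues assumed

def rx_queue(src_ip: str, dst_ip: str, src_port: int, dst_port: int, proto: int) -> int:
    """Hash the flow 5-tuple and look up its RX queue in the table."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return INDIRECTION_TABLE[zlib.crc32(key) % len(INDIRECTION_TABLE)]

print(rx_queue("10.0.0.1", "10.0.0.2", 12345, 80, 6))   # same flow -> same queue
```

Because the table, not the hash, decides the final queue, rebalancing load is just a table rewrite — which is what makes the engine "reconfigurable" without touching the datapath.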
Citations: 0