
Latest Publications in IEEE Computer Architecture Letters

Old is Gold: Optimizing Single-Threaded Applications With ExGen-Malloc
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 Computer Science, Hardware & Architecture · Pub Date: 2025-07-10 · DOI: 10.1109/LCA.2025.3587582
Ruihao Li;Lizy K. John;Neeraja J. Yadwadkar
Memory allocators, though constituting a small portion of the entire program code, can significantly impact application performance by affecting global factors such as cache behaviors. Moreover, memory allocators are often regarded as a “datacenter tax” inherent to all programs. Even a 1% improvement in performance can lead to significant cost and energy savings when scaled across an entire datacenter fleet. Modern memory allocators are designed to optimize allocation speed and memory fragmentation in multi-threaded environments, relying on complex metadata and control logic to achieve high performance. However, the overhead introduced by this complexity prompts a reevaluation of allocator design. Notably, such overhead can be avoided in single-threaded scenarios, which continue to be widely used across diverse application domains. In this paper, we present ExGen-Malloc, a memory allocator specifically optimized for single-threaded applications. We prototyped ExGen-Malloc on a real system and demonstrated that it achieves a geometric mean speedup of 1.19× over dlmalloc and 1.03× over mimalloc, a modern multi-threaded allocator developed by Microsoft, on the SPEC CPU2017 benchmark suite.
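As a rough illustration of the trade-off this abstract describes (a hedged sketch only: the size classes, heap layout, and names below are invented, not ExGen-Malloc's design), a single-threaded allocator can drop all locking and atomic metadata and keep simple per-size-class free lists:

```python
class SizeClassAllocator:
    """Single-threaded allocator model: per-size-class free lists, no locks,
    no atomic metadata. All sizes and classes here are invented for illustration."""

    SIZE_CLASSES = (16, 32, 64, 128, 256)

    def __init__(self, heap_size=4096):
        self.heap_size = heap_size
        self.brk = 0                                  # bump pointer into the heap
        self.free_lists = {c: [] for c in self.SIZE_CLASSES}

    def _size_class(self, n):
        for c in self.SIZE_CLASSES:
            if n <= c:
                return c
        raise ValueError("request too large for this sketch")

    def malloc(self, n):
        c = self._size_class(n)
        if self.free_lists[c]:                        # fast path: pop a freed block
            return self.free_lists[c].pop()
        if self.brk + c > self.heap_size:
            raise MemoryError
        addr, self.brk = self.brk, self.brk + c       # slow path: bump-allocate
        return addr

    def free(self, addr, n):
        self.free_lists[self._size_class(n)].append(addr)

alloc = SizeClassAllocator()
a = alloc.malloc(24)          # lands in the 32-byte class at offset 0
alloc.free(a, 24)
b = alloc.malloc(30)          # same class: reuses the freed block
```

Because there is only one thread, both paths are a handful of plain loads and stores, which is the kind of overhead reduction the paper targets.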
{"title":"Old is Gold: Optimizing Single-Threaded Applications With ExGen-Malloc","authors":"Ruihao Li;Lizy K. John;Neeraja J. Yadwadkar","doi":"10.1109/LCA.2025.3587582","DOIUrl":"https://doi.org/10.1109/LCA.2025.3587582","url":null,"abstract":"Memory allocators, though constituting a small portion of the entire program code, can significantly impact application performance by affecting global factors such as cache behaviors. Moreover, memory allocators are often regarded as a “datacenter tax” inherent to all programs. Even a 1% improvement in performance can lead to significant cost and energy savings when scaled across an entire datacenter fleet. Modern memory allocators are designed to optimize allocation speed and memory fragmentation in multi-threaded environments, relying on complex metadata and control logic to achieve high performance. However, the overhead introduced by this complexity prompts a reevaluation of allocator design. Notably, such overhead can be avoided in single-threaded scenarios, which continue to be widely used across diverse application domains. In this paper, we present <i>ExGen-Malloc</i>, a memory allocator specifically optimized for single-threaded applications. 
We prototyped <i>ExGen-Malloc</i> on a real system and demonstrated that it achieves a geometric mean speedup of <inline-formula><tex-math>$1.19 times$</tex-math></inline-formula> over dlmalloc and <inline-formula><tex-math>$1.03 times$</tex-math></inline-formula> over mimalloc, a modern multi-threaded allocator developed by Microsoft, on the SPEC CPU2017 benchmark suite.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"225-228"},"PeriodicalIF":1.4,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144687664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Contention-Aware GPU Thread Block Scheduler for Efficient GPU-SSD
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 Computer Science, Hardware & Architecture · Pub Date: 2025-07-07 · DOI: 10.1109/LCA.2025.3586312
Xueyang Liu;Seonjin Na;Euijun Chung;Jiashen Cao;Jing Yang;Hyesoon Kim
The growing dataset sizes of LLMs have made low-cost SSDs a popular solution for extending GPU memory in mobile devices. In this paper, we introduce CA-Scheduler, a contention-aware scheduling scheme for GPU-initiated SSD access. The key insight behind CA-Scheduler is to leverage the BSP GPU programming model, which allows reordering work at the thread block level to optimize SSD throughput. By capitalizing on the predictable memory access patterns of GPU thread blocks, CA-Scheduler anticipates SSD locations to minimize contention and improve performance.
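The reordering idea can be sketched as follows (hypothetical: the channel-mapping function and round-robin policy below are stand-ins, not CA-Scheduler's actual algorithm):

```python
from collections import defaultdict, deque

def contention_aware_order(block_ids, page_of, channels=4):
    """Reorder thread blocks so consecutive blocks are predicted to hit
    different SSD channels, spreading contention round-robin."""
    per_channel = defaultdict(deque)
    for b in block_ids:
        # page_of(b) predicts which flash page block b will touch (illustrative)
        per_channel[page_of(b) % channels].append(b)
    order = []
    while any(per_channel.values()):
        for ch in range(channels):
            if per_channel[ch]:
                order.append(per_channel[ch].popleft())
    return order

# Naively, blocks 0-3 all queue on channel 0 and 4-7 on channel 1;
# the reordering interleaves them instead.
blocks = list(range(8))
order = contention_aware_order(blocks, page_of=lambda b: b // 4, channels=4)
# -> [0, 4, 1, 5, 2, 6, 3, 7]
```

The BSP model makes this legal: thread blocks within a kernel carry no ordering constraint, so the scheduler is free to permute them.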
{"title":"Contention-Aware GPU Thread Block Scheduler for Efficient GPU-SSD","authors":"Xueyang Liu;Seonjin Na;Euijun Chung;Jiashen Cao;Jing Yang;Hyesoon Kim","doi":"10.1109/LCA.2025.3586312","DOIUrl":"https://doi.org/10.1109/LCA.2025.3586312","url":null,"abstract":"The growing dataset sizes in LLM have made low-cost SSDs a popular solution for extending GPU memory in mobile devices. In this paper, we introduce <monospace>CA-Scheduler</monospace>, a contention-aware scheduling scheme for GPU-initiated SSD access. The key insight behind <monospace>CA-Scheduler</monospace> is leveraging the BSP GPU programming model, which allows reordering work at the thread block level to optimize SSD throughput. By capitalizing on the predictable memory access patterns of GPU thread blocks, <monospace>CA-Scheduler</monospace> anticipates SSD locations to minimize contention and improve performance.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"257-260"},"PeriodicalIF":1.4,"publicationDate":"2025-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144896798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
HPN-SpGEMM: Hybrid PIM-NMP for SpGEMM
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 Computer Science, Hardware & Architecture · Pub Date: 2025-06-27 · DOI: 10.1109/LCA.2025.3583758
Kwangrae Kim;Ki-Seok Chung
Sparse matrix-matrix multiplication (SpGEMM) is widely used in various scientific computing applications. However, the performance of SpGEMM is typically bound by memory performance due to irregular access patterns. Prior accelerators leveraging high-bandwidth memory (HBM) with optimized data flows still face limitations in handling sparse matrices with varying sizes and sparsity levels. We propose HPN-SpGEMM, a hybrid architecture that employs both processing-in-memory (PIM) cores inside bank groups and near-memory-processing (NMP) cores in the logic die of an HBM memory. To the best of our knowledge, this is the first hybrid architecture for SpGEMM that leverages both PIM cores and NMP cores. Evaluation results demonstrate significant performance gains, effectively overcoming memory-bound constraints.
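For reference, the core SpGEMM kernel (Gustavson's row-wise formulation) whose irregular accumulator accesses make the problem memory-bound looks roughly like this; a plain-Python sketch, unrelated to the paper's PIM/NMP hardware:

```python
def spgemm(A, B):
    """A, B: {row: {col: val}} sparse matrices; returns C = A @ B in the same form."""
    C = {}
    for i, a_row in A.items():
        acc = {}                          # per-row accumulator: irregular accesses
        for k, a_ik in a_row.items():
            # Each nonzero A[i][k] scatters a scaled copy of row k of B
            # into the accumulator -- the access pattern depends on sparsity.
            for j, b_kj in B.get(k, {}).items():
                acc[j] = acc.get(j, 0) + a_ik * b_kj
        if acc:
            C[i] = acc
    return C

A = {0: {0: 1, 2: 2}, 1: {1: 3}}
B = {0: {1: 4}, 1: {0: 5}, 2: {1: 6}}
C = spgemm(A, B)    # -> {0: {1: 16}, 1: {0: 15}}
```

The data-dependent scatter into `acc` is exactly the kind of irregular traffic that motivates placing compute near or inside the memory.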
{"title":"HPN-SpGEMM: Hybrid PIM-NMP for SpGEMM","authors":"Kwangrae Kim;Ki-Seok Chung","doi":"10.1109/LCA.2025.3583758","DOIUrl":"https://doi.org/10.1109/LCA.2025.3583758","url":null,"abstract":"Sparse matrix-matrix multiplication (SpGEMM) is widely used in various scientific computing applications. However, the performance of SpGEMM is typically bound by memory performance due to irregular access patterns. Prior accelerators leveraging high-bandwidth memory (HBM) with optimized data flows still face limitations in handling sparse matrices with varying sizes and sparsity levels. We propose HPN-SpGEMM, a hybrid architecture that employs both processing-in-memory (PIM) cores inside bank groups and near-memory-processing (NMP) cores in the logic die of an HBM memory. To the best of our knowledge, this is the first hybrid architecture for SpGEMM that leverages both PIM cores and NMP cores. Evaluation results demonstrate significant performance gains, effectively overcoming memory-bound constraints.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"209-212"},"PeriodicalIF":1.4,"publicationDate":"2025-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144680825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
SAFE: Sharing-Aware Prefetching for Efficient GPU Memory Management With Unified Virtual Memory
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 Computer Science, Hardware & Architecture · Pub Date: 2025-06-24 · DOI: 10.1109/LCA.2025.3553143
Hyunkyun Shin;Seongtae Bang;Hyungwon Park;Daehoon Kim
As the demand for GPU memory from applications such as machine learning continues to grow exponentially, maximizing GPU memory capacity has become increasingly important. Unified Virtual Memory (UVM), which combines host and GPU memory into a unified address space, allows GPUs to utilize more memory than their physical capacity. However, this advantage comes at the cost of significant overheads when accessing host memory. Although existing prefetching techniques help alleviate these overheads, they still encounter challenges when dealing with irregular workloads and dynamic mixed workloads. In this paper, we demonstrate that the regularity of workloads is strongly correlated with the sharing status of UVM memory blocks among the Streaming Multiprocessors (SMs) of GPUs, which in turn impacts the effectiveness of prefetching. In addition, we propose the Sharing Aware preFEtching technique, SAFE, which dynamically adjusts prefetching strategies based on the sharing status of the accessed memory blocks. SAFE efficiently tracks the sharing status of the memory blocks by leveraging unified TLBs (uTLBs) and enforces tailored prefetching configurations for each block. This approach requires no hardware modifications and incurs negligible performance overhead. Our evaluation shows that SAFE achieves up to a 6.5× performance improvement over the UVM default prefetcher for workloads with predominantly irregular memory access patterns, with an average improvement of 3.6×.
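The sharing-based policy can be caricatured as follows (a hedged sketch: the threshold and prefetch sizes are invented, and SAFE's real tracking piggybacks on uTLB state rather than a software table):

```python
from collections import defaultdict

class SharingAwarePrefetcher:
    """Toy policy: blocks touched by many SMs (irregular sharing) get small,
    conservative prefetches; private blocks get large, aggressive ones."""

    def __init__(self, shared_threshold=2):
        self.sharers = defaultdict(set)       # block id -> set of SM ids seen
        self.shared_threshold = shared_threshold

    def record_access(self, block, sm_id):
        self.sharers[block].add(sm_id)

    def prefetch_size(self, block):
        if len(self.sharers[block]) >= self.shared_threshold:
            return 64 * 1024                  # shared block: conservative 64 KiB
        return 2 * 1024 * 1024                # private block: aggressive 2 MiB

pf = SharingAwarePrefetcher()
pf.record_access(block=7, sm_id=0)
pf.record_access(block=7, sm_id=3)            # block 7 is now shared by two SMs
pf.record_access(block=9, sm_id=1)            # block 9 stays private
```

The point of the sketch is only the decision shape: sharing degree selects the prefetch configuration per block.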
{"title":"SAFE: Sharing-Aware Prefetching for Efficient GPU Memory Management With Unified Virtual Memory","authors":"Hyunkyun Shin;Seongtae Bang;Hyungwon Park;Daehoon Kim","doi":"10.1109/LCA.2025.3553143","DOIUrl":"https://doi.org/10.1109/LCA.2025.3553143","url":null,"abstract":"As the demand for GPU memory from applications such as machine learning continues to grow exponentially, maximizing GPU memory capacity has become increasingly important. Unified Virtual Memory (UVM), which combines host and GPU memory into a unified address space, allows GPUs to utilize more memory than their physical capacity. However, this advantage comes at the cost of significant overheads when accessing host memory. Although existing prefetching techniques help alleviate these overheads, they still encounter challenges when dealing with irregular workloads and dynamic mixed workloads. In this paper, we demonstrate that the regularity of workloads is strongly correlated with the sharing status of UVM memory blocks among the Streaming Multiprocessors (SMs) of GPUs, which in turn impacts the effectiveness of prefetching. In addition, we propose the <bold>S</b>haring <bold>A</b>ware pre<bold>FE</b>tching technique, <monospace>SAFE</monospace>, which dynamically adjusts prefetching strategies based on the sharing status of the accessed memory blocks. <monospace>SAFE</monospace> efficiently tracks the sharing status of the memory blocks by leveraging unified TLBs (uTLBs) and enforces tailored prefetching configurations for each block. This approach requires no hardware modifications and incurs negligible performance overhead. 
Our evaluation shows that <monospace>SAFE</monospace> achieves up to a 6.5× performance improvement over UVM default prefetcher for workloads with predominantly irregular memory access patterns, with an average improvement of 3.6×.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"117-120"},"PeriodicalIF":1.4,"publicationDate":"2025-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144472587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
HINT: A Hardware Platform for Intra-Host NIC Traffic and SmartNIC Emulation
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 Computer Science, Hardware & Architecture · Pub Date: 2025-06-23 · DOI: 10.1109/LCA.2025.3582481
Jiaqi Lou;Yu Li;Srikar Vanavasam;Nam Sung Kim
Recent performance advancements in inter-host networking demand innovations in intra-host communication and SmartNIC-accelerated in-network processing. However, developing novel SmartNIC features remains difficult due to the absence of hardware observability and of low-cost, deterministic testing environments in existing software-based or commercial development platforms. While FPGA-based SmartNICs offer high flexibility and performance for packet processing acceleration, existing solutions support only a limited subset of the network technologies widely used in commercial datacenters. To address these challenges, we introduce HINT, an FPGA-based development and emulation platform that transparently mimics a commercial SmartNIC in the system, featuring controlled network traffic generation with a high-performance traffic engine and kernel-bypass network technologies. It also supports configurable workload patterns, nanosecond-level latency measurement, and a reconfigurable Receive Side Scaling (RSS) engine for load balancing. Our evaluation shows that HINT achieves 91% of PCIe’s theoretical efficiency, providing a highly effective and scalable platform to emulate an end-to-end system with support for diverse network stacks. HINT thus establishes an accessible, high-fidelity platform for SmartNIC development and emulation, along with architectural exploration of intra-host communication.
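For orientation, an RSS engine steers flows roughly as below (illustrative only: real RSS hardware hashes the packet 4-tuple with a keyed Toeplitz hash; CRC32 stands in for it here, and the table layout is invented):

```python
import zlib

def rss_queue(src_ip, dst_ip, src_port, dst_port, indirection_table):
    """Hash the flow 4-tuple, then index an indirection table to pick
    a receive queue. Reconfiguring the table rebalances load."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    h = zlib.crc32(key)                      # stand-in for the Toeplitz hash
    return indirection_table[h % len(indirection_table)]

# An 8-entry indirection table spreading flows over 4 receive queues:
table = [0, 1, 2, 3, 0, 1, 2, 3]
q1 = rss_queue("10.0.0.1", "10.0.0.2", 5000, 80, table)
q2 = rss_queue("10.0.0.1", "10.0.0.2", 5000, 80, table)  # same flow, same queue
```

The property that matters, and that a reconfigurable RSS engine preserves, is that packets of one flow always land on one queue.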
{"title":"HINT: A Hardware Platform for Intra-Host NIC Traffic and SmartNIC Emulation","authors":"Jiaqi Lou;Yu Li;Srikar Vanavasam;Nam Sung Kim","doi":"10.1109/LCA.2025.3582481","DOIUrl":"https://doi.org/10.1109/LCA.2025.3582481","url":null,"abstract":"Recent performance advancements in inter-host networking demand innovations in intra-host communication and SmartNIC-accelerated in-network processing. However, developing novel SmartNIC features remains difficult due to absence of hardware observability and low-cost, deterministic testing environments with existing software-based or commercial development platforms. While FPGA-based SmartNICs offer high flexibility and performance for packet processing acceleration, existing solutions support only a limited subset of network technologies widely used in commercial datacenters. To address these challenges, we introduce HINT, an FPGA-based development and emulation platform that transparently mimics a commercial SmartNIC in the system, featuring controlled network traffic generation with a high-performance traffic engine and kernel-bypass network technologies. It also supports configurable workload patterns, nanosecond-level latency measurement, and a reconfigurable Receive Side Scaling (RSS) engine for load balancing. Our evaluation shows that HINT achieves 91% of PCIe’s theoretical efficiency, providing a highly effective and scalable platform to emulate an end-to-end system with support for diverse network stacks. 
HINT thus establishes an accessible, high-fidelity platform for SmartNIC development and emulation, along with architectural exploration of intra-host communication.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"261-264"},"PeriodicalIF":1.4,"publicationDate":"2025-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11048525","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144880525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Time Series Machine Learning Models for Precise SSD Access Latency Prediction
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 Computer Science, Hardware & Architecture · Pub Date: 2025-06-20 · DOI: 10.1109/LCA.2025.3581580
Bikrant Das Sharma;Houxiang Ji;Ipoom Jeong;Nam Sung Kim
Solid State Drives (SSDs) have become the dominant storage solution over the past few years. A key component of SSDs is the controller, which manages communication between the host and flash memory, optimizing data transfer speeds, integrity, and lifespan. However, modern SSDs function as black boxes, as manufacturers do not disclose firmware and controller details. Meanwhile, read and write latencies are affected by various internal optimizations, such as wear-leveling and garbage collection, making precise latency prediction challenging. Existing approaches rely on trace-driven simulation or machine learning, but either (1) just classify operations into broad latency categories (e.g., fast or slow), including software stack overhead, or (2) make imprecise predictions while consuming significant system resources and time. For system simulation, latency predictions must be both fast and accurate, focusing solely on device-level delays excluding OS overhead, which is modeled separately. To tackle these challenges, this paper presents time series machine learning models to accurately predict hardware-only SSD latencies across diverse workloads. Our evaluation shows that the proposed model predicts 85%–95% of individual I/O latencies within a 10% error margin, outperforming existing simulators and ML models, which achieve only 6%–37% accuracy, while also providing 4×–255× speedups in prediction latency.
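The time-series framing can be made concrete with a toy example (purely illustrative: the paper's models are far richer than this sliding-window setup and window-average predictor):

```python
def window_features(latencies, w=3):
    """Turn a latency trace into (window, next-latency) training pairs --
    the standard supervised framing for time-series regression."""
    return [(latencies[i - w:i], latencies[i]) for i in range(w, len(latencies))]

def predict_next(latencies, w=3):
    """Naive baseline predictor: mean of the last w observations.
    A learned model would replace this with a fitted function of the window."""
    return sum(latencies[-w:]) / w

# A toy device-level trace (microseconds) with one GC-induced spike:
trace = [100, 102, 98, 100, 400, 100, 102]
pairs = window_features(trace, w=3)
pred = predict_next(trace, w=3)
```

Framing prediction over recent history is what lets a model absorb state-dependent effects (garbage collection, wear-leveling) that a stateless per-request model cannot see.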
{"title":"Time Series Machine Learning Models for Precise SSD Access Latency Prediction","authors":"Bikrant Das Sharma;Houxiang Ji;Ipoom Jeong;Nam Sung Kim","doi":"10.1109/LCA.2025.3581580","DOIUrl":"https://doi.org/10.1109/LCA.2025.3581580","url":null,"abstract":"Solid State Drives (SSDs) have become the dominant storage solution over the past few years. A key component of SSDs is the controller, which manages communication between the host and flash memory, optimizing data transfer speeds, integrity, and lifespan. However, modern SSDs function as closed boxes, as manufacturers do not disclose firmware and controller details. Meanwhile, read and write latencies are affected by various internal optimizations, such as wear-leveling and garbage collection, making precise latency prediction challenging. Existing approaches rely on trace-driven simulation or machine learning, but either (1) just classify operations into broad latency categories (e.g., fast or slow), including software stack overhead, or (2) make imprecise predictions while consuming significant system resources and time. For system simulation, latency predictions must be both fast and accurate, focusing solely on device-level delays excluding OS overhead, which is modeled separately. To tackle these challenges, this paper presents time series machine learning models to accurately predict hardware-only SSD latencies across diverse workloads. 
Our evaluation shows that the proposed model predicts 85%–95% of individual I/O latencies within a 10% error margin, outperforming existing simulators and ML models, which achieve only 6%–37% accuracy, while also providing 4×–255× speedups in prediction latency.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"233-236"},"PeriodicalIF":1.4,"publicationDate":"2025-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144814153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
MOST: Memory Oversubscription-Aware Scheduling for Tensor Migration on GPU Unified Storage
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 Computer Science, Hardware & Architecture · Pub Date: 2025-06-17 · DOI: 10.1109/LCA.2025.3580264
Junsu Kim;Jaebeom Jeon;Jaeyong Park;Sangun Choi;Minseong Gil;Seokin Hong;Gunjae Koo;Myung Kuk Yoon;Yunho Oh
Deep Neural Network (DNN) training demands large memory capacities that exceed the limits of current GPU onboard memory. Expanding GPU memory with SSDs is a cost-effective approach. However, the low bandwidth of SSDs introduces severe performance bottlenecks in data management, particularly for Unified Virtual Memory (UVM)-based systems. The default on-demand migration mechanism in UVM causes frequent page faults and stalls, exacerbated by memory oversubscription and eviction processes along the critical path. To address these challenges, this paper proposes Memory Oversubscription-aware Scheduling for Tensor Migration (MOST), a software framework designed to improve data migration in UVM environments. MOST profiles memory access behavior, quantifies the impact of memory oversubscription stalls, and schedules tensor migrations to minimize overall training time. With the profiling results, MOST executes newly designed pre-eviction and prefetching instructions within DNN kernel code. MOST effectively selects and migrates tensors that can mitigate memory oversubscription stalls, thus reducing training time. Our evaluation shows that MOST achieves an average speedup of 22.9% and 12.8% over state-of-the-art techniques, DeepUM and G10, respectively.
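The profile-guided pre-eviction idea can be sketched as a Belady-style choice over a known kernel schedule (hypothetical names and access plan; MOST's actual scheduler also accounts for migration bandwidth and stall costs):

```python
def next_use(tensor, t, plan):
    """Index of the next kernel (>= t) that touches `tensor`, or infinity."""
    for i in range(t, len(plan)):
        if tensor in plan[i]:
            return i
    return float("inf")

def pick_evictions(resident, t, plan, need):
    """Pre-evict the `need` resident tensors whose next use is furthest away,
    so eviction never lands on the critical path of an imminent kernel."""
    ranked = sorted(resident, key=lambda x: next_use(x, t, plan), reverse=True)
    return ranked[:need]

# Profiled plan: the set of tensors each kernel touches, in execution order.
plan = [{"A", "B"}, {"B", "C"}, {"A"}, {"C"}]
victims = pick_evictions(resident={"A", "B", "C"}, t=1, plan=plan, need=1)
# At t=1, "A" is next used at kernel 2, "B" and "C" at kernel 1 -> evict "A".
```

Profiling makes `plan` known ahead of time for DNN training, which is what turns this from an online heuristic into a schedulable decision.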
{"title":"MOST: Memory Oversubscription-Aware Scheduling for Tensor Migration on GPU Unified Storage","authors":"Junsu Kim;Jaebeom Jeon;Jaeyong Park;Sangun Choi;Minseong Gil;Seokin Hong;Gunjae Koo;Myung Kuk Yoon;Yunho Oh","doi":"10.1109/LCA.2025.3580264","DOIUrl":"https://doi.org/10.1109/LCA.2025.3580264","url":null,"abstract":"Deep Neural Network (DNN) training demands large memory capacities that exceed the limits of current GPU onboard memory. Expanding GPU memory with SSDs is a cost-effective approach. However, the low bandwidth of SSDs introduces severe performance bottlenecks in data management, particularly for Unified Virtual Memory (UVM)-based systems. The default on-demand migration mechanism in UVM causes frequent page faults and stalls, exacerbated by memory oversubscription and eviction processes along the critical path. To address these challenges, this paper proposes Memory Oversubscription-aware Scheduling for Tensor Migration (MOST), a software framework designed to improve data migration in UVM environments. MOST profiles memory access behavior and quantifies the impact of memory oversubscription stalls and schedules tensor migrations to minimize overall training time. With the profiling results, MOST executes newly designed pre-eviction and prefetching instructions within DNN kernel code. MOST effectively selects and migrates tensors that can mitigate memory oversubscription stalls, thus reducing training time. 
Our evaluation shows that MOST achieves an average speedup of 22.9% and 12.8% over state-of-the-art techniques, DeepUM and G10, respectively.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"213-216"},"PeriodicalIF":1.4,"publicationDate":"2025-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144680906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Stardust: Scalable and Transferable Workload Mapping for Large AI on Multi-Chiplet Systems
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 Computer Science, Hardware & Architecture · Pub Date: 2025-06-17 · DOI: 10.1109/LCA.2025.3580562
Wencheng Zou;Feiyun Zhao;Nan Wu
Workload partitioning and mapping are critical to optimizing performance in multi-chiplet systems. However, existing approaches struggle with scalability in large search spaces and lack transferability across different workloads. To overcome these limitations, we propose Stardust, a scalable and transferable approach to workload mapping on multi-chiplet systems. Stardust combines learnable graph clustering to downscale computation graphs for efficient partitioning, topology-masked attention to capture structural information, and deep reinforcement learning (DRL) for optimized workload mapping. Evaluations on production-scale AI models show that (1) Stardust-generated mappings significantly outperform commonly used heuristics in throughput, and (2) fine-tuning a pre-trained Stardust model improves sample efficiency by up to 15× compared to training from scratch.
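One step of the downscale-then-map pipeline, greedily clustering heavily-communicating operators so the mapping problem shrinks, might look like this (a sketch with invented operator names; Stardust's learnable clustering and DRL mapper are not reproduced here):

```python
def coarsen(nodes, edges):
    """edges: {(u, v): traffic}. Greedily merge each node with its
    heaviest-communicating neighbor, one matching pass (heavy-edge matching)."""
    merged, clusters = set(), []
    for (u, v), _ in sorted(edges.items(), key=lambda e: -e[1]):
        if u not in merged and v not in merged:
            merged |= {u, v}
            clusters.append({u, v})
    # Nodes left unmatched become singleton clusters.
    clusters += [{n} for n in nodes if n not in merged]
    return clusters

nodes = ["conv1", "conv2", "fc", "softmax"]
edges = {("conv1", "conv2"): 90, ("conv2", "fc"): 40, ("fc", "softmax"): 60}
clusters = coarsen(nodes, edges)
# The 90- and 60-traffic pairs merge; the 40-traffic edge crosses clusters.
```

Halving the node count this way quadratically shrinks the placement search space that the downstream mapper (a DRL policy, in the paper) must explore.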
{"title":"Stardust: Scalable and Transferable Workload Mapping for Large AI on Multi-Chiplet Systems","authors":"Wencheng Zou;Feiyun Zhao;Nan Wu","doi":"10.1109/LCA.2025.3580562","DOIUrl":"https://doi.org/10.1109/LCA.2025.3580562","url":null,"abstract":"Workload partitioning and mapping are critical to optimizing performance in multi-chiplet systems. However, existing approaches struggle with scalability in large search spaces and lack transferability across different workloads. To overcome these limitations, we propose <sc>Stardust</small>, a <underline>s</u>calable and <underline>t</u>r<underline>a</u>nsfe<underline>r</u>able workloa<underline>d</u> mapping on m<underline>u</u>lti-chiplet sy<underline>st</u>ems. <sc>Stardust</small> combines learnable graph clustering to downscale computation graphs for efficient partitioning, topology-masked attention to capture structural information, and deep reinforcement learning (DRL) for optimized workload mapping. Evaluations on production-scale AI models show that (1) <sc>Stardust</small>-generated mappings significantly outperform commonly used heuristics in throughput, and (2) fine-tuning a pre-trained <sc>Stardust</small> model improves sample efficiency by up to 15× compared to training from scratch.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"201-204"},"PeriodicalIF":1.4,"publicationDate":"2025-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144623874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
pNet-gem5: Full-System Simulation With High-Performance Networking Enabled by Parallel Network Packet Processing
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 Computer Science, Hardware & Architecture · Pub Date: 2025-06-06 · DOI: 10.1109/LCA.2025.3577232
Jongmin Shin;Seongtae Bang;Gyeongseo Park;Daehoon Kim
Modern server processors in data centers equipped with high-performance networking technologies (e.g., 100 Gigabit Ethernet) commonly support parallel packet processing via multi-queue NICs, enabling multiple cores to efficiently handle massive traffic loads. However, existing architectural simulators such as gem5 lack support for these techniques and suffer from limited bandwidth due to outdated networking models. Although a recent study introduced a simulation framework supporting userspace high-performance networking via the Data Plane Development Kit (DPDK), many applications still rely on kernel-based networking. To address these limitations, we present pNet-gem5, a full-system simulation framework designed to model server systems under high-performance network workloads, targeting data center architecture research. pNet-gem5 extends gem5 by supporting parallel packet processing on multi-core systems through the integration of multiple hardware queues and a more advanced interrupt mechanism—Message Signaled Interrupts (MSI)—which allows each NIC queue to be mapped to a dedicated core with its own IRQ. It also provides a high-performance network interface and device driver that support scalable and configurable packet distribution between hardware and software. Moreover, by decoupling packet distribution and scheduling from NIC core logic, pNet-gem5 enables flexible experimentation with custom policies. As a result, pNet-gem5 enables more realistic simulation of modern server environments by modeling multi-queue NICs and supporting bandwidths up to 46 Gbps—a significant improvement over the previous limit of only a few Gbps and more closely aligned with today’s tens-of-Gbps networks.
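The multi-queue-plus-MSI model can be miniaturized as follows (a toy sketch: queue-to-core pinning is shown as an identity map and steering as a modulo hash; pNet-gem5's device model is of course far more detailed):

```python
class MultiQueueNIC:
    """Each RX queue has its own MSI vector pinned to one core, so packets
    steered to different queues are processed in parallel rather than
    funneled through a single interrupt line."""

    def __init__(self, n_queues):
        self.queues = [[] for _ in range(n_queues)]
        self.msi_target = {q: q for q in range(n_queues)}   # queue -> pinned core

    def receive(self, pkt, flow_hash):
        q = flow_hash % len(self.queues)    # RSS-style steering (simplified)
        self.queues[q].append(pkt)
        return self.msi_target[q]           # core whose MSI vector fires

nic = MultiQueueNIC(n_queues=4)
core_a = nic.receive("pkt0", flow_hash=5)   # steered to queue 1 -> core 1
core_b = nic.receive("pkt1", flow_hash=6)   # steered to queue 2 -> core 2
```

Decoupling the steering function from the queue/interrupt plumbing, as here, is what lets the simulator swap in custom distribution policies.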
Jongmin Shin; Seongtae Bang; Gyeongseo Park; Daehoon Kim, "pNet-gem5: Full-System Simulation With High-Performance Networking Enabled by Parallel Network Packet Processing," IEEE Computer Architecture Letters, vol. 24, no. 2, pp. 193-196, 2025. DOI: 10.1109/LCA.2025.3577232
The Architectural Sustainability Indicator
IF 1.4 CAS Tier 3 (Computer Science) Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2025-06-05 DOI: 10.1109/LCA.2025.3576891
Jaime Roelandts;Ajeya Naithani;Lieven Eeckhout
Computing devices are responsible for a significant fraction of the world’s total carbon footprint. Designing sustainable systems is a challenging endeavor because of the huge design space, the complex objective function, and the inherent data uncertainty. To make matters worse, a design that seems sustainable at first might turn out not to be once rebound effects are taken into account. In this paper, we propose the Architectural Sustainability Indicator (ASI), a novel metric to assess the sustainability of a given design and determine whether it is strongly, weakly, or unsustainable. ASI provides insight and hints for turning unsustainable and weakly sustainable design points into strongly sustainable ones that are robust against potential rebound effects. A case study illustrates how ASI steers Scalar Vector Runahead, a weakly sustainable hardware prefetching technique, into a strongly sustainable one while offering a 3.2× performance boost.
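The strong/weak/unsustainable split can be illustrated with a toy classifier: a design is robust if its footprint beats the baseline even after a rebound factor inflates usage, and only weakly sustainable if it wins without rebound alone. This is a simplified illustration under assumed semantics, not the paper's actual ASI formulation; `classify_design` and `rebound_factor` are hypothetical names.

```python
def classify_design(carbon_new: float, carbon_base: float,
                    rebound_factor: float) -> str:
    """Toy sustainability classifier.

    carbon_new / carbon_base: total footprint of the new design and the
    baseline, in the same units. rebound_factor (>= 1.0) models extra
    usage induced by improved performance or efficiency.
    """
    # Strong: still below baseline even when rebound inflates usage.
    if carbon_new * rebound_factor < carbon_base:
        return "strongly sustainable"
    # Weak: below baseline only if rebound effects do not materialize.
    if carbon_new < carbon_base:
        return "weakly sustainable"
    return "unsustainable"
```

For example, a design at 80% of the baseline footprint survives a 10% rebound, whereas one at 95% does not survive a 20% rebound and is only weakly sustainable.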
Jaime Roelandts; Ajeya Naithani; Lieven Eeckhout, "The Architectural Sustainability Indicator," IEEE Computer Architecture Letters, vol. 24, no. 2, pp. 205-208, 2025. DOI: 10.1109/LCA.2025.3576891