
Latest Publications in IEEE Computer Architecture Letters

CrossFetch: A Prefetching Scheme for Cross-Page Prefetching in the Physical Address Space
IF 1.4, CAS Tier 3 (Computer Science), Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2025-12-08. DOI: 10.1109/LCA.2025.3640965
Qi Shao;Per Stenstrom
Prefetching is an important technique to reduce the miss penalty in deep memory hierarchies, employing multiple levels of cache and memory. Unfortunately, state-of-the-art techniques avoid prefetching across page boundaries in physically addressed memory because contiguous virtual pages are not guaranteed to map to contiguous physical pages. Apart from low accuracy, prefetching across page boundaries can break protection domains, opening up security vulnerabilities. This paper proposes CrossFetch — the first prefetching technique that accurately and securely prefetches data across physical page boundaries. It uses a simple and novel translation mechanism that combines a conventional TLB, called forward TLB (FTLB), with a reverse TLB (RTLB) that provides mappings of physical pages to virtual. CrossFetch leverages a conventional Page Table Walker invoked by a conventional TLB to load mappings into the FTLB and RTLB. The paper demonstrates how CrossFetch can hide far-memory misses in hybrid main-memory systems and last-level cache misses. We show that CrossFetch can improve IPC by 5.7% (up to 27.7%) compared to intra-page prefetchers on SPEC2017 benchmarks where the tolerance of far-memory misses dominates.
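The abstract only names the FTLB/RTLB pair, so here is a rough illustration of why a reverse mapping makes cross-page prefetching possible: the physical miss address is mapped back to a virtual address, the stride is applied in virtual space, and the resulting virtual page is translated forward again. This is a toy Python model under assumed 4 KiB pages; the function and table names are hypothetical and not taken from the paper.

```python
# Toy model of CrossFetch-style cross-page prefetch target resolution.
# Both TLBs are modeled as plain dicts; this is an illustrative sketch only.
PAGE = 4096

ftlb = {0x1000: 0x7000, 0x2000: 0x3000}   # virtual page -> physical page
rtlb = {p: v for v, p in ftlb.items()}    # physical page -> virtual page (reverse TLB)

def cross_page_prefetch_target(phys_addr, stride):
    """Translate a physical miss address to the physical address `stride`
    bytes ahead in the *virtual* address space, even across a page boundary."""
    ppage, offset = phys_addr & ~(PAGE - 1), phys_addr & (PAGE - 1)
    vpage = rtlb.get(ppage)                       # physical -> virtual via RTLB
    if vpage is None:
        return None                               # no cached mapping: do not prefetch
    vtarget = vpage + offset + stride             # next access in virtual space
    next_ppage = ftlb.get(vtarget & ~(PAGE - 1))  # virtual -> physical via FTLB
    if next_ppage is None:
        return None                               # target page unmapped: stay safe
    return next_ppage + (vtarget & (PAGE - 1))

# A stride that crosses from virtual page 0x1000 into 0x2000 resolves to physical
# page 0x3000 rather than the (unrelated) physically adjacent page.
print(hex(cross_page_prefetch_target(0x7000 + 4000, 256)))  # 0x30a0
```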
Citations: 0
LeakDiT: Diffusion Transformers for Trace-Augmented Side-Channel Analysis
IF 1.4, CAS Tier 3 (Computer Science), Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2025-12-02. DOI: 10.1109/LCA.2025.3639372
Insup Lee;Daehyeon Bae;Seokhie Hong;Sangjin Lee
Deep learning has been extensively used in side-channel analysis (SCA), making trace data insufficiency and class imbalance a critical challenge. Although several studies have explored trace augmentation with generative models, two core limitations remain: (i) insufficient integration of SCA domain knowledge into the models and (ii) limited adoption of state-of-the-art diffusion transformers (DiT). This letter presents LeakDiT, a domain-specific one-dimensional DiT that generates high-quality traces. LeakDiT introduces a loss based on normalized inter-class variance (NICV) to produce realistic traces that preserve the leakage structure. Experimental results demonstrate that LeakDiT improves SCA performance and reduces the number of required traces for key recovery.
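NICV is a standard side-channel leakage metric: the variance of the per-class mean traces divided by the total variance at each time sample. As a minimal sketch of the quantity the abstract's loss term is built around (not LeakDiT's actual loss code, and assuming roughly balanced classes), it can be computed in NumPy as follows.

```python
import numpy as np

def nicv(traces, labels):
    """Normalized inter-class variance per time sample.

    traces: (n_traces, n_samples) array of side-channel measurements.
    labels: (n_traces,) array of class labels (e.g., S-box output values).
    Returns Var_y(E[X | Y=y]) / Var(X) for each sample point (balanced classes assumed).
    """
    classes = np.unique(labels)
    class_means = np.stack([traces[labels == c].mean(axis=0) for c in classes])
    return class_means.var(axis=0) / traces.var(axis=0)

# Example: the sample that leaks the label scores far higher than pure-noise samples.
rng = np.random.default_rng(0)
y = rng.integers(0, 4, size=2000)
x = rng.normal(size=(2000, 3))
x[:, 1] += y                         # sample 1 carries the class-dependent signal
print(nicv(x, y).round(2))
```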
Citations: 0
Another Mirage of Breaking MIRAGE: Debunking Occupancy-Based Side-Channel Attacks on Fully Associative Randomized Caches
IF 1.4, CAS Tier 3 (Computer Science), Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2025-11-27. DOI: 10.1109/LCA.2025.3638260
Chris Cao;Gururaj Saileshwar
A recent work presented at USENIX Security 2025, Systematic Evaluation of Randomized Cache Designs against Cache Occupancy (RCO), claims that cache-occupancy-based side-channel attacks can recover AES keys from the MIRAGE randomized cache. In this paper, we examine these claims and find that they arise from a flawed modeling of randomized caches in RCO. Critically, we find that the security properties of randomized caches strongly depend on the seeding methodology used to initialize random number generators (RNG) used in these caches. RCO’s modeling uses a constant seed to initialize the cache RNGs for each simulated AES encryption, causing every simulated AES encryption to artificially evict the same sequence of cache lines. This departs from accurate modeling of such randomized caches, where eviction sequences vary randomly for each program execution. We observe that an accurate modeling of such randomized caches, where the RNG seed is randomized in each simulation, causes correlations between AES T-table accesses and attacker observations to disappear, and the attack to fail. These findings show that the previously claimed leakages are due to flawed modeling and that with correct modeling, MIRAGE does not leak AES keys via occupancy based side-channels.
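The methodological point, that a constant RNG seed makes every simulated encryption evict the same "random" line sequence, is easy to reproduce in a few lines of Python. The cache model below is a deliberately crude stand-in, not RCO's or MIRAGE's code.

```python
import random

def evicted_lines(seed, n_evictions=5, n_lines=2**14):
    """Stand-in for a randomized cache: the sequence of (pseudo)randomly chosen
    victim lines during one simulated encryption."""
    rng = random.Random(seed)
    return [rng.randrange(n_lines) for _ in range(n_evictions)]

# Flawed modeling: the RNG is re-seeded with the same constant for every simulated
# encryption, so the "random" victim sequence is identical each time and spurious
# correlations with the victim's accesses can appear.
print(evicted_lines(seed=42) == evicted_lines(seed=42))      # True

# Accurate modeling: a fresh seed per simulated run makes the victim sequences
# differ, and the occupancy signal the attack relies on averages away.
print(evicted_lines(seed=None) == evicted_lines(seed=None))  # almost surely False
```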
Citations: 0
Fusing Adds and Shifts for Efficient Dot Products
IF 1.4, CAS Tier 3 (Computer Science), Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2025-11-26. DOI: 10.1109/LCA.2025.3637718
Pavel Golikov;Karthik Ganesan;Gennady Pekhimenko;Mark C. Jeffrey
Dot products are heavily used in applications like graphics, signal processing, navigation, and artificial intelligence (AI). These AI models in particular impose significant computational demands on modern computers. Current accelerators typically implement dot product hardware as a row of multipliers followed by a tree of adders. However, treating multiplication and summation operations separately leads to sub-optimal hardware. In contrast, we obtain significant area savings by considering the dot product operation as a whole. We propose FASED, which fuses components of a Booth multiplier with the adder tree to eliminate a significant portion of full adders from a baseline INT8×INT8,4,2 design. Compared to popular dot product hardware units, FASED reduces area by up to 1.9×.
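As a rough arithmetic illustration of the fusion idea (all Booth partial products from every multiplication are poured into one shared accumulation rather than through per-product adders), here is a radix-4 Booth recoding of each weight followed by a single running sum. This is a behavioural sketch only; FASED itself is a hardware datapath, and the function names are illustrative.

```python
def booth_radix4(w, bits=8):
    """Radix-4 Booth digits d_i in {-2,-1,0,1,2} with w == sum(d_i * 4**i)
    for a `bits`-bit two's-complement weight."""
    padded = (w & ((1 << bits) - 1)) << 1        # append the implicit bit y_{-1} = 0
    digits = []
    for i in range(0, bits, 2):
        b0 = (padded >> 2 * i) & 1               # y_{2i-1}
        b1 = (padded >> (2 * i + 1)) & 1         # y_{2i}
        b2 = (padded >> (2 * i + 2)) & 1         # y_{2i+1}
        digits.append(b0 + b1 - 2 * b2)
    return digits

def fused_dot(acts, weights, bits=8):
    """Dot product computed by accumulating *all* shifted Booth partial products
    from every multiplication in one accumulator (modeling a shared adder tree)."""
    acc = 0
    for a, w in zip(acts, weights):
        for i, d in enumerate(booth_radix4(w, bits)):
            acc += (d * a) << (2 * i)            # partial product, weighted by 4**i
    return acc

acts, weights = [3, -7, 12, 5], [25, -128, 64, -1]
assert fused_dot(acts, weights) == sum(a * w for a, w in zip(acts, weights))
```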
Citations: 0
Disaggregated Speculative Decoding for Carbon-Efficient LLM Serving
IF 1.4, CAS Tier 3 (Computer Science), Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2025-11-06. DOI: 10.1109/LCA.2025.3630094
Tianyao Shi;Yanran Wu;Sihang Liu;Yi Ding
Large language models (LLMs) are increasingly deployed in practice but incur significant computational costs and environmental impacts. Disaggregated serving techniques, particularly decoupling prefill and decoding (DPD) across GPUs, have been introduced to improve performance and reduce carbon emissions. However, DPD suffers from high bandwidth overhead due to frequent large KV cache transfers. To address this, we present disaggregated speculative decoding (DSD), which leverages speculative decoding by assigning draft models to older GPUs and target models to newer GPUs, requiring only token and probability distribution transfers. Building on this insight, we introduce GreenLLM, an SLO- and bandwidth-aware framework that unifies DPD and DSD, profiles workload characteristics, and dynamically selects the most carbon-efficient configuration. Across diverse benchmarks, GreenLLM reduces carbon emissions by up to 40.6% while meeting latency SLOs.
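To make the key contrast concrete: in DSD only token ids and proposal distributions cross the device boundary, not a KV cache. The sketch below shows one speculative round with the standard accept/resample rule; `draft_step` and `target_probs` are toy stand-ins, not GreenLLM's API, and the split across GPUs is indicated only by comments.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 32

def draft_step(ctx):                      # stand-in draft model: next-token distribution
    return rng.dirichlet(np.ones(VOCAB))

def target_probs(ctx, k):                 # stand-in target model: k+1 distributions at once
    return rng.dirichlet(np.ones(VOCAB), size=k + 1)

def dsd_round(ctx, k=4):
    """One DSD round: draft proposes k tokens, target accepts a prefix of them.
    Only `toks` and the k proposal distributions need to cross the link."""
    toks, qs = [], []
    for _ in range(k):                    # runs on the (older) draft GPU
        q = draft_step(ctx + toks)
        t = int(rng.choice(VOCAB, p=q))
        toks.append(t); qs.append(q)
    p = target_probs(ctx + toks, k)       # runs on the (newer) target GPU
    out = []
    for i, (t, q) in enumerate(zip(toks, qs)):
        if rng.random() < min(1.0, p[i][t] / q[t]):
            out.append(t)                 # draft token accepted
        else:
            resid = np.maximum(p[i] - q, 0.0)     # standard rejection-sampling correction
            out.append(int(rng.choice(VOCAB, p=resid / resid.sum())))
            return out                    # stop the round at the first rejection
    out.append(int(np.argmax(p[k])))      # bonus token when all k drafts pass
    return out

print(dsd_round(ctx=[1, 2, 3]))
```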
Citations: 0
Enhancing DCIM Efficiency with Multi-Storage-Row Architecture for Edge AI Workloads
IF 1.4, CAS Tier 3 (Computer Science), Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2025-11-05. DOI: 10.1109/LCA.2025.3629390
Xiaoyu Sun;Haruki Mori;Wei-Chang Zhao;Je-Min Hung;Hidehiro Fujiwara;Brian Crafton;Bo Zhang;Win-San Khwa;Yu-Der Chih;Tsung-Yung Jonathan Chang;Kerem Akarvardar
Digital Compute-in-Memory (DCIM) has emerged as a promising solution for accelerating Artificial Intelligence (AI) workloads, especially at the edge. However, DCIM macros may lead to inefficiencies due to frequent weight reloads associated with the block-wise (tiled) computation (BWC) scheme, required due to limited accumulator buffer capacity. To address this challenge, we leverage a multi-storage-row (MSR) architecture that integrates multiple storage cells per multiplier within a DCIM macro. This approach transforms costly weight buffer accesses into highly efficient local row-switching (RS) operations, significantly reducing weight reloading overhead and enhancing accelerator performance. Using technology parameters extracted from our 3 nm SRAM-based MSR-DCIM testchip, coupled with architectural statistics from our in-house analytical framework, we evaluated power, performance, and area (PPA) across several Deep Neural Network (DNN) workloads. Our findings on four edge DNN workloads, including convolutional neural networks (CNNs) and large language models (LLMs), demonstrate that the MSR architecture facilitates BWC-based execution with reduced latency and energy consumption compared to a baseline DCIM macro. A case study reveals that an MSR-DCIM macro with 16 storage rows achieves 15%–55% savings in energy-delay product (EDP) across the selected workloads, underscoring its potential for efficient edge AI acceleration.
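A back-of-the-envelope way to see where the saving comes from: weight tiles that fit in the R storage rows are loaded once and reused via cheap row switches, while the rest must be re-fetched on every reuse pass. The function below is a hypothetical first-order count, not a model drawn from the paper.

```python
def weight_buffer_loads(total_tiles, reuse_passes, storage_rows):
    """First-order estimate of weight-tile fetches into a DCIM macro.

    total_tiles:  weight tiles the layer needs overall
    reuse_passes: how many times each tile is revisited during tiled (BWC) execution
    storage_rows: storage cells per multiplier (R); up to R tiles stay resident
    """
    resident = min(storage_rows, total_tiles)          # served by local row switching
    return resident + (total_tiles - resident) * reuse_passes

# 64 weight tiles revisited 8 times: 505 fetches with R=1, 400 with R=16,
# and just 64 once every tile fits on-macro.
print(weight_buffer_loads(64, 8, 1),
      weight_buffer_loads(64, 8, 16),
      weight_buffer_loads(64, 8, 64))
```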
Citations: 0
I/O-ETEM: An I/O-Aware Approach for Estimating Execution Time of Machine Learning Workloads
IF 1.4, CAS Tier 3 (Computer Science), Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2025-11-04. DOI: 10.1109/LCA.2025.3620629
Elham Adibi;Mohammadamin Ajdari;Pouria Arefijamal;Amirsaeed Ahmadi-Tonekaboni;Hossein Asadi
Training Machine Learning (ML) models commonly relies on High-Performance Computing (HPC) centers or cloud servers that accommodate compute nodes with powerful resources. Users tend to enhance the accuracy of ML applications by continuous training after model refinements and dataset size increases. With such application changes, and the heterogeneity of HPC nodes, knowing each job's execution time in advance (i.e., prediction) is necessary for efficient job scheduling. We observe that I/O accesses strongly influence the execution time of modern ML applications. Unfortunately, existing studies on estimating job execution time either (a) rely on overestimated user-declared time, or (b) predict execution time mainly based on compute resources (ignoring I/O or storage effects), and use complex deep learning models for this purpose. In this paper, we propose a simple yet effective method for predicting the execution time of ML training. Our approach explicitly accounts for I/O accesses as a critical factor. Our method combines (a) partial application execution & monitoring, (b) analytical modeling leveraging ML application characteristics, (c) dynamic re-estimation, and (d) simplified history-based analysis. Our evaluation on a number of Convolutional Neural Networks (CNNs) and Transformer models shows that our proposed method predicts the execution time accurately (i.e., with error less than 8% for most cases) compared to actual execution.
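The recipe described here (time a few real iterations, extrapolate, and keep re-estimating as the job progresses) can be sketched as a simple partial-execution estimator. The function below is an illustrative stand-in, not the authors' model; I/O is captured only implicitly because the timed epochs include data loading and checkpoint writes.

```python
import time

def estimate_training_time(run_epoch, total_epochs, probe_epochs=2):
    """Partial-execution estimator: time a few real epochs (compute + I/O included),
    extrapolate linearly, and refine the estimate after every finished epoch."""
    measured = []
    for epoch in range(total_epochs):
        start = time.perf_counter()
        run_epoch(epoch)                   # real work: data loading, fwd/bwd, checkpoint I/O
        measured.append(time.perf_counter() - start)
        if epoch + 1 >= probe_epochs:      # start reporting once the probe phase is done
            per_epoch = sum(measured) / len(measured)
            remaining = (total_epochs - epoch - 1) * per_epoch
            yield epoch, remaining         # dynamic re-estimation

# Usage with a dummy epoch function standing in for an actual training loop:
for epoch, eta in estimate_training_time(lambda e: time.sleep(0.01), total_epochs=5):
    print(f"after epoch {epoch}: ~{eta:.3f}s left")
```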
Citations: 0
Rethinking In-Memory Hash Table Design for CXL-Based Main Memory Compression
IF 1.4, CAS Tier 3 (Computer Science), Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2025-11-04. DOI: 10.1109/LCA.2025.3628805
Teresa Zhang
Main memory compression has re-emerged as a viable solution to DRAM cost challenges, enabled by CXL memory devices that support hardware-managed, coarse-grained compression. This shift decouples logical memory space usage from physical consumption, opening new opportunities to re-think the design and implementation of in-memory data structures. This letter presents case studies on blocked variants of chained and Cuckoo hashing that align naturally with the memory compression granularity. Targeting large-scale in-memory data stores, blocked hash tables trade logical sparsity for implementation efficiency while leveraging compression to avoid physical memory waste. We develop mathematical formulations to enable theoretical analysis of key metrics, including memory usage saving and operational throughput. Our results highlight the potential of compression-aware data structures to better leverage modern memory hierarchies and motivate further exploration in this largely untapped design space.
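As a sketch of what a "blocked" chained table can look like when buckets are sized to the compressor's granularity: each bucket is a fixed-capacity block, deliberately left sparse (mostly-empty slots compress to little physical memory), with overflow spilling into chained blocks. The block size, class names, and layout below are illustrative assumptions, not the letter's design.

```python
BLOCK_SLOTS = 8            # e.g., slots per compressible block (illustrative)

class Block:
    __slots__ = ("slots", "next")
    def __init__(self):
        self.slots = [None] * BLOCK_SLOTS   # (key, value) pairs or None
        self.next = None                    # overflow chain: one pointer per block

class BlockedChainedTable:
    """Chained hash table whose buckets are fixed-size blocks; logical sparsity
    is cheap when the memory device compresses mostly-empty blocks."""
    def __init__(self, n_buckets=1024):
        self.buckets = [Block() for _ in range(n_buckets)]

    def insert(self, key, value):
        block = self.buckets[hash(key) % len(self.buckets)]
        while True:
            for i, slot in enumerate(block.slots):
                if slot is None or slot[0] == key:
                    block.slots[i] = (key, value)
                    return
            if block.next is None:          # block full: chain a new one (rare if sparse)
                block.next = Block()
            block = block.next

    def get(self, key, default=None):
        block = self.buckets[hash(key) % len(self.buckets)]
        while block is not None:
            for slot in block.slots:
                if slot is not None and slot[0] == key:
                    return slot[1]
            block = block.next
        return default

t = BlockedChainedTable()
t.insert("cacheline", 42)
print(t.get("cacheline"))   # 42
```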
Citations: 0
LLMServingSim2.0: A Unified Simulator for Heterogeneous Hardware and Serving Techniques in LLM Infrastructure
IF 1.4, CAS Tier 3 (Computer Science), Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2025-11-03. DOI: 10.1109/LCA.2025.3628325
Jaehong Cho;Hyunmin Choi;Jongse Park
This letter introduces LLMServingSim2.0, a system simulator designed for exploring heterogeneous hardware in large-scale LLM serving systems. LLMServingSim2.0 addresses two key limitations of its predecessor: (1) integrating hardware models into system-level simulators is non-trivial due to the lack of a clear abstraction, and (2) existing simulators support only a narrow subset of serving techniques, leaving no infrastructure that captures the breadth of approaches in modern LLM serving. To overcome these issues, LLMServingSim2.0 adopts trace-driven performance modeling, accompanied by an operator-level latency profiler, enabling the integration of new accelerators with a single command. It further embeds up-to-date serving techniques while exposing flexible interfaces for request routing, cache management, and scheduling policies. In a TPU case study, our profiler requires 18.5× fewer LoC and outperforms the predecessor’s hardware-simulator integration, demonstrating LLMServingSim2.0’s low-effort hardware extensibility. Our experiments further show that LLMServingSim2.0 reproduces GPU-based LLM serving with 1.9% error, while maintaining practical simulation time, making it a comprehensive platform for both hardware developers and LLM service providers.
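Trace-driven performance modeling in the sense described here reduces to replaying an operator trace against a profiled latency table, which is why adding a new accelerator only requires a new profile. The sketch below shows that lookup-and-sum structure with made-up keys and values; it is not LLMServingSim2.0's interface.

```python
# Trace-driven latency model: operator latencies come from an offline profile of the
# target accelerator. All keys and numbers below are hypothetical placeholders.
profile_ns = {                        # (operator, shape) -> measured latency in ns
    ("matmul",    (1, 4096, 4096)): 120_000,
    ("softmax",   (1, 32, 2048)):     8_000,
    ("allreduce", (4096,)):          40_000,
}

def replay(trace, overlap_comm=False):
    """Sum profiled latencies for one request's operator trace; optionally overlap
    communication with compute instead of serializing the two."""
    compute = sum(profile_ns[(op, shape)] for op, shape in trace if op != "allreduce")
    comm    = sum(profile_ns[(op, shape)] for op, shape in trace if op == "allreduce")
    return max(compute, comm) if overlap_comm else compute + comm

decode_step = [("matmul", (1, 4096, 4096)), ("softmax", (1, 32, 2048)),
               ("allreduce", (4096,))]
print(replay(decode_step), replay(decode_step, overlap_comm=True))  # 168000 128000
```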
Citations: 0
StreamDQ: HBM-Integrated On-the-Fly DeQuantization via Memory Load for Large Language Models
IF 1.4, CAS Tier 3 (Computer Science), Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2025-10-31. DOI: 10.1109/LCA.2025.3626929
Minki Jeong;Daegun Yoon;Soohong Ahn;Seungyong Lee;Jooyoung Kim;Jinuk Jeon;Joonseop Sim;Youngpyo Joo;Hoshik Kim
As large language models (LLMs) scale, their memory and computation demands increase, making weight-only quantization a widely adopted technique to reduce memory footprint with minimal accuracy loss. However, CUDA core–based dequantization introduces significant instruction overhead, memory traffic, and pipeline stalls across all batch sizes, and critically remains a persistent bottleneck even in large-batch, cloud-scale LLM serving. To address these challenges, we propose StreamDQ, a lightweight architectural enhancement for cloud-scale LLM inference that enables on-the-fly dequantization within the memory subsystem by integrating compact DeQuantization Blocks (DQBs) into the base-die of high-bandwidth memory (HBM). StreamDQ leverages reserved address bits for control signaling, requiring no modifications to the ISA or compiler. Our design minimizes data movement, offloads computation from CUDA cores, and delivers dequantized weights directly to tensor cores for general matrix multiplication (GEMM) execution. StreamDQ achieves up to 83.57% reduction in inference latency and 5.15× improvement in tokens-per-second throughput, with only 0.013 mm² area and 0.17 W power overhead per DQB. The design is scalable, software-transparent, and well-suited for high-throughput LLM inference on modern HBM-enabled GPU platforms.
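Weight-only quantization schemes of the kind StreamDQ targets typically store low-bit integers plus a per-group scale, so the dequantization work being moved into the HBM base die is, functionally, the few lines below. This is a NumPy stand-in for what a DQB would do in hardware; the group size, packing layout, and function name are assumptions, not details from the paper.

```python
import numpy as np

GROUP = 128   # weights sharing one scale (a common weight-only quantization choice)

def dequantize_int4(packed, scales):
    """Functional model of on-the-fly dequantization: unpack two signed 4-bit
    weights per byte and multiply by the per-group FP16 scale."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    lo[lo > 7] -= 16                              # sign-extend the 4-bit values
    hi[hi > 7] -= 16
    w = np.empty(packed.size * 2, dtype=np.float16)
    w[0::2], w[1::2] = lo, hi                     # interleave low/high nibbles
    return w * np.repeat(scales, GROUP).astype(np.float16)

packed = np.random.randint(0, 256, size=GROUP, dtype=np.uint8)   # 256 int4 weights
scales = np.array([0.02, 0.05], dtype=np.float16)                # one scale per group
print(dequantize_int4(packed, scales).shape)                     # (256,)
```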
Citations: 0