
Latest Publications in IEEE Computer Architecture Letters

UniCNet: Unified Cycle-Accurate Simulation for Composable Chiplet Network With Modular Design-Integration Workflow
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2026-01-13 · DOI: 10.1109/LCA.2026.3653809
Peilin Wang;Mingyu Wang;Zhirong Ye;Tao Lu;Zhiyi Yu
Composable chiplet-based architectures adopt a two-stage design flow: first chiplet design, then modular integration, while still presenting a shared-memory, single-OS system view. However, the heterogeneous and modular nature of the resulting network may introduce performance inefficiencies and functional correctness issues, calling for an advanced simulation tool. In this paper, we introduce UniCNet, a unified and cycle-accurate network simulator designed for composable chiplet-based architectures, which is open-sourced. To achieve both accurate modeling and unified simulation, UniCNet employs a design-integration workflow that closely aligns with the composable chiplet design flow. It supports several key features oriented towards composable chiplet scenarios and introduces the first cycle-level chiplet protocol interface model. UniCNet further supports multi-threaded simulation, achieving up to 4× speedup with no cycle accuracy loss. We validate UniCNet against RTL models and demonstrate its utility via several case studies.
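To make the two-stage flow concrete, the following minimal Python sketch mocks up a design-integration setup in which chiplets are configured in isolation and then composed through an inter-chiplet protocol interface. All class and field names are illustrative assumptions, not UniCNet's actual API.

```python
# Hypothetical sketch of a two-stage "design-integration" simulator setup.
# ChipletDesign, InterChipletLink, and integrate are illustrative names only.
from dataclasses import dataclass


@dataclass
class ChipletDesign:
    """Stage 1: each chiplet's network is designed and configured in isolation."""
    name: str
    topology: str          # e.g. "mesh" or "ring"
    routers: int
    flit_width_bits: int


@dataclass
class InterChipletLink:
    """Stage 2: integration wires chiplets together through a protocol interface."""
    src: str
    dst: str
    interface: str         # e.g. a UCIe-like die-to-die protocol, modeled cycle by cycle
    latency_cycles: int


def integrate(chiplets, links):
    """Build one system-level network description from per-chiplet designs."""
    by_name = {c.name: c for c in chiplets}
    for link in links:
        assert link.src in by_name and link.dst in by_name, "link references unknown chiplet"
    return {"chiplets": by_name, "links": links}


if __name__ == "__main__":
    cpu = ChipletDesign("cpu0", topology="mesh", routers=16, flit_width_bits=128)
    io = ChipletDesign("io0", topology="ring", routers=4, flit_width_bits=64)
    system = integrate([cpu, io],
                       [InterChipletLink("cpu0", "io0", "UCIe-like", latency_cycles=8)])
    print(f"{len(system['chiplets'])} chiplets, {len(system['links'])} inter-chiplet links")
```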
Citations: 0
LACIN: Linearly Arranged Complete Interconnection Networks
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-12-30 · DOI: 10.1109/LCA.2025.3649284
Ramón Beivide;Cristóbal Camarero;Carmen Martínez;Enrique Vallejo;Mateo Valero
Several interconnection networks are based on the complete graph topology. Networks with a moderate size can be based on a single complete graph. However, large-scale networks such as Dragonfly and HyperX use, respectively, a hierarchical or a multi-dimensional composition of complete graphs. The number of links in these networks is huge and grows rapidly with their size. This paper introduces LACIN, a set of complete graph implementations that use identically indexed ports to link switches. This way of implementing the network reduces the complexity of its cabling and its routing. LACIN eases the deployment of networks for parallel computers of different scales, from VLSI systems to the largest supercomputers.
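As a concrete illustration of what identically indexed ports can look like, the sketch below wires a complete graph K_n (odd n) so that the link between switches i and j uses port (i + j) mod n on both ends. This is a classical construction chosen for illustration only; it is not claimed to be LACIN's exact scheme.

```python
# Illustrative wiring of K_n in which both endpoints of every link share the
# same port index (an assumption about "identically indexed ports", not
# necessarily the LACIN construction).
def identically_indexed_ports(n):
    """For odd n, connect switch i's port p to switch (p - i) mod n.

    The link between switches i and j then uses port p = (i + j) mod n on both
    ends; the port with p == 2*i mod n would be a self-loop and is left unused.
    """
    assert n % 2 == 1, "this simple construction assumes an odd switch count"
    wiring = {}  # (switch, port) -> peer switch
    for i in range(n):
        for p in range(n):
            j = (p - i) % n
            if j != i:
                wiring[(i, p)] = j
    return wiring


if __name__ == "__main__":
    n = 7
    w = identically_indexed_ports(n)
    # Symmetry: the peer reaches us back through the same port index.
    assert all(w[(w[(i, p)], p)] == i for (i, p) in w)
    # Completeness: every unordered pair of switches is connected exactly once.
    pairs = {frozenset((i, w[(i, p)])) for (i, p) in w}
    assert len(pairs) == n * (n - 1) // 2
    print(f"K_{n}: {len(pairs)} links, each using one shared port index")
```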
Citations: 0
Inspex: Speculative Execution of Ready-to-Execute Loads in In-Order Cores
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-12-26 · DOI: 10.1109/LCA.2025.3646976
Yotaro Nada;Toru Koizumi;Ryota Shioya;Hidetsugu Irie;Shuichi Sakai
In-order (InO) cores are processor cores that execute instructions in program order. Due to their low complexity, InO cores have been widely used in situations where energy efficiency and small circuit area are required, but they provide limited performance. We focus on stalls in InO cores that are caused by load instructions and their consumer instructions. These stalls significantly degrade performance, accounting for 70% of the total execution time on SPEC CPU 2017. We found that many of these load instructions are ready to execute and could have been issued earlier. Based on this observation, we propose Inspex, which improves the performance of InO cores while maintaining their simplicity. Inspex predicts ready-to-execute loads and speculatively pre-executes them. This makes load results available earlier, removing the stalls caused by load consumers. Our simulation results show that Inspex improved performance by 25.6%, reduced energy consumption by 10.6%, and reduced the energy-delay product by 28.0% compared with a baseline InO core, while incurring an area overhead of 0.65%.
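The following toy Python model illustrates the underlying observation: many loads have address operands that were produced several instructions earlier and are therefore ready to issue early. The readiness criterion and instruction trace are hypothetical, not the paper's simulator.

```python
# Toy illustration of "ready-to-execute" loads: loads whose address registers
# were last written several instructions earlier could be speculatively
# pre-executed (Inspex's idea); the lookahead threshold here is an assumption.
from dataclasses import dataclass


@dataclass
class Inst:
    op: str            # "load" or "alu"
    dst: str
    srcs: tuple


def ready_to_execute_loads(trace, lookahead=4):
    """Flag loads whose source registers were written >= lookahead instructions earlier."""
    last_writer = {}
    ready = []
    for idx, inst in enumerate(trace):
        if inst.op == "load":
            dists = [idx - last_writer.get(r, -10**9) for r in inst.srcs]
            if all(d >= lookahead for d in dists):
                ready.append(idx)
        last_writer[inst.dst] = idx
    return ready


if __name__ == "__main__":
    trace = [
        Inst("alu", "r1", ("r0", "r0")),
        Inst("alu", "r2", ("r1", "r0")),
        Inst("alu", "r3", ("r2", "r1")),
        Inst("alu", "r4", ("r3", "r0")),
        Inst("load", "r5", ("r1",)),      # address was ready 4 instructions earlier
        Inst("alu", "r6", ("r5", "r4")),  # consumer that would otherwise stall on r5
        Inst("load", "r7", ("r6",)),      # address produced just above: not ready
    ]
    print("ready-to-execute loads at indices:", ready_to_execute_loads(trace))
```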
Citations: 0
Enabling Cost-Efficient LLM Inference on Mid-Tier GPUs With NMP DIMMs
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-12-22 · DOI: 10.1109/LCA.2025.3646622
Heewoo Kim;Alan La;Joseph Izraelevitz
Large Language Models (LLMs) require substantial computational resources, making cost-efficient inference challenging. Scaling out with mid-tier GPUs (e.g., NVIDIA A10) appears attractive for LLMs, but our characterization shows that communication bottlenecks prevent them from matching high-end GPUs (e.g., 4 × A100). Using 16 × A10 GPUs, we find the decode stage—dominant in inference runtime—is memory-bandwidth-bound in matrix multiplications, I/O-bandwidth-bound in AllReduce, and underutilizes compute resources. These traits make it well-suited for both DIMM-based Near-Memory Processing (NMP) offloading and also communication quantization. Analytical modeling shows that a 16 × A10 with NMP DIMMs and INT8 communication quantization can match 4 × A100 performance at 30% lower cost and even surpass it under equal cost. These results demonstrate the potential of our approach for cost-efficient LLM inference on mid-tier GPUs.
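A rough, assumption-laden sketch of the kind of analytical model involved: decode-step time modeled as weight-streaming time plus AllReduce time, with NMP DIMMs contributing extra near-memory bandwidth and INT8 quantization halving communication volume. Every number below is a placeholder, not a value from the paper.

```python
# Back-of-the-envelope decode-stage model (all parameters hypothetical).
# The letter characterizes decode as memory-bandwidth-bound in GEMV and
# I/O-bandwidth-bound in AllReduce; this sketch just adds those two terms.
def decode_step_time(weight_bytes_per_gpu, mem_bw_gbs, allreduce_bytes, io_bw_gbs):
    """Seconds per decode step: stream the weights once, then one AllReduce."""
    gemv = weight_bytes_per_gpu / (mem_bw_gbs * 1e9)
    comm = allreduce_bytes / (io_bw_gbs * 1e9)
    return gemv + comm


if __name__ == "__main__":
    weights = 14e9           # bytes of weights per GPU (hypothetical shard size)
    fp16_allreduce = 0.5e9   # bytes exchanged per step (hypothetical)

    baseline = decode_step_time(weights, mem_bw_gbs=600,
                                allreduce_bytes=fp16_allreduce, io_bw_gbs=32)
    # NMP DIMMs add near-memory bandwidth for the GEMV, and INT8 quantization
    # halves AllReduce traffic (both effects modeled very coarsely here).
    with_nmp = decode_step_time(weights, mem_bw_gbs=600 + 400,
                                allreduce_bytes=fp16_allreduce / 2, io_bw_gbs=32)
    print(f"baseline decode step: {baseline*1e3:.1f} ms, with NMP+INT8: {with_nmp*1e3:.1f} ms")
```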
Citations: 0
Understanding the Performance Behaviors of End-to-End Protein Design Pipelines on GPUs
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-12-19 · DOI: 10.1109/LCA.2025.3646250
Jinwoo Hwang;Yeongmin Hwang;Tadiwos Meaza;Hyeonbin Bae;Jongse Park
Recent computational advances enable protein design pipelines to run end-to-end on GPUs, yet their heterogeneous computational behaviors remain undercharacterized at the system level. We implement and profile a representative pipeline at both component and full-pipeline granularities across varying inputs and hyperparameters. Our characterization identifies generally low GPU utilization and high sensitivity to sequence length and sampling strategies. We outline future research directions based on these insights and release an open-source pipeline and profiling scripts to facilitate further studies.
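A minimal sketch of component-level GPU timing in the spirit of this characterization, using CUDA events in PyTorch; the "component" here is a stand-in matmul, and the authors' released profiling scripts are the authoritative reference.

```python
# Illustrative per-component GPU timing with CUDA events (not the authors'
# scripts). The component being timed is a generic placeholder.
import torch


def time_component(fn, *args, iters=10):
    """Return mean GPU time in ms of fn(*args), measured with CUDA events."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    fn(*args)                      # warm-up
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters


if __name__ == "__main__" and torch.cuda.is_available():
    x = torch.randn(64, 1024, 1024, device="cuda")
    w = torch.randn(1024, 1024, device="cuda")
    print("stand-in pipeline component:", time_component(torch.matmul, x, w), "ms")
```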
Citations: 0
Exploring KV Cache Quantization in Multimodal Large Language Model Inference
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-12-19 · DOI: 10.1109/LCA.2025.3646170
Hyesung Ahn;Ranggi Hwang;Minsoo Rhu
Multimodal large language models (MLLMs) have demonstrated strong performance across modalities, such as image, video, and audio understanding, by leveraging large language models (LLMs) as a backbone. However, a critical challenge in MLLM inference is the large memory capacity required for the key–value (KV) cache, particularly when processing high-resolution images. This pressure often forces heterogeneous CPU–GPU systems to offload the KV cache to CPU memory, introducing substantial transfer latency. KV cache quantization is a promising way to reduce this memory demand, yet it remains underexplored for MLLM inference. In this work, we characterize MLLM inference and present a text-centric KV cache quantization method that retains only 10% of tokens in high precision while quantizing the rest. Our method reduces Time-To-First-Token (TTFT) by 1.7× and Time-Per-Output-Token (TPOT) by 4.3×, with negligible accuracy loss.
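A minimal sketch of the general mixed-precision idea, assuming a boolean mask that marks the small high-precision (e.g., text-token) subset while the remaining entries are quantized per token to INT8; the selection policy and storage layout are simplifications, not the paper's exact scheme.

```python
# Sketch of mixed-precision KV cache storage: masked tokens stay in FP16,
# the rest are quantized per token to INT8 with a per-token scale.
import numpy as np


def quantize_kv(kv, keep_mask):
    """kv: (tokens, dim) float32; keep_mask: (tokens,) bool for high-precision tokens."""
    scale = np.abs(kv).max(axis=1, keepdims=True) / 127.0 + 1e-12   # per-token scale
    q = np.round(kv / scale).astype(np.int8)
    return {"fp16": kv[keep_mask].astype(np.float16),
            "int8": q[~keep_mask], "scale": scale[~keep_mask],
            "mask": keep_mask, "dim": kv.shape[1]}


def dequantize_kv(p):
    """Reassemble the full KV tensor for attention."""
    out = np.empty((p["mask"].size, p["dim"]), dtype=np.float32)
    out[p["mask"]] = p["fp16"].astype(np.float32)
    out[~p["mask"]] = p["int8"].astype(np.float32) * p["scale"]
    return out


if __name__ == "__main__":
    kv = np.random.randn(1000, 128).astype(np.float32)
    mask = np.zeros(1000, dtype=bool)
    mask[:100] = True                       # keep ~10% of tokens in high precision
    rec = dequantize_kv(quantize_kv(kv, mask))
    print("max abs error on quantized tokens:", np.abs(rec - kv)[~mask].max())
```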
Citations: 0
CrossFetch: A Prefetching Scheme for Cross-Page Prefetching in the Physical Address Space
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-12-08 · DOI: 10.1109/LCA.2025.3640965
Qi Shao;Per Stenstrom
Prefetching is an important technique to reduce the miss penalty in deep memory hierarchies, employing multiple levels of cache and memory. Unfortunately, state-of-the-art techniques avoid prefetching across page boundaries in physically addressed memory because contiguous virtual pages are not guaranteed to map to contiguous physical pages. Apart from low accuracy, prefetching across page boundaries can break protection domains, opening up security vulnerabilities. This paper proposes CrossFetch — the first prefetching technique that accurately and securely prefetches data across physical page boundaries. It uses a simple and novel translation mechanism that combines a conventional TLB, called forward TLB (FTLB), with a reverse TLB (RTLB) that provides mappings of physical pages to virtual. CrossFetch leverages a conventional Page Table Walker invoked by a conventional TLB to load mappings into the FTLB and RTLB. The paper demonstrates how CrossFetch can hide far-memory misses in hybrid main-memory systems and last-level cache misses. We show that CrossFetch can improve IPC by 5.7% (up to 27.7%) compared to intra-page prefetchers on SPEC2017 benchmarks where the tolerance of far-memory misses dominates.
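A simplified software model of the translation step: the reverse TLB maps the current physical page back to its virtual page, the forward TLB translates the next virtual page, and the prefetch is dropped if either mapping is absent. The dictionaries and page size below are illustrative assumptions, not the paper's hardware.

```python
# Simplified FTLB/RTLB model: a physical-address prefetch that would cross a
# page boundary is redirected through virtual space and issued only if both
# translations are already present (so it stays within known mappings).
PAGE = 4096


def cross_page_prefetch_addr(phys_addr, stride, ftlb, rtlb):
    target = phys_addr + stride
    if target // PAGE == phys_addr // PAGE:
        return target                          # same physical page: no translation needed
    vpn = rtlb.get(phys_addr // PAGE)          # reverse TLB: which virtual page is this?
    if vpn is None:
        return None
    next_vpn = vpn + (target // PAGE - phys_addr // PAGE)
    next_ppn = ftlb.get(next_vpn)              # forward TLB: is the next virtual page mapped?
    if next_ppn is None:
        return None                            # drop the prefetch: unknown page
    return next_ppn * PAGE + target % PAGE


if __name__ == "__main__":
    ftlb = {0x10: 0x7A, 0x11: 0x3C}            # vpn -> ppn (toy contents)
    rtlb = {ppn: vpn for vpn, ppn in ftlb.items()}
    addr = 0x7A * PAGE + 4032                  # 64 bytes below the page boundary
    print(hex(cross_page_prefetch_addr(addr, 128, ftlb, rtlb)))  # lands in ppn 0x3C
```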
Citations: 0
LeakDiT: Diffusion Transformers for Trace-Augmented Side-Channel Analysis
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-12-02 · DOI: 10.1109/LCA.2025.3639372
Insup Lee;Daehyeon Bae;Seokhie Hong;Sangjin Lee
Deep learning has been extensively used in side-channel analysis (SCA), making trace data insufficiency and class imbalance a critical challenge. Although several studies have explored trace augmentation with generative models, two core limitations remain: (i) insufficient integration of SCA domain knowledge into the models and (ii) limited adoption of state-of-the-art diffusion transformers (DiT). This letter presents LeakDiT, a domain-specific one-dimensional DiT that generates high-quality traces. LeakDiT introduces a loss based on normalized inter-class variance (NICV) to produce realistic traces that preserve the leakage structure. Experimental results demonstrate that LeakDiT improves SCA performance and reduces the number of required traces for key recovery.
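For reference, the NICV statistic underlying the loss is Var_y(E[X_t | Y = y]) / Var(X_t) per trace sample t. The sketch below computes it and an L2-style distance between the NICV profiles of real and generated traces; the exact loss formulation in the letter may differ.

```python
# NICV computation and an illustrative NICV-matching loss (assumed form).
import numpy as np


def nicv(traces, labels):
    """traces: (n, samples) float array; labels: (n,) int class labels."""
    class_means = np.stack([traces[labels == c].mean(axis=0) for c in np.unique(labels)])
    return class_means.var(axis=0) / (traces.var(axis=0) + 1e-12)


def nicv_loss(real_traces, fake_traces, labels):
    """L2 distance between NICV profiles of real and generated traces."""
    return float(np.mean((nicv(real_traces, labels) - nicv(fake_traces, labels)) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.integers(0, 9, size=2000)                      # e.g., Hamming-weight classes
    leak = np.outer(y - y.mean(), np.linspace(0, 1, 100))  # label-dependent leakage
    real = leak + rng.normal(0, 1, (2000, 100))
    fake = rng.normal(0, 1, (2000, 100))                   # no leakage structure
    print("NICV loss for a structureless generator:", nicv_loss(real, fake, y))
```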
Citations: 0
Another Mirage of Breaking MIRAGE: Debunking Occupancy-Based Side-Channel Attacks on Fully Associative Randomized Caches
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-11-27 · DOI: 10.1109/LCA.2025.3638260
Chris Cao;Gururaj Saileshwar
A recent work presented at USENIX Security 2025, Systematic Evaluation of Randomized Cache Designs against Cache Occupancy (RCO), claims that cache-occupancy-based side-channel attacks can recover AES keys from the MIRAGE randomized cache. In this paper, we examine these claims and find that they arise from a flawed modeling of randomized caches in RCO. Critically, we find that the security properties of randomized caches strongly depend on the seeding methodology used to initialize random number generators (RNG) used in these caches. RCO’s modeling uses a constant seed to initialize the cache RNGs for each simulated AES encryption, causing every simulated AES encryption to artificially evict the same sequence of cache lines. This departs from accurate modeling of such randomized caches, where eviction sequences vary randomly for each program execution. We observe that an accurate modeling of such randomized caches, where the RNG seed is randomized in each simulation, causes correlations between AES T-table accesses and attacker observations to disappear, and the attack to fail. These findings show that the previously claimed leakages are due to flawed modeling and that with correct modeling, MIRAGE does not leak AES keys via occupancy based side-channels.
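The seeding issue is easy to reproduce in miniature: re-seeding a cache's RNG with a constant for every simulated run yields the identical "random" eviction sequence each time, whereas a fresh seed per run does not. The toy example below is illustrative only and unrelated to the evaluated simulators.

```python
# Toy demonstration of why RNG seeding methodology matters when modeling a
# randomized cache: a constant seed per simulated run replays the same
# eviction sequence, which is exactly the artifact that can create spurious
# correlations with the victim's accesses.
import random


def eviction_sequence(seed, length=8, sets=1024):
    rng = random.Random(seed)                 # the cache's RNG
    return [rng.randrange(sets) for _ in range(length)]


if __name__ == "__main__":
    # Flawed methodology: constant seed for every simulated encryption.
    runs_const = [eviction_sequence(seed=42) for _ in range(3)]
    # Accurate methodology: a fresh, random seed per simulated execution.
    sys_rng = random.SystemRandom()
    runs_rand = [eviction_sequence(seed=sys_rng.randrange(2**32)) for _ in range(3)]
    print("constant seed -> identical sequences:", len({tuple(r) for r in runs_const}) == 1)
    print("random seed   -> identical sequences:", len({tuple(r) for r in runs_rand}) == 1)
```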
Citations: 0
Fusing Adds and Shifts for Efficient Dot Products
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-11-26 · DOI: 10.1109/LCA.2025.3637718
Pavel Golikov;Karthik Ganesan;Gennady Pekhimenko;Mark C. Jeffrey
Dot products are heavily used in applications like graphics, signal processing, navigation, and artificial intelligence (AI). These AI models in particular impose significant computational demands on modern computers. Current accelerators typically implement dot product hardware as a row of multipliers followed by a tree of adders. However, treating multiplication and summation operations separately leads to sub-optimal hardware. In contrast, we obtain significant area savings by considering the dot product operation as a whole. We propose FASED, which fuses components of a Booth multiplier with the adder tree to eliminate a significant portion of full adders from a baseline INT8×INT8,4,2 design. Compared to popular dot product hardware units, FASED reduces area by up to 1.9×.
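The arithmetic being exploited can be shown in software: with radix-4 Booth recoding, each INT4 weight becomes at most two signed digits in {-2, -1, 0, 1, 2}, so every product term in the dot product reduces to a shifted, possibly negated copy of the activation that an adder tree can absorb. The sketch below verifies this identity; it illustrates the arithmetic, not the hardware design itself.

```python
# Software illustration of a shift-add dot product via radix-4 Booth recoding.
def booth_radix4_digits(b, width=4):
    """Signed-digit (radix-4 Booth) recoding of a two's-complement integer.

    Returns digits d_k in {-2,-1,0,1,2} such that b == sum(d_k * 4**k).
    """
    bits = [(b >> i) & 1 for i in range(width)] + [(b >> (width - 1)) & 1]  # sign-extend
    prev = 0
    digits = []
    for pos in range(0, width, 2):
        digits.append(-2 * bits[pos + 1] + bits[pos] + prev)
        prev = bits[pos + 1]
    return digits


def fused_dot(acts, weights):
    """Dot product of INT8 activations with INT4 weights using only shifts and adds."""
    acc = 0
    for a, w in zip(acts, weights):
        for k, d in enumerate(booth_radix4_digits(w)):
            if d:
                term = a << 1 if abs(d) == 2 else a   # |d| is 1 or 2: multiply by shifting
                term <<= 2 * k                        # weight of the radix-4 digit
                acc += term if d > 0 else -term
    return acc


if __name__ == "__main__":
    acts = [17, -90, 33, 127]   # INT8 activations
    wts = [3, -8, 7, -1]        # INT4 weights
    assert fused_dot(acts, wts) == sum(a * w for a, w in zip(acts, wts))
    print("fused shift-add dot product:", fused_dot(acts, wts))
```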
Citations: 0