
Latest publications in IEEE Computer Architecture Letters

DAWN: Efficient Distribution of Attention Workload in PIM-Enabled Systems for LLM Inference
IF 1.4, CAS Zone 3 (Computer Science), Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2026-02-16. DOI: 10.1109/LCA.2026.3665202
Jaehoon Chung;Jinho Han;Young-Ho Gong;Sung Woo Chung
Recently, processing-in-memory (PIM) units have been deployed to accelerate matrix-vector multiplications in large language models (LLMs). However, due to their limited flexibility, PIMs require a strict data layout for storing matrices in memory. As LLM inference operates autoregressively, new elements are appended to the stored matrices during inference, necessitating costly data layout reorganization. Moreover, since the conventional workload allocation method assigns entire matrices solely to PIMs, it causes data layout reorganization overhead (i.e., excessive memory writes). Furthermore, the significant variance in matrix sizes exacerbates PIM load imbalance. In this letter, we propose DAWN, a novel workload allocation method. DAWN divides matrices into equally sized chunks and employs a single chunk as the allocation unit. DAWN assigns a portion of chunks to traditional accelerators (e.g., neural processing units), which have no constraints on data layout for computation, to mitigate reorganization overhead. DAWN evenly distributes the remaining chunks across PIMs using a greedy approach to achieve PIM load balancing. Our simulation results show that DAWN improves throughput by up to 44.2% (34.8% on average) over the conventional workload allocation method.
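The chunk-granular allocation described in the abstract can be sketched in a few lines (an illustrative model, not the authors' implementation; the `npu_fraction` knob and the chunk accounting are assumptions):

```python
# Illustrative sketch of DAWN-style allocation: each matrix is split into
# equally sized chunks, a portion is offloaded to a layout-agnostic
# accelerator (e.g., an NPU), and the rest are spread across PIM units
# greedily, always giving the next chunk to the least-loaded PIM.
import heapq

def allocate_chunks(matrix_rows, chunk_rows, num_pims, npu_fraction=0.25):
    """matrix_rows: rows of each stored matrix (sizes vary across requests).
    Returns (chunks_on_npu, sorted per-PIM loads in chunks)."""
    chunks_on_npu = 0
    heap = [(0, pim_id) for pim_id in range(num_pims)]   # (load, id) min-heap
    heapq.heapify(heap)
    for rows in matrix_rows:
        chunks = -(-rows // chunk_rows)                  # ceil division
        to_npu = int(chunks * npu_fraction)              # no layout constraint
        chunks_on_npu += to_npu
        for _ in range(chunks - to_npu):                 # greedy balancing
            load, pim_id = heapq.heappop(heap)
            heapq.heappush(heap, (load + 1, pim_id))
    return chunks_on_npu, sorted(load for load, _ in heap)
```

Even with widely varying matrix sizes, the greedy rule keeps per-PIM loads within one chunk of each other, which is the load-balancing property the letter relies on.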
Citations: 0
2025 Reviewers List*
IF 1.4, CAS Zone 3 (Computer Science), Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2026-02-12. DOI: 10.1109/LCA.2026.3653095
Citations: 0
Driving the Core Frontend With LiteBTB
IF 1.4, CAS Zone 3 (Computer Science), Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2026-02-11. DOI: 10.1109/LCA.2026.3663702
Roman K. Brunner;Rakesh Kumar
The branch target buffer (BTB) is a central component of high-performance core front-ends, as it not only steers instruction fetch by uncovering upcoming control flow but also enables highly effective fetch-directed instruction prefetching. However, the massive instruction footprints of modern server applications far exceed the capacities of moderately sized BTBs, resulting in frequent misses that inevitably hurt performance. While commercial CPUs deploy large BTBs to mitigate this problem, they incur high storage and area overheads. Prior efforts to reduce BTB storage have primarily targeted branch targets, which has proven highly effective, so much so that the tag storage now dominates the BTB storage budget. We make a key observation that BTBs exhibit a large degree of tag redundancy, i.e., only a small fraction of entries contain unique tags, and this fraction falls sharply as BTB capacity grows. Leveraging this insight, we propose LiteBTB, which employs a dedicated hardware structure to store unique tags only once and replaces per-entry tags in the BTB with compact tag pointers. To avoid latency overheads, LiteBTB accesses the tag storage and BTB in parallel. Our evaluation shows that LiteBTB reduces storage by up to 13.1% compared to the state-of-the-art BTB design, called BTB-X, while maintaining equivalent performance. Alternatively, with the same storage budget, LiteBTB accommodates up to 1.125× more branches, yielding up to 2.7% performance improvement.
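The tag-deduplication idea can be modeled behaviorally (a sketch only; field widths, the pointer encoding, and the parallel probe are hardware details this model ignores):

```python
# Behavioral sketch of a deduplicated BTB: unique tags live once in a
# separate tag store, and each BTB entry keeps only a compact pointer into
# that store plus the branch target.
class DedupBTB:
    def __init__(self):
        self.tag_store = []   # unique tags, each stored exactly once
        self.tag_index = {}   # tag -> position in tag_store
        self.entries = {}     # branch PC -> (tag pointer, target)

    def insert(self, pc, tag, target):
        if tag not in self.tag_index:
            self.tag_index[tag] = len(self.tag_store)
            self.tag_store.append(tag)
        self.entries[pc] = (self.tag_index[tag], target)

    def lookup(self, pc, tag):
        entry = self.entries.get(pc)
        if entry is None:
            return None
        tag_ptr, target = entry
        # In hardware the tag store and BTB are probed in parallel to hide
        # the extra indirection; here we simply compare the stored tag.
        return target if self.tag_store[tag_ptr] == tag else None
```

When many entries share a tag, `tag_store` stays much smaller than `entries`, which is the redundancy the letter exploits to shrink storage.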
Citations: 0
CTL: A Case for CXL Device-Managed Hugepages
IF 1.4, CAS Zone 3 (Computer Science), Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2026-02-06. DOI: 10.1109/LCA.2026.3661563
Sangbeom Jeon;Sang-Hoon Kim
Compute Express Link (CXL) enables memory expansion via high-speed cache-coherent interconnects, yet it magnifies address translation overheads due to limited TLB reach. While hugepages alleviate translation costs, they introduce severe fragmentation and compaction overheads in long-running systems. Given this trade-off, we propose the CXL Translation Layer (CTL), a device-resident mechanism that provides the host with hugepages backed by fine-grained basepages in the device. CTL preserves hugepage-level translation efficiency while achieving flexible memory management, delivering near-native performance in ideal cases and up to 16% improvement under fragmented conditions.
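The mapping such a translation layer maintains can be sketched as follows (page sizes are standard x86 values; the free-list allocator and the `huge_no` interface are assumptions for illustration, not the paper's design):

```python
# Minimal sketch of a CXL translation layer: the host sees 2 MiB hugepages,
# while the device backs each one with 4 KiB basepages drawn from a free
# list. The basepages need not be contiguous, so the device absorbs
# fragmentation that would otherwise require compaction on the host.
HUGE = 2 * 1024 * 1024
BASE = 4 * 1024
PAGES_PER_HUGE = HUGE // BASE          # 512 basepages per hugepage

class CXLTranslationLayer:
    def __init__(self, total_basepages):
        self.free = list(range(total_basepages))   # device basepage frames
        self.map = {}                              # huge_no -> basepage frames

    def alloc_hugepage(self, huge_no):
        if len(self.free) < PAGES_PER_HUGE:
            raise MemoryError("device out of basepages")
        self.map[huge_no] = [self.free.pop() for _ in range(PAGES_PER_HUGE)]

    def translate(self, addr):
        """Device-physical address for a host address inside a hugepage."""
        huge_no, offset = divmod(addr, HUGE)
        frame = self.map[huge_no][offset // BASE]
        return frame * BASE + offset % BASE
```

The host's TLB still covers 2 MiB per entry, while the second-level lookup happens inside the device on every access, which is why the letter's evaluation centers on keeping that indirection cheap.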
Citations: 0
H3: Hybrid Architecture Using High Bandwidth Memory and High Bandwidth Flash for Cost-Efficient LLM Inference
IF 1.4, CAS Zone 3 (Computer Science), Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2026-02-04. DOI: 10.1109/LCA.2026.3660969
Minho Ha;Euiseok Kim;Hoshik Kim
Large language model (LLM) inference requires massive memory capacity to process long sequences, posing a challenge due to the capacity limitations of high bandwidth memory (HBM). High bandwidth flash (HBF) is an emerging memory device based on NAND flash that offers HBM-comparable bandwidth with much larger capacity, but suffers from disadvantages such as longer access latency, lower write endurance, and higher power consumption. This paper proposes H3, a hybrid architecture designed to effectively utilize both HBM and HBF by leveraging their respective strengths. By storing read-only data in HBF and other data in HBM, H3-equipped systems can process more requests at once with the same number of GPUs than HBM-only systems, making H3 suitable for gigantic read-only use cases in LLM inference, particularly those employing a shared pre-computed key-value cache. Simulation results show that a GPU system with H3 achieves up to 2.69× higher throughput per unit power compared to a system with HBM only. This result validates the cost-effectiveness of H3 for handling LLM inference with gigantic read-only data.
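The read-only-to-HBF placement rule can be expressed as a toy policy (illustrative only; the tensor descriptions and capacity interface are assumptions, not the paper's controller):

```python
# Toy placement policy in the spirit of H3: read-only tensors, such as
# model weights or a shared pre-computed KV-cache prefix, go to
# high-bandwidth flash, while writable state stays in HBM, since HBF's low
# write endurance and higher write cost make it unsuitable for mutable data.
def place_tensors(tensors, hbm_capacity, hbf_capacity):
    """tensors: iterable of (name, size_bytes, read_only) tuples.
    Returns {"HBM": [names...], "HBF": [names...]}."""
    placement = {"HBM": [], "HBF": []}
    used = {"HBM": 0, "HBF": 0}
    capacity = {"HBM": hbm_capacity, "HBF": hbf_capacity}
    for name, size, read_only in tensors:
        tier = "HBF" if read_only else "HBM"
        if used[tier] + size > capacity[tier]:
            raise MemoryError(f"{tier} full while placing {name}")
        used[tier] += size
        placement[tier].append(name)
    return placement
```

Because the large read-only share of the footprint lands in the cheaper, denser HBF tier, the scarce HBM capacity is freed to hold more concurrent requests' mutable state, which is where the throughput-per-power gain comes from.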
Citations: 0
De-Quantization Penalties for Interactive LLM Inference on Prosumer GPUs
IF 1.4, CAS Zone 3 (Computer Science), Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2026-01-30. DOI: 10.1109/LCA.2026.3659512
Junaid Ahmad Khan
Post-training 4-bit quantization is often treated as the default path for running large language models (LLMs) on prosumer GPUs: if decoding is memory-bandwidth bound, shrinking weights from FP16 to 4-bit should cut memory traffic and improve latency and energy efficiency. We revisit this assumption on an RTX 3090 (Ampere) that lacks native INT4 tensor support. For 1–8 billion parameter models (TinyLlama, Qwen-2.5, Mistral, Llama 3.1, DeepSeek-R1-8B), we compare native FP16 inference against AutoGPTQ 4-bit models and GGUF kernels in llama.cpp. On a standard Transformers+AutoGPTQ stack, interactive batch size $B=1$ decoding is still 1.3–2.2× slower than FP16 despite a 2.4× reduction in VRAM usage, and some INT4 configurations are up to 2.4× less energy-efficient. An optimized GGUF backend improves 4-bit TinyLlama throughput by 1.65× over GPTQ, indicating that the de-quantization penalty is dominated by kernel design rather than hardware limits. We conclude that on prosumer GPUs without native INT4 tensor cores, 4-bit quantization is only attractive when paired with mature low-bit kernels; otherwise, FP16 remains the more robust choice for interactive workloads.
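The B=1 decode comparison can be approximated with a simple timing harness (a generic sketch; the warmup count, token count, and backend stubs are assumptions, not the paper's measurement setup):

```python
# Generic sketch of an interactive (batch size B=1) decode measurement:
# time a per-token decode loop for a backend and report tokens/second, so
# an FP16 backend and a quantized backend can be compared directly.
import time

def decode_throughput(step_fn, n_tokens=100, warmup=10):
    """step_fn emits one token per call; returns tokens per second."""
    for _ in range(warmup):              # exclude first-call/setup effects
        step_fn()
    start = time.perf_counter()
    for _ in range(n_tokens):
        step_fn()
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

def slowdown(fp16_step, int4_step, **kwargs):
    """Ratio > 1 means the quantized backend decodes slower than FP16."""
    return decode_throughput(fp16_step, **kwargs) / decode_throughput(int4_step, **kwargs)
```

In the letter's terms, a `slowdown` of 1.3-2.2 on the Transformers+AutoGPTQ stack is the de-quantization penalty: the 4-bit weights save VRAM, but the extra unpacking work in the kernels dominates on hardware without native INT4 tensor support.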
Citations: 0
SAP: Shared-Aware Prefetching for Reducing Inter-Chiplet Data Access Latency
IF 1.4, CAS Zone 3 (Computer Science), Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2026-01-28. DOI: 10.1109/LCA.2026.3658371
Junpei Huang;Haobo Xu;Yiming Gan;Ying Li;Wenhao Sun;Mengdi Wang;Xiaotong Wei;Feng Min;Ying Wang;Yinhe Han
In multi-chiplet systems, inter-chiplet shared-data transfers pose a significant bottleneck, prolonging the critical paths of memory accesses. In inter-chiplet coherence traffic, since each chiplet often needs to wait reactively for data from remote chiplets, proactive data-fetching mechanisms such as prefetching are essential to anticipate inter-chiplet data accesses and mitigate latency. Nevertheless, traditional prefetchers are inadequate for explicitly handling inter-chiplet shared-data transfers, overlooking potential prefetching opportunities. To overcome this limitation, we propose SAP, a shared-aware prefetching mechanism that minimizes inter-chiplet data access latency. By transforming the IO chiplet into an active prefetching agent, SAP proactively fetches inter-chiplet shared data before demand requests arrive, utilizing a sharing table to track recent shared-data events and a prefetch agent to initiate inter-chiplet data transfers early. Our experiments on a chiplet-based system demonstrate that SAP improves system throughput by 13.44% and reduces execution time by 12.33% compared to the prior design.
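The sharing table at the heart of the mechanism can be modeled conceptually (table size, fields, and the LRU policy are assumptions for illustration):

```python
# Conceptual model of a sharing table on the IO chiplet: it remembers which
# chiplets recently consumed each shared cache line, so an updated line can
# be pushed to them before their next demand request crosses the
# interconnect.
from collections import OrderedDict

class SharingTable:
    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.table = OrderedDict()         # line address -> consumer chiplet ids

    def record_transfer(self, line, consumer):
        """Log that `consumer` fetched `line` from a remote chiplet."""
        consumers = self.table.pop(line, set())
        consumers.add(consumer)
        self.table[line] = consumers       # re-insert at MRU position
        if len(self.table) > self.capacity:
            self.table.popitem(last=False) # evict least-recently shared line

    def prefetch_targets(self, line):
        """Chiplets to push `line` to when its home copy is updated."""
        return self.table.get(line, set())
```

A prefetch agent consulting this table turns the reactive wait-for-remote-data pattern into a proactive push, which is exactly the latency the letter targets.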
Citations: 0
GEMM the New Gem: The Inevitable Kernel and its Sensitivity to Compiler Optimizations and Libraries
IF 1.4, CAS Zone 3 (Computer Science), Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2026-01-21. DOI: 10.1109/LCA.2026.3656423
Siyuan Ma;Bagus Hanindhito;Anushka Subramanian;Lizy K. John
When the SPEC benchmark suite was first assembled in 1989, matrix multiplication code matrix300 was one of the 10 programs in the suite, but it was discarded within 2-3 years due to the high sensitivity of matrix multiplication to compiler optimizations. However, with the advent of machine learning (ML), neural networks, and generative AI (GenAI), matrix multiplication is an integral part of the modern computing workload. While sensitive, general matrix multiplication (GEMM) cannot be ignored anymore, especially if hardware that runs ML workloads is being evaluated. In this paper, the sensitivity of GEMM workloads to libraries and compiler optimizations is studied. While it may be inevitable to use matmul kernels as a benchmark to understand the performance of accelerators for machine learning, understanding the sensitivity to compiler optimizations and software libraries can help to optimize and interpret the results appropriately. We observe more than 9000× variation in CPU runtimes and around 84× variation in GPU runtimes depending on the optimizations used.
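The sensitivity the letter measures is easy to reproduce in miniature: the same multiply as an unoptimized triple loop versus an optimized BLAS call via NumPy (sizes here are tiny and for illustration; on real hardware and full benchmark sizes the gaps are far larger):

```python
# A naive GEMM next to a timing helper; comparing it against a BLAS-backed
# path (np.array(a) @ np.array(b)) shows how much of GEMM performance comes
# from the library/kernel rather than the arithmetic itself.
import time
import numpy as np

def naive_matmul(a, b):
    """Textbook ijk triple loop: the 'unoptimized' end of the spectrum."""
    n, m, k = len(a), len(b), len(b[0])
    c = [[0.0] * k for _ in range(n)]
    for i in range(n):
        for j in range(k):
            s = 0.0
            for p in range(m):
                s += a[i][p] * b[p][j]
            c[i][j] = s
    return c

def timed(fn, *args):
    """Return (result, elapsed seconds) for one call."""
    t0 = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - t0
```

On a few-hundred-square problem the BLAS path is typically orders of magnitude faster than `naive_matmul` while producing the same result, which is the same effect, writ small, that the letter observes across compiler flags and libraries.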
Citations: 0
Hisui: Unlocking Tiered Memory Efficiency for FaaS Workloads
IF 1.4, CAS Zone 3 (Computer Science), Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2026-01-14. DOI: 10.1109/LCA.2026.3654119
Seonggyu Han;Sangwoong Kim;Minho Kim;Daehoon Kim
Modern tiered-memory architectures are increasingly adopted in cloud servers with extensive physical capacity. Realizing their full performance potential, however, requires effective page management. Existing systems, tuned for long-running workloads, primarily rely on access-count-based promotion. Yet this policy is ill-suited to the short-lived, event-driven model of Function-as-a-Service (FaaS) workloads. The resulting workload-architecture mismatch yields poor page placement and severely degrades architectural efficiency. We present Hisui, a FaaS-aware tiered-memory management system tailored to FaaS workloads. It stages pages with high expected reuse using two mechanisms: an FMem admission filter and an invocation-frequency–weighted valuation that promotes pages by descending gain. Hisui delivers up to 1.57× higher throughput than access-count baselines and consistently lowers latency on real workloads.
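An invocation-frequency-weighted ranking in the spirit of Hisui can be sketched as follows (the exact valuation function and its inputs are assumptions, not the paper's formula):

```python
# Sketch of invocation-frequency-weighted promotion: pages belonging to
# frequently invoked functions are promoted to fast memory first, in
# descending order of expected gain, rather than by raw access counts
# alone as in access-count-based policies.
def rank_promotions(pages, invocation_freq):
    """pages: list of (page_id, owning_function, access_count) tuples.
    invocation_freq: map from function name to invocations per interval.
    Returns page ids in descending order of expected gain."""
    def gain(page):
        _, fn, access_count = page
        # Weight accesses by how often the owning function is invoked:
        # a moderately accessed page of a hot function beats a heavily
        # accessed page of a function that rarely runs.
        return access_count * invocation_freq.get(fn, 0)
    return [page_id for page_id, _, _ in sorted(pages, key=gain, reverse=True)]
```

This is what lets short-lived functions benefit from tiering: their pages never accumulate the access counts a long-running workload would, but their invocation frequency still signals reuse.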
Citations: 0
UniCNet: Unified Cycle-Accurate Simulation for Composable Chiplet Network With Modular Design-Integration Workflow
IF 1.4, CAS Zone 3 (Computer Science), Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2026-01-13. DOI: 10.1109/LCA.2026.3653809
Peilin Wang;Mingyu Wang;Zhirong Ye;Tao Lu;Zhiyi Yu
Composable chiplet-based architectures adopt a two-stage design flow: first chiplet design, then modular integration, while still presenting a shared-memory, single-OS system view. However, the heterogeneous and modular nature of the resulting network may introduce performance inefficiencies and functional correctness issues, calling for an advanced simulation tool. In this paper, we introduce UniCNet, an open-source, unified, cycle-accurate network simulator designed for composable chiplet-based architectures. To achieve both accurate modeling and unified simulation, UniCNet employs a design-integration workflow that closely aligns with the composable chiplet design flow. It supports several key features oriented towards composable chiplet scenarios and introduces the first cycle-level chiplet protocol interface model. UniCNet further supports multi-threaded simulation, achieving up to 4× speedup with no cycle accuracy loss. We validate UniCNet against RTL models and demonstrate its utility via several case studies.
IEEE Computer Architecture Letters, vol. 25, no. 1, pp. 37-40.
Citations: 0
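Multi-threaded simulation with "no cycle accuracy loss", as UniCNet claims, requires that no component ever runs ahead of another across a cycle boundary. A common way to guarantee this is to barrier-synchronize worker threads at the end of every cycle. The sketch below shows that pattern under assumed names (a `tick(cycle)` component interface, a `simulate` driver); it is a minimal illustration, not UniCNet's actual code:

```python
import threading

def simulate(components, cycles, n_workers=2):
    """Advance every component one cycle at a time; workers meet at a
    barrier after each cycle, so cycle N+1 cannot begin anywhere until
    cycle N has finished everywhere (this preserves cycle accuracy)."""
    barrier = threading.Barrier(n_workers)
    # Static partition of components across worker threads.
    shards = [components[i::n_workers] for i in range(n_workers)]

    def worker(shard):
        for cycle in range(cycles):
            for comp in shard:
                comp.tick(cycle)
            barrier.wait()  # every shard finishes cycle N before N+1 starts

    threads = [threading.Thread(target=worker, args=(s,)) for s in shards]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

A per-cycle barrier is the simplest correctness mechanism; production simulators often reduce its overhead by exploiting communication latencies to synchronize less frequently while still preserving cycle-accurate ordering.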