
Latest publications in IEEE Computer Architecture Letters

Memory-Centric MCM-GPU Architecture
IF 1.4 | CAS Tier 3, Computer Science | Q4, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-04-09 | DOI: 10.1109/LCA.2025.3553766
Hossein SeyyedAghaei;Mahmood Naderan-Tahan;Magnus Jahre;Lieven Eeckhout
The demand for powerful GPUs continues to grow, driven by modern applications that require ever-increasing computational power and memory bandwidth. Multi-Chip Module (MCM) GPUs offer scalability potential by integrating GPU chiplets on an interposer substrate; however, they are hindered by their GPU-centric design, i.e., off-chip GPU bandwidth is statically (at design time) allocated to local versus remote memory accesses. This paper presents the memory-centric MCM-GPU architecture. By connecting the HBM stacks, rather than the GPUs, on the interposer, and by connecting the GPUs to bridges on the interposer network, the full off-chip GPU bandwidth can be dynamically allocated to local and remote memory accesses. Preliminary results demonstrate the potential of the memory-centric architecture, offering an average 1.36× (and up to 1.90×) performance improvement over a GPU-centric architecture.
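To make the bandwidth argument concrete, here is a minimal Python sketch (all bandwidth and demand figures are hypothetical, not taken from the letter) contrasting a static local/remote link split with dynamic allocation of the same off-chip budget:

```python
# Illustrative model only: effective off-chip bandwidth of one GPU chiplet
# under a design-time local/remote split vs. fully dynamic allocation.

def static_bw(total_gbps: float, local_share: float,
              local_demand: float, remote_demand: float) -> float:
    """Achievable bandwidth when links are partitioned at design time."""
    local_cap = total_gbps * local_share
    remote_cap = total_gbps * (1.0 - local_share)
    return min(local_demand, local_cap) + min(remote_demand, remote_cap)

def dynamic_bw(total_gbps: float, local_demand: float,
               remote_demand: float) -> float:
    """Achievable bandwidth when the full link budget is shared on demand."""
    return min(local_demand + remote_demand, total_gbps)

if __name__ == "__main__":
    total = 2000.0                 # GB/s per chiplet (assumed)
    local, remote = 200.0, 1500.0  # a remote-heavy phase (assumed)
    print("static 50/50:", static_bw(total, 0.5, local, remote), "GB/s")
    print("dynamic     :", dynamic_bw(total, local, remote), "GB/s")
```

In this remote-heavy phase the static split strands unused local bandwidth (1200 vs. 1700 GB/s), which is exactly the inefficiency a memory-centric organization removes.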
Citations: 0
Analyzing and Exploiting Memory Hierarchy Parallelism With MLP Stacks
IF 1.4 | CAS Tier 3, Computer Science | Q4, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-04-08 | DOI: 10.1109/LCA.2025.3558808
Adnan Hasnat;Wim Heirman;Shoaib Akram
Obtaining high instruction throughput on modern CPUs requires generating a high degree of memory-level parallelism (MLP). MLP is typically reported as a quantitative metric at the DRAM level. However, understanding the reasons that hinder memory parallelism requires more insightful metrics and visualizations. This paper proposes a new taxonomy of MLP metrics, splitting MLP into core and prefetch components and measuring both miss and hit parallelism at each cache level. Our key contribution is an MLP stack, a visualization that integrates these metrics and connects them to performance by showing the CPI contribution of each memory level. The stack also shows speculative parallelism from dependency-bound and structural-hazard-bound loads. We implement the MLP stack in a processor simulator and conduct case studies that demonstrate the potential for targeting software optimizations (e.g., software prefetching) and hardware improvements (e.g., instruction window sizing).
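As a back-of-the-envelope sketch of the stack's accounting (the formula and every per-level number below are assumptions for illustration, not the letter's model), the CPI contribution of a memory level can be approximated as accesses per instruction times latency, divided by the parallelism observed at that level:

```python
# Illustrative CPI-stack accounting: exposed latency at each memory level is
# divided by the memory-level parallelism (MLP) that overlaps it.

LEVELS = {  # accesses/instr, latency (cycles), average MLP -- all assumed
    "L1 hit": (0.30,   4, 4.0),
    "L2 hit": (0.05,  14, 3.0),
    "L3 hit": (0.02,  40, 2.5),
    "DRAM":   (0.005, 200, 2.0),
}

def cpi_stack(base_cpi: float) -> dict:
    stack = {"base": base_cpi}
    for level, (per_instr, latency, mlp) in LEVELS.items():
        stack[level] = per_instr * latency / mlp  # overlap shrinks exposed CPI
    return stack

if __name__ == "__main__":
    for component, cpi in cpi_stack(0.6).items():
        print(f"{component:>7}: {cpi:.3f} CPI")
```

Raising MLP at any level directly shrinks that level's slice of the stack, which is what makes the visualization actionable.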
Citations: 0
Estimating CPI Stacks From Multiplexed Performance Counter Data Using Machine Learning
IF 1.4 | CAS Tier 3, Computer Science | Q4, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-04-01 | DOI: 10.1109/LCA.2025.3556644
Daniel Puckett;Tyler Tomer;Paul V. Gratz;Jiang Hu;Galen Shipman;Jered Dominguez-Trujillo;Kevin Sheridan
Optimizing software at runtime is much easier with a clear understanding of the bottlenecks facing the software. CPI stacks are a common method of visualizing these bottlenecks. However, existing proposals to implement CPI stacks require hardware modifications. To compute CPI stacks without modifying the CPU, we demonstrate CPI stacks can be estimated from existing performance counters using machine learning.
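A minimal stand-in for this idea (synthetic data and an off-the-shelf regressor; the authors' counters, features, and model are not reproduced here) trains a multi-output regressor to map counter samples to CPI-stack components:

```python
# Sketch: learn counter-values -> CPI-stack-components on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000
counters = rng.random((n, 8))            # stand-in for 8 multiplexed counters
W = rng.random((8, 3))                   # hidden linear ground truth
stacks = counters @ W + 0.05 * rng.standard_normal((n, 3))  # base/L2/DRAM CPI

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(counters[:1500], stacks[:1500])
pred = model.predict(counters[1500:])
print("per-component MAE:", np.abs(pred - stacks[1500:]).mean(axis=0).round(3))
```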
Citations: 0
Accelerating Control Flow on CGRAs via Speculative Iteration Execution
IF 1.4 | CAS Tier 3, Computer Science | Q4, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-03-26 | DOI: 10.1109/LCA.2025.3554777
Heng Cao;Zhipeng Wu;Dejian Li;Peiguang Jing;Sio Hang Pun;Yu Liu
Coarse-Grained Reconfigurable Arrays (CGRAs) offer a promising architecture for accelerating general-purpose, compute-intensive tasks. However, handling control flow within these tasks remains a challenge for CGRAs. Current methods for handling control flow in CGRAs execute condition operations before selecting branch paths, which adds extra execution time. This article proposes a CGRA architecture that decouples the control flow condition and path selection within an iteration through speculative iteration execution (SIE), where the condition is predicted before the start of the current iteration. Compared to existing methods, the SIE CGRA achieves a geometric mean speedup of 1.31× over Partial Predication, 1.17× over Dynamic-II Pipeline, and 1.12× over Dual-Issue Single-Execution.
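For reference, the summary statistic quoted above is a geometric mean over per-benchmark speedups; a short Python check (benchmark numbers invented for illustration) shows how such a figure is computed:

```python
import math

def geomean(xs):
    """Geometric mean, the standard way to summarize speedup ratios."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

speedups = [1.05, 1.42, 1.28, 1.55, 1.30]  # hypothetical per-benchmark ratios
print(f"geometric mean speedup: {geomean(speedups):.2f}x")
```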
Citations: 0
Approximate SFQ-Based Computing Architecture Modeling With Device-Level Guidelines
IF 1.4 | CAS Tier 3, Computer Science | Q4, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-03-26 | DOI: 10.1109/LCA.2025.3573740
Pratiksha Mundhe;Yuta Hano;Satoshi Kawakami;Teruo Tanimoto;Masamitsu Tanaka;Koji Inoue;Ilkwon Byun
Single-flux-quantum (SFQ) logic has emerged as a promising post-Moore technology thanks to its ultra-fast, low-energy operation. However, despite progress in various fields, its feasibility is questionable due to prohibitive cooling costs. Proven conventional ideas, such as approximate computing, may help resolve this challenge. However, introducing such ideas has been impossible due to the complex performance, power, and error trade-offs originating from the unique characteristics of SFQ devices. This work introduces approximate SFQ-based computing (AxSFQ) with an architecture modeling framework and essential design guidelines. Our optimized device-level AxSFQ demonstrates a 30–100× improvement in energy efficiency, motivating further circuit- and architecture-level exploration.
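As a generic illustration of the accuracy/efficiency trade-off that approximate computing exposes (textbook bit truncation, not the letter's SFQ-specific model), the sketch below drops low-order adder bits and measures the resulting error:

```python
# Generic approximate-computing demo: truncating low-order bits trades
# arithmetic error for (in hardware) narrower, cheaper datapaths.
import random

def approx_add(a: int, b: int, trunc_bits: int) -> int:
    mask = ~((1 << trunc_bits) - 1)  # zero out the truncated low-order bits
    return (a & mask) + (b & mask)

random.seed(1)
pairs = [(random.getrandbits(16), random.getrandbits(16)) for _ in range(10_000)]
for t in (0, 2, 4, 8):
    err = sum(abs((a + b) - approx_add(a, b, t)) for a, b in pairs) / len(pairs)
    print(f"truncate {t} bits -> mean abs error {err:8.1f}")
```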
Citations: 0
Exploiting Intel AMX Power Gating
IF 1.4 | CAS Tier 3, Computer Science | Q4, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-03-26 | DOI: 10.1109/LCA.2025.3555183
Joshua Kalyanapu;Farshad Dizani;Azam Ghanbari;Darsh Asher;Samira Mirbagher Ajorpaz
We identify a novel vulnerability in Intel AMX’s dynamic power performance scaling, enabling NetLoki, a stealthy, high-performance remote speculative attack that bypasses traditional cache defenses and leaks arbitrary addresses over a realistic network where other attacks fail. NetLoki shows a 34,900% improvement in leakage rate over NetSpectre. We show that NetLoki evades detection by three state-of-the-art microarchitectural attack detectors (EVAX, PerSpectron, RHMD), and that mitigating it via timer coarsening requires coarsening the system’s timer resolution 20,000-fold, from the standard 0.5 ns hardware timer to 10 µs. Finally, we analyze the root cause of the leakage and propose an effective defense. We show that the mitigation increases CPU power consumption by 12.33%.
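To see why timer coarsening works as a mitigation (latencies and resolutions below are invented for illustration), quantize timestamps to a coarser resolution and check whether two secret-dependent latencies remain distinguishable:

```python
# Timer-coarsening illustration: a measurement is only as precise as the
# timer's resolution, so coarse timers can hide small latency differences.

def observed(latency_ns: float, resolution_ns: float) -> float:
    """Latency as seen through a timer with the given resolution."""
    return (latency_ns // resolution_ns) * resolution_ns

fast, slow = 80.0, 140.0  # hypothetical secret-dependent latencies (ns)
for res in (0.5, 100.0, 10_000.0):  # fine hardware timer ... 10 us
    print(f"resolution {res:>8} ns -> "
          f"fast={observed(fast, res)}, slow={observed(slow, res)}")
```

At 10 µs both latencies quantize to the same value in a single measurement, though an attacker can still average over repetitions, which is why such a drastic coarsening factor is needed.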
Citations: 0
X-PPR: Post Package Repair for CXL Memory
IF 1.4 | CAS Tier 3, Computer Science | Q4, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-03-21 | DOI: 10.1109/LCA.2025.3552190
Chihun Song;Michael Jaemin Kim;Yan Sun;Houxiang Ji;Kyungsan Kim;TaeKyeong Ko;Jung Ho Ahn;Nam Sung Kim
CXL is an emerging interface that can cost-efficiently expand the capacity and bandwidth of servers by recycling DRAM modules from retired servers. Such DRAM modules, however, will likely have many uncorrectable faulty words after years of strenuous use in datacenters. To repair faulty words in the field, a few solutions based on Post Package Repair (PPR) and memory offlining have been proposed. Nonetheless, they are either unable to fix thousands of faulty words or prone to causing severe memory fragmentation, as they operate at the granularity of DRAM row and memory page addresses, respectively. In this work, for cost-efficient use of recycled DRAM modules with thousands of faulty words, we propose CXL-PPR (X-PPR), exploiting CXL’s support for near-memory processing and variable memory access latency. We demonstrate that X-PPR, implemented in a commercial CXL device with DDR4 DRAM modules, can handle a faulty-bit probability 3.3×10⁴ times higher than ECC can for a 512 GB DRAM module. Meanwhile, X-PPR negligibly degrades the performance of popular memory-intensive benchmarks, which is achieved through two mechanisms designed in X-PPR to minimize the performance impact of the additional DRAM accesses required for repairing faulty words.
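The word-granularity idea can be sketched as a remap table (an illustration of the general concept, not X-PPR's actual mechanism): accesses to faulty word addresses are redirected to spare words, instead of retiring a whole DRAM row or memory page:

```python
# Sketch: word-level repair via an address remap table (concept only).

SPARE_BASE = 0xF000_0000  # hypothetical spare-word region

class WordRepairTable:
    def __init__(self) -> None:
        self.remap: dict[int, int] = {}  # faulty word addr -> spare word addr
        self.next_spare = SPARE_BASE

    def mark_faulty(self, addr: int) -> None:
        if addr not in self.remap:
            self.remap[addr] = self.next_spare
            self.next_spare += 8         # allocate one 64-bit spare word

    def translate(self, addr: int) -> int:
        # healthy words pass through; only repaired words take the extra hop
        return self.remap.get(addr, addr)

table = WordRepairTable()
table.mark_faulty(0x1000)
print(hex(table.translate(0x1000)), hex(table.translate(0x1008)))
```

Because only accesses that hit the table pay the redirection cost, thousands of repaired words can coexist with near-native performance, which is the property the letter's two mechanisms aim to preserve.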
Citations: 0
srNAND: A Novel NAND Flash Organization for Enhanced Small Read Throughput in SSDs
IF 1.4 | CAS Tier 3, Computer Science | Q4, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-03-19 | DOI: 10.1109/LCA.2025.3571321
Jeongho Lee;Sangjun Kim;Jaeyong Lee;Jaeyoung Kang;Sungjin Lee;Nam Sung Kim;Jihong Kim
Emerging data-intensive applications with frequent small random read operations challenge the throughput capabilities of conventional SSD architectures. Although Compute Express Link-enabled SSDs allow fine-grained data access with reduced latency, their read throughput remains limited by legacy block-oriented designs. To address this, we propose srNAND, an advanced NAND flash architecture for CXL SSDs. It uses a two-stage ECC decoding mechanism to reduce read amplification, an optimized read command sequence to boost parallelism, and a request-merging module to eliminate redundant operations. Our evaluation shows that srSSD can improve read throughput by up to 10.4× compared to conventional CXL SSDs.
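Of the three mechanisms, request merging is the simplest to sketch (the page size and coalescing policy below are assumptions, not srNAND's actual design): small reads that fall within the same flash page are served by a single page sense:

```python
# Sketch: coalesce small reads that target the same flash page.
from collections import defaultdict

def merge_reads(addresses, page_size=4096):
    """Group byte addresses by flash page; one page read serves each group."""
    merged = defaultdict(list)
    for addr in addresses:
        merged[addr // page_size].append(addr % page_size)
    return merged

reads = [8200, 8256, 8704, 123, 9000]
for page, offsets in merge_reads(reads).items():
    print(f"page {page}: one flash read serves offsets {sorted(offsets)}")
```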
Citations: 0
DynaFlow: An ML Framework for Dynamic Dataflow Selection in SpGEMM Accelerators
IF 1.4 | CAS Tier 3, Computer Science | Q4, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-03-15 | DOI: 10.1109/LCA.2025.3570667
Sanjali Yadav;Bahar Asgari
Sparse matrix-matrix multiplication (SpGEMM) is a critical operation in numerous fields, including scientific computing, graph analytics, and deep learning, leveraging matrix sparsity to reduce both storage and computation costs. However, the irregular structure of sparse matrices poses significant challenges for performance optimization. Existing hardware accelerators often employ fixed dataflows designed for specific sparsity patterns, leading to performance degradation when the input deviates from these assumptions. As SpGEMM adoption expands across a broad spectrum of sparsity workloads, the demand grows for accelerators capable of dynamically adapting their dataflow schemes to diverse sparsity patterns. To address this, we propose DynaFlow, a machine learning-based framework that trains on the set of dataflows supported by any given accelerator and learns to predict the optimal dataflow based on the input sparsity pattern. By leveraging decision trees and deep reinforcement learning, DynaFlow surpasses static dataflow selection approaches, achieving up to a 50× speedup.
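A minimal stand-in for the framework's decision-tree path (the features, labels, and synthetic rule below are invented for illustration; DynaFlow's actual feature set is the paper's): classify the best dataflow from simple sparsity statistics:

```python
# Sketch: predict a dataflow label from sparsity features on synthetic data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n = 1000
X = rng.random((n, 3))  # assumed features: density, row-nnz variance, bandedness
# synthetic rule standing in for measured best-dataflow labels
y = np.where(X[:, 0] > 0.5, "inner",
             np.where(X[:, 1] > 0.5, "outer", "row"))

clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X[:800], y[:800])
print("held-out accuracy:", (clf.predict(X[800:]) == y[800:]).mean())
```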
Citations: 0
Cosmos: A CXL-Based Full In-Memory System for Approximate Nearest Neighbor Search
IF 1.4 | CAS Tier 3, Computer Science | Q4, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-03-14 | DOI: 10.1109/LCA.2025.3570235
Seoyoung Ko;Hyunjeong Shim;Wanju Doh;Sungmin Yun;Jinin So;Yongsuk Kwon;Sang-Soo Park;Si-Dong Roh;Minyong Yoon;Taeksang Song;Jung Ho Ahn
Retrieval-Augmented Generation (RAG) is crucial for improving the quality of large language models by injecting proper contexts extracted from external sources. RAG requires high-throughput, low-latency Approximate Nearest Neighbor Search (ANNS) over billion-scale vector databases. Conventional DRAM/SSD solutions face capacity/latency limits, whereas specialized hardware or RDMA clusters lack flexibility or incur network overhead. We present Cosmos, integrating general-purpose cores within CXL memory devices for full ANNS offload and introducing rank-level parallel distance computation to maximize memory bandwidth. We also propose an adjacency-aware data placement that balances search loads across CXL devices based on inter-cluster proximity. Evaluations on SIFT1B and DEEP1B traces show that Cosmos achieves up to 6.72× higher throughput than the baseline CXL system and 2.35× over a state-of-the-art CXL-based solution, demonstrating scalability for RAG pipelines.
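The offloaded distance kernel itself is compact; in the sketch below (shard size, dimensionality, and the 4-way split are assumptions), NumPy vectorization stands in for the rank-level parallelism across DRAM ranks:

```python
# Sketch: brute-force candidate distances, striped across 4 'ranks'.
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.standard_normal((100_000, 128)).astype(np.float32)  # DB shard
query = rng.standard_normal(128).astype(np.float32)

partial = [np.linalg.norm(chunk - query, axis=1)     # each 'rank' computes
           for chunk in np.array_split(vectors, 4)]  # its stripe in parallel
dists = np.concatenate(partial)
top10 = np.argpartition(dists, 10)[:10]              # 10 nearest candidates
print("top-10 candidate ids:", np.sort(top10))
```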
Citations: 0