
Latest publications in IEEE Computer Architecture Letters

DAWN: Efficient Distribution of Attention Workload in PIM-Enabled Systems for LLM Inference
IF 1.4, CAS Zone 3 (Computer Science), Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2026-02-16. DOI: 10.1109/LCA.2026.3665202
Jaehoon Chung;Jinho Han;Young-Ho Gong;Sung Woo Chung
Recently, processing-in-memory (PIM) units have been deployed to accelerate matrix-vector multiplications in large language models (LLMs). However, due to their limited flexibility, PIMs require a strict data layout for storing matrices in memory. As LLM inference operates autoregressively, new elements are appended to the stored matrices during inference, necessitating costly data layout reorganization. Moreover, since the conventional workload allocation method assigns entire matrices solely to PIMs, it causes data layout reorganization overhead (i.e., excessive memory writes). Furthermore, the significant variance in matrix sizes exacerbates PIM load imbalance. In this letter, we propose DAWN, a novel workload allocation method. DAWN divides matrices into equally sized chunks and employs a single chunk as the allocation unit. DAWN assigns a portion of chunks to traditional accelerators (e.g., neural processing units), which have no constraints on data layout for computation, to mitigate reorganization overhead. DAWN evenly distributes the remaining chunks across PIMs using a greedy approach to achieve PIM load balancing. Our simulation results show that DAWN improves throughput by up to 44.2% (34.8% on average) over the conventional workload allocation method.
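The chunk-granular allocation described in the abstract can be sketched in a few lines (an illustrative model, not the authors' implementation; the `npu_fraction` knob and the chunk accounting are assumptions):

```python
# Illustrative sketch of DAWN-style allocation: each matrix is split into
# equally sized chunks, a portion is offloaded to a layout-agnostic
# accelerator (e.g., an NPU), and the rest are spread across PIM units
# greedily, always giving the next chunk to the least-loaded PIM.
import heapq

def allocate_chunks(matrix_rows, chunk_rows, num_pims, npu_fraction=0.25):
    """matrix_rows: rows of each stored matrix (sizes vary across requests).
    Returns (chunks_on_npu, sorted per-PIM loads in chunks)."""
    chunks_on_npu = 0
    heap = [(0, pim_id) for pim_id in range(num_pims)]   # (load, id) min-heap
    heapq.heapify(heap)
    for rows in matrix_rows:
        chunks = -(-rows // chunk_rows)                  # ceil division
        to_npu = int(chunks * npu_fraction)              # no layout constraint
        chunks_on_npu += to_npu
        for _ in range(chunks - to_npu):                 # greedy balancing
            load, pim_id = heapq.heappop(heap)
            heapq.heappush(heap, (load + 1, pim_id))
    return chunks_on_npu, sorted(load for load, _ in heap)
```

Even with widely varying matrix sizes, the greedy rule keeps per-PIM loads within one chunk of each other, which is the load-balancing property the letter relies on.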
Citations: 0
2025 Reviewers List*
IF 1.4, CAS Zone 3 (Computer Science), Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2026-02-12. DOI: 10.1109/LCA.2026.3653095
Citations: 0
Driving the Core Frontend With LiteBTB
IF 1.4, CAS Zone 3 (Computer Science), Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2026-02-11. DOI: 10.1109/LCA.2026.3663702
Roman K. Brunner;Rakesh Kumar
The branch target buffer (BTB) is a central component of high-performance core front-ends, as it not only steers instruction fetch by uncovering upcoming control flow but also enables highly effective fetch-directed instruction prefetching. However, the massive instruction footprints of modern server applications far exceed the capacities of moderately sized BTBs, resulting in frequent misses that inevitably hurt performance. While commercial CPUs deploy large BTBs to mitigate this problem, they incur high storage and area overheads. Prior efforts to reduce BTB storage have primarily targeted branch targets, which has proven highly effective, so much so that the tag storage now dominates the BTB storage budget. We make a key observation that BTBs exhibit a large degree of tag redundancy, i.e., only a small fraction of entries contain unique tags, and this fraction falls sharply as BTB capacity grows. Leveraging this insight, we propose LiteBTB, which employs a dedicated hardware structure to store unique tags only once and replaces per-entry tags in the BTB with compact tag pointers. To avoid latency overheads, LiteBTB accesses the tag storage and BTB in parallel. Our evaluation shows that LiteBTB reduces storage by up to 13.1% compared to the state-of-the-art BTB design, called BTB-X, while maintaining equivalent performance. Alternatively, with the same storage budget, LiteBTB accommodates up to 1.125× more branches, yielding up to 2.7% performance improvement.
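The tag-deduplication idea can be modeled behaviorally (a sketch only; field widths, the pointer encoding, and the parallel probe are hardware details this model ignores):

```python
# Behavioral sketch of a deduplicated BTB: unique tags live once in a
# separate tag store, and each BTB entry keeps only a compact pointer into
# that store plus the branch target.
class DedupBTB:
    def __init__(self):
        self.tag_store = []   # unique tags, each stored exactly once
        self.tag_index = {}   # tag -> position in tag_store
        self.entries = {}     # branch PC -> (tag pointer, target)

    def insert(self, pc, tag, target):
        if tag not in self.tag_index:
            self.tag_index[tag] = len(self.tag_store)
            self.tag_store.append(tag)
        self.entries[pc] = (self.tag_index[tag], target)

    def lookup(self, pc, tag):
        entry = self.entries.get(pc)
        if entry is None:
            return None
        tag_ptr, target = entry
        # In hardware the tag store and BTB are probed in parallel to hide
        # the extra indirection; here we simply compare the stored tag.
        return target if self.tag_store[tag_ptr] == tag else None
```

When many entries share a tag, `tag_store` stays much smaller than `entries`, which is the redundancy the letter exploits to shrink storage.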
Citations: 0
CTL: A Case for CXL Device-Managed Hugepages
IF 1.4, CAS Zone 3 (Computer Science), Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2026-02-06. DOI: 10.1109/LCA.2026.3661563
Sangbeom Jeon;Sang-Hoon Kim
Compute Express Link (CXL) enables memory expansion via high-speed cache-coherent interconnects, yet it magnifies address translation overheads due to limited TLB reach. While hugepages alleviate translation costs, they introduce severe fragmentation and compaction overheads in long-running systems. Given this trade-off, we propose the CXL Translation Layer (CTL), a device-resident mechanism that provides the host with hugepages backed by fine-grained basepages in the device. CTL preserves hugepage-level translation efficiency while achieving flexible memory management, delivering near-native performance in ideal cases and up to 16% improvement under fragmented conditions.
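The mapping such a translation layer maintains can be sketched as follows (page sizes are standard x86 values; the free-list allocator and the `huge_no` interface are assumptions for illustration, not the paper's design):

```python
# Minimal sketch of a CXL translation layer: the host sees 2 MiB hugepages,
# while the device backs each one with 4 KiB basepages drawn from a free
# list. The basepages need not be contiguous, so the device absorbs
# fragmentation that would otherwise require compaction on the host.
HUGE = 2 * 1024 * 1024
BASE = 4 * 1024
PAGES_PER_HUGE = HUGE // BASE          # 512 basepages per hugepage

class CXLTranslationLayer:
    def __init__(self, total_basepages):
        self.free = list(range(total_basepages))   # device basepage frames
        self.map = {}                              # huge_no -> basepage frames

    def alloc_hugepage(self, huge_no):
        if len(self.free) < PAGES_PER_HUGE:
            raise MemoryError("device out of basepages")
        self.map[huge_no] = [self.free.pop() for _ in range(PAGES_PER_HUGE)]

    def translate(self, addr):
        """Device-physical address for a host address inside a hugepage."""
        huge_no, offset = divmod(addr, HUGE)
        frame = self.map[huge_no][offset // BASE]
        return frame * BASE + offset % BASE
```

The host's TLB still covers 2 MiB per entry, while the second-level lookup happens inside the device on every access, which is why the letter's evaluation centers on keeping that indirection cheap.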
Citations: 0
H3: Hybrid Architecture Using High Bandwidth Memory and High Bandwidth Flash for Cost-Efficient LLM Inference
IF 1.4, CAS Zone 3 (Computer Science), Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2026-02-04. DOI: 10.1109/LCA.2026.3660969
Minho Ha;Euiseok Kim;Hoshik Kim
Large language model (LLM) inference requires massive memory capacity to process long sequences, posing a challenge due to the capacity limitations of high bandwidth memory (HBM). High bandwidth flash (HBF) is an emerging memory device based on NAND flash that offers HBM-comparable bandwidth with much larger capacity, but suffers from disadvantages such as longer access latency, lower write endurance, and higher power consumption. This paper proposes H3, a hybrid architecture designed to effectively utilize both HBM and HBF by leveraging their respective strengths. By storing read-only data in HBF and other data in HBM, H3-equipped systems can process more requests at once with the same number of GPUs than HBM-only systems, making H3 suitable for gigantic read-only use cases in LLM inference, particularly those employing a shared pre-computed key-value cache. Simulation results show that a GPU system with H3 achieves up to 2.69× higher throughput per unit power compared to a system with HBM only. This result validates the cost-effectiveness of H3 for handling LLM inference with gigantic read-only data.
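The read-only-to-HBF placement rule can be expressed as a toy policy (illustrative only; the tensor descriptions and capacity interface are assumptions, not the paper's controller):

```python
# Toy placement policy in the spirit of H3: read-only tensors, such as
# model weights or a shared pre-computed KV-cache prefix, go to
# high-bandwidth flash, while writable state stays in HBM, since HBF's low
# write endurance and higher write cost make it unsuitable for mutable data.
def place_tensors(tensors, hbm_capacity, hbf_capacity):
    """tensors: iterable of (name, size_bytes, read_only) tuples.
    Returns {"HBM": [names...], "HBF": [names...]}."""
    placement = {"HBM": [], "HBF": []}
    used = {"HBM": 0, "HBF": 0}
    capacity = {"HBM": hbm_capacity, "HBF": hbf_capacity}
    for name, size, read_only in tensors:
        tier = "HBF" if read_only else "HBM"
        if used[tier] + size > capacity[tier]:
            raise MemoryError(f"{tier} full while placing {name}")
        used[tier] += size
        placement[tier].append(name)
    return placement
```

Because the large read-only share of the footprint lands in the cheaper, denser HBF tier, the scarce HBM capacity is freed to hold more concurrent requests' mutable state, which is where the throughput-per-power gain comes from.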
Citations: 0
De-Quantization Penalties for Interactive LLM Inference on Prosumer GPUs
IF 1.4, CAS Zone 3 (Computer Science), Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2026-01-30. DOI: 10.1109/LCA.2026.3659512
Junaid Ahmad Khan
Post-training 4-bit quantization is often treated as the default path for running large language models (LLMs) on prosumer GPUs: if decoding is memory-bandwidth bound, shrinking weights from FP16 to 4-bit should cut memory traffic and improve latency and energy efficiency. We revisit this assumption on an RTX 3090 (Ampere) that lacks native INT4 tensor support. For 1–8 billion parameter models (TinyLlama, Qwen-2.5, Mistral, Llama 3.1, DeepSeek-R1-8B), we compare native FP16 inference against AutoGPTQ 4-bit models and GGUF kernels in llama.cpp. On a standard Transformers+AutoGPTQ stack, interactive batch size $B=1$ decoding is still 1.3–2.2× slower than FP16 despite a 2.4× reduction in VRAM usage, and some INT4 configurations are up to 2.4× less energy-efficient. An optimized GGUF backend improves 4-bit TinyLlama throughput by 1.65× over GPTQ, indicating that the de-quantization penalty is dominated by kernel design rather than hardware limits. We conclude that on prosumer GPUs without native INT4 tensor cores, 4-bit quantization is only attractive when paired with mature low-bit kernels; otherwise, FP16 remains the more robust choice for interactive workloads.
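The B=1 decode comparison can be approximated with a simple timing harness (a generic sketch; the warmup count, token count, and backend stubs are assumptions, not the paper's measurement setup):

```python
# Generic sketch of an interactive (batch size B=1) decode measurement:
# time a per-token decode loop for a backend and report tokens/second, so
# an FP16 backend and a quantized backend can be compared directly.
import time

def decode_throughput(step_fn, n_tokens=100, warmup=10):
    """step_fn emits one token per call; returns tokens per second."""
    for _ in range(warmup):              # exclude first-call/setup effects
        step_fn()
    start = time.perf_counter()
    for _ in range(n_tokens):
        step_fn()
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

def slowdown(fp16_step, int4_step, **kwargs):
    """Ratio > 1 means the quantized backend decodes slower than FP16."""
    return decode_throughput(fp16_step, **kwargs) / decode_throughput(int4_step, **kwargs)
```

In the letter's terms, a `slowdown` of 1.3-2.2 on the Transformers+AutoGPTQ stack is the de-quantization penalty: the 4-bit weights save VRAM, but the extra unpacking work in the kernels dominates on hardware without native INT4 tensor support.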
Citations: 0
SAP: Shared-Aware Prefetching for Reducing Inter-Chiplet Data Access Latency
IF 1.4, CAS Zone 3 (Computer Science), Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2026-01-28. DOI: 10.1109/LCA.2026.3658371
Junpei Huang;Haobo Xu;Yiming Gan;Ying Li;Wenhao Sun;Mengdi Wang;Xiaotong Wei;Feng Min;Ying Wang;Yinhe Han
In multi-chiplet systems, inter-chiplet shared-data transfers pose a significant bottleneck, prolonging the critical paths of memory accesses. In inter-chiplet coherence traffic, since each chiplet often needs to wait reactively for data from remote chiplets, proactive data-fetching mechanisms such as prefetching are essential to anticipate inter-chiplet data accesses and mitigate latency. Nevertheless, traditional prefetchers are inadequate for explicitly handling inter-chiplet shared-data transfers, overlooking potential prefetching opportunities. To overcome this limitation, we propose SAP, a shared-aware prefetching mechanism that minimizes inter-chiplet data access latency. By transforming the IO chiplet into an active prefetching agent, SAP proactively fetches inter-chiplet shared data before demand requests arrive, utilizing a sharing table to track recent shared-data events and a prefetch agent to initiate inter-chiplet data transfers early. Our experiments on a chiplet-based system demonstrate that SAP improves system throughput by 13.44% and reduces execution time by 12.33% compared to the prior design.
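The sharing table at the heart of the mechanism can be modeled conceptually (table size, fields, and the LRU policy are assumptions for illustration):

```python
# Conceptual model of a sharing table on the IO chiplet: it remembers which
# chiplets recently consumed each shared cache line, so an updated line can
# be pushed to them before their next demand request crosses the
# interconnect.
from collections import OrderedDict

class SharingTable:
    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.table = OrderedDict()         # line address -> consumer chiplet ids

    def record_transfer(self, line, consumer):
        """Log that `consumer` fetched `line` from a remote chiplet."""
        consumers = self.table.pop(line, set())
        consumers.add(consumer)
        self.table[line] = consumers       # re-insert at MRU position
        if len(self.table) > self.capacity:
            self.table.popitem(last=False) # evict least-recently shared line

    def prefetch_targets(self, line):
        """Chiplets to push `line` to when its home copy is updated."""
        return self.table.get(line, set())
```

A prefetch agent consulting this table turns the reactive wait-for-remote-data pattern into a proactive push, which is exactly the latency the letter targets.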
Citations: 0
GEMM the New Gem: The Inevitable Kernel and its Sensitivity to Compiler Optimizations and Libraries
IF 1.4, CAS Zone 3 (Computer Science), Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2026-01-21. DOI: 10.1109/LCA.2026.3656423
Siyuan Ma;Bagus Hanindhito;Anushka Subramanian;Lizy K. John
When the SPEC benchmark suite was first assembled in 1989, matrix multiplication code matrix300 was one of the 10 programs in the suite, but it was discarded within 2-3 years due to the high sensitivity of matrix multiplication to compiler optimizations. However, with the advent of machine learning (ML), neural networks, and generative AI (GenAI), matrix multiplication is an integral part of the modern computing workload. While sensitive, general matrix multiplication (GEMM) cannot be ignored anymore, especially if hardware that runs ML workloads is being evaluated. In this paper, the sensitivity of GEMM workloads to libraries and compiler optimizations is studied. While it may be inevitable to use matmul kernels as a benchmark to understand the performance of accelerators for machine learning, understanding the sensitivity to compiler optimizations and software libraries can help to optimize and interpret the results appropriately. We observe more than 9000× variation in CPU runtimes and around 84× variation in GPU runtimes depending on the optimizations used.
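The sensitivity the letter measures is easy to reproduce in miniature: the same multiply as an unoptimized triple loop versus an optimized BLAS call via NumPy (sizes here are tiny and for illustration; on real hardware and full benchmark sizes the gaps are far larger):

```python
# A naive GEMM next to a timing helper; comparing it against a BLAS-backed
# path (np.array(a) @ np.array(b)) shows how much of GEMM performance comes
# from the library/kernel rather than the arithmetic itself.
import time
import numpy as np

def naive_matmul(a, b):
    """Textbook ijk triple loop: the 'unoptimized' end of the spectrum."""
    n, m, k = len(a), len(b), len(b[0])
    c = [[0.0] * k for _ in range(n)]
    for i in range(n):
        for j in range(k):
            s = 0.0
            for p in range(m):
                s += a[i][p] * b[p][j]
            c[i][j] = s
    return c

def timed(fn, *args):
    """Return (result, elapsed seconds) for one call."""
    t0 = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - t0
```

On a few-hundred-square problem the BLAS path is typically orders of magnitude faster than `naive_matmul` while producing the same result, which is the same effect, writ small, that the letter observes across compiler flags and libraries.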
Citations: 0
Hisui: Unlocking Tiered Memory Efficiency for FaaS Workloads
IF 1.4, CAS Zone 3 (Computer Science), Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2026-01-14. DOI: 10.1109/LCA.2026.3654119
Seonggyu Han;Sangwoong Kim;Minho Kim;Daehoon Kim
Modern tiered-memory architectures are increasingly adopted in cloud servers with extensive physical capacity. Realizing their full performance potential, however, requires effective page management. Existing systems, tuned for long-running workloads, primarily rely on access-count-based promotion. Yet this policy is ill-suited to the short-lived, event-driven model of Function-as-a-Service (FaaS) workloads. The resulting workload-architecture mismatch yields poor page placement and severely degrades architectural efficiency. We present Hisui, a FaaS-aware tiered-memory management system tailored to FaaS workloads. It stages pages with high expected reuse using two mechanisms: an FMem admission filter and an invocation-frequency–weighted valuation that promotes pages by descending gain. Hisui delivers up to 1.57× higher throughput than access-count baselines and consistently lowers latency on real workloads.
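An invocation-frequency-weighted ranking in the spirit of Hisui can be sketched as follows (the exact valuation function and its inputs are assumptions, not the paper's formula):

```python
# Sketch of invocation-frequency-weighted promotion: pages belonging to
# frequently invoked functions are promoted to fast memory first, in
# descending order of expected gain, rather than by raw access counts
# alone as in access-count-based policies.
def rank_promotions(pages, invocation_freq):
    """pages: list of (page_id, owning_function, access_count) tuples.
    invocation_freq: map from function name to invocations per interval.
    Returns page ids in descending order of expected gain."""
    def gain(page):
        _, fn, access_count = page
        # Weight accesses by how often the owning function is invoked:
        # a moderately accessed page of a hot function beats a heavily
        # accessed page of a function that rarely runs.
        return access_count * invocation_freq.get(fn, 0)
    return [page_id for page_id, _, _ in sorted(pages, key=gain, reverse=True)]
```

This is what lets short-lived functions benefit from tiering: their pages never accumulate the access counts a long-running workload would, but their invocation frequency still signals reuse.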
Citations: 0
UniCNet: Unified Cycle-Accurate Simulation for Composable Chiplet Network With Modular Design-Integration Workflow
IF 1.4, CAS Zone 3 (Computer Science), Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2026-01-13. DOI: 10.1109/LCA.2026.3653809
Peilin Wang;Mingyu Wang;Zhirong Ye;Tao Lu;Zhiyi Yu
Composable chiplet-based architectures adopt a two-stage design flow: first chiplet design, then modular integration, while still presenting a shared-memory, single-OS system view. However, the heterogeneous and modular nature of the resulting network may introduce performance inefficiencies and functional correctness issues, calling for an advanced simulation tool. In this paper, we introduce UniCNet, an open-source, unified, cycle-accurate network simulator designed for composable chiplet-based architectures. To achieve both accurate modeling and unified simulation, UniCNet employs a design-integration workflow that closely aligns with the composable chiplet design flow. It supports several key features oriented towards composable chiplet scenarios and introduces the first cycle-level chiplet protocol interface model. UniCNet further supports multi-threaded simulation, achieving up to 4× speedup with no cycle accuracy loss. We validate UniCNet against RTL models and demonstrate its utility via several case studies.
IEEE Computer Architecture Letters, vol. 25, no. 1, pp. 37-40.
Citations: 0
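Multi-threaded simulation with "no cycle accuracy loss", as UniCNet claims, requires that no component ever runs ahead of another across a cycle boundary. A common way to guarantee this is to barrier-synchronize worker threads at the end of every cycle. The sketch below shows that pattern under assumed names (a `tick(cycle)` component interface, a `simulate` driver); it is a minimal illustration, not UniCNet's actual code:

```python
import threading

def simulate(components, cycles, n_workers=2):
    """Advance every component one cycle at a time; workers meet at a
    barrier after each cycle, so cycle N+1 cannot begin anywhere until
    cycle N has finished everywhere (this preserves cycle accuracy)."""
    barrier = threading.Barrier(n_workers)
    # Static partition of components across worker threads.
    shards = [components[i::n_workers] for i in range(n_workers)]

    def worker(shard):
        for cycle in range(cycles):
            for comp in shard:
                comp.tick(cycle)
            barrier.wait()  # every shard finishes cycle N before N+1 starts

    threads = [threading.Thread(target=worker, args=(s,)) for s in shards]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

A per-cycle barrier is the simplest correctness mechanism; production simulators often reduce its overhead by exploiting communication latencies to synchronize less frequently while still preserving cycle-accurate ordering.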