
IEEE Computer Architecture Letters: Latest Articles

Memory-Centric MCM-GPU Architecture
IF 1.4; Zone 3, Computer Science; Q4, Computer Science, Hardware & Architecture; Pub Date: 2025-04-09; DOI: 10.1109/LCA.2025.3553766
Hossein SeyyedAghaei;Mahmood Naderan-Tahan;Magnus Jahre;Lieven Eeckhout
The demand for powerful GPUs continues to grow, driven by modern-day applications that require ever-increasing computational power and memory bandwidth. Multi-Chip Module (MCM) GPUs offer the potential to scale by integrating GPU chiplets on an interposer substrate; however, they are hindered by their GPU-centric design, i.e., off-chip GPU bandwidth is statically (at design time) allocated to local versus remote memory accesses. This paper presents the memory-centric MCM-GPU architecture. By connecting the HBM stacks on the interposer, rather than the GPUs, and by connecting the GPUs to bridges on the interposer network, the full off-chip GPU bandwidth can be dynamically allocated to local and remote memory accesses. Preliminary results demonstrate the potential of the memory-centric architecture, which offers an average 1.36× (and up to 1.90×) performance improvement over a GPU-centric architecture.
IEEE Computer Architecture Letters, vol. 24, no. 1, pp. 101-104. Journal Article.
Citations: 0
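The static-versus-dynamic bandwidth argument in the memory-centric MCM-GPU abstract can be illustrated with a small throughput model. This is only a sketch under stated assumptions: the function names, the single `local_share` parameter, and the idea that throughput is limited purely by link bandwidth are all hypothetical simplifications, not the authors' simulator or methodology, and interposer network contention is ignored.

```python
def gpu_centric_throughput(total_bw, local_share, local_fraction):
    """Sustained memory throughput when off-chip bandwidth is split at design time.

    total_bw       : total off-chip bandwidth of one GPU chiplet (e.g. GB/s)
    local_share    : fraction of total_bw wired to local memory (fixed at design time)
    local_fraction : fraction of the workload's accesses that target local memory
    """
    local_bw = total_bw * local_share
    remote_bw = total_bw * (1.0 - local_share)
    limits = []
    # Local traffic (throughput * local_fraction) must fit in the local links.
    if local_fraction > 0:
        limits.append(local_bw / local_fraction)
    # Remote traffic must fit in the remote links.
    if local_fraction < 1:
        limits.append(remote_bw / (1.0 - local_fraction))
    return min(limits)


def memory_centric_throughput(total_bw, local_fraction):
    """Assumed ideal case: every off-chip link can carry local or remote traffic.

    local_fraction is accepted only for symmetry; it does not change the result.
    """
    return total_bw


if __name__ == "__main__":
    for f in (0.5, 0.9, 1.0):
        static = gpu_centric_throughput(1000.0, 0.5, f)
        dynamic = memory_centric_throughput(1000.0, f)
        print(f"local fraction {f:.1f}: static {static:.0f}, dynamic {dynamic:.0f}")
```

In this toy model a 50/50 static split matches the dynamic design only when the workload's access mix is also 50/50; a fully local workload is limited to half of the total bandwidth.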
Accelerating Control Flow on CGRAs via Speculative Iteration Execution
IF 1.4; Zone 3, Computer Science; Q4, Computer Science, Hardware & Architecture; Pub Date: 2025-03-26; DOI: 10.1109/LCA.2025.3554777
Heng Cao;Zhipeng Wu;Dejian Li;Peiguang Jing;Sio Hang Pun;Yu Liu
Coarse-Grained Reconfigurable Arrays (CGRAs) offer a promising architecture for accelerating general-purpose, compute-intensive tasks. However, handling control flow within these tasks remains a challenge for CGRAs. Current methods for handling control flow in CGRAs execute condition operations before selecting branch paths, which adds extra execution time. This article proposes a CGRA architecture that decouples the control flow condition and path selection within an iteration through speculative iteration execution (SIE), where the condition is predicted before the start of the current iteration. Compared to existing methods, the SIE CGRA achieves a geometric mean speedup of $1.31\times$ over Partial Predication, $1.17\times$ over Dynamic-II Pipeline, and $1.12\times$ over Dual-Issue Single-Execution.
IEEE Computer Architecture Letters, vol. 24, no. 1, pp. 109-112. Journal Article.
Citations: 0
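The benefit of predicting the condition before an iteration starts, as described in the SIE abstract, can be sketched with a simple cycle-count model. Everything here is an assumption made for illustration: the cycle parameters, the idea that the condition evaluation fully overlaps with the speculated path, and the flat misprediction penalty are hypothetical and do not describe the paper's microarchitecture.

```python
def baseline_cycles(iterations, cond_cycles, path_cycles):
    """Condition is evaluated first, then the selected path runs (serialized)."""
    return iterations * (cond_cycles + path_cycles)


def speculative_cycles(iterations, cond_cycles, path_cycles,
                       accuracy, mispredict_penalty):
    """Path starts immediately on a predicted condition (assumed model).

    The real condition is assumed to be checked in parallel with the path,
    so a correctly predicted iteration costs max(cond_cycles, path_cycles).
    Each mispredicted iteration pays an extra flat penalty.
    """
    per_iteration = max(cond_cycles, path_cycles)
    mispredicted = iterations * (1.0 - accuracy)
    return iterations * per_iteration + mispredicted * mispredict_penalty


if __name__ == "__main__":
    base = baseline_cycles(100, cond_cycles=2, path_cycles=8)
    spec = speculative_cycles(100, cond_cycles=2, path_cycles=8,
                              accuracy=0.9, mispredict_penalty=10)
    print(f"baseline {base} cycles, speculative {spec:.1f} cycles, "
          f"speedup {base / spec:.2f}x")
```

The model makes the trade-off visible: speculation only pays off while the cycles saved by hiding the condition exceed the expected misprediction cost.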
Exploiting Intel AMX Power Gating
IF 1.4; Zone 3, Computer Science; Q4, Computer Science, Hardware & Architecture; Pub Date: 2025-03-26; DOI: 10.1109/LCA.2025.3555183
Joshua Kalyanapu;Farshad Dizani;Azam Ghanbari;Darsh Asher;Samira Mirbagher Ajorpaz
We identify a novel vulnerability in Intel AMX’s dynamic power performance scaling, enabling NetLoki, a stealthy and high-performance remote speculative attack that bypasses traditional cache defenses and leaks arbitrary addresses over a realistic network where other attacks fail. NetLoki shows a 34,900% improvement in leakage rate over NetSpectre. We show that NetLoki evades detection by three state-of-the-art microarchitectural attack detectors (EVAX, PerSpectron, RHMD) and that mitigating it via timer coarsening requires the system’s timer resolution to be made 20,000× coarser (10 µs) than the standard 0.5 ns hardware timer. Finally, we analyze the root cause of the leakage and propose an effective defense. We show that the mitigation increases CPU power consumption by 12.33%.
IEEE Computer Architecture Letters, vol. 24, no. 1, pp. 113-116. Journal Article.
Citations: 0
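The two headline figures in the NetLoki abstract are easier to read as ratios. The short check below only redoes the arithmetic from the numbers quoted in the abstract; the helper names are made up for this sketch, and exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction


def percent_improvement_to_ratio(percent):
    """An improvement of p percent means the new value is (1 + p/100) times the old."""
    return 1 + Fraction(percent, 100)


def coarsening_factor(coarse_seconds, fine_seconds):
    """How many times coarser one timer resolution is than another."""
    return coarse_seconds / fine_seconds


if __name__ == "__main__":
    ratio = percent_improvement_to_ratio(34900)
    factor = coarsening_factor(Fraction(10, 10**6),   # 10 microseconds
                               Fraction(5, 10**10))   # 0.5 nanoseconds
    print(f"34,900% improvement = {ratio}x the NetSpectre leakage rate")  # 350x
    print(f"10 us vs 0.5 ns = {factor}x coarser timer")                   # 20000x
```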
X-PPR: Post Package Repair for CXL Memory
IF 1.4; Zone 3, Computer Science; Q4, Computer Science, Hardware & Architecture; Pub Date: 2025-03-21; DOI: 10.1109/LCA.2025.3552190
Chihun Song;Michael Jaemin Kim;Yan Sun;Houxiang Ji;Kyungsan Kim;TaeKyeong Ko;Jung Ho Ahn;Nam Sung Kim
CXL is an emerging interface that can cost-efficiently expand the capacity and bandwidth of servers, recycling DRAM modules from retired servers. Such DRAM modules, however, will likely have many uncorrectable faulty words due to years of strenuous use in datacenters. To repair faulty words in the field, a few solutions based on Post Package Repair (PPR) and memory offlining have been proposed. Nonetheless, they are either unable to fix thousands of faulty words or prone to causing severe memory fragmentation, as they operate at the granularity of DRAM row and memory page addresses, respectively. In this work, for cost-efficient use of recycled DRAM modules with thousands of faulty words, we propose CXL-PPR (X-PPR), exploiting CXL’s support for near-memory processing and variable memory access latency. We demonstrate that X-PPR implemented in a commercial CXL device with DDR4 DRAM modules can handle a faulty bit probability that is $3.3 \times 10^{4}$ higher than ECC for a 512GB DRAM module. Meanwhile, X-PPR negligibly degrades the performance of popular memory-intensive benchmarks, which is achieved through two mechanisms designed in X-PPR to minimize the performance impact of additional DRAM accesses required for repairing faulty words.
IEEE Computer Architecture Letters, vol. 24, no. 1, pp. 97-100. Journal Article.
Citations: 0
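The granularity point in the X-PPR abstract (page-level offlining fragments memory when faulty words are scattered) can be made concrete with a small calculation. The sketch assumes an 8-byte word and a 4 KiB page, and a worst case with one faulty word per page; these sizes and the helper name are illustrative assumptions, and the code models only how much memory is affected at each granularity, not X-PPR's repair mechanism.

```python
WORD_BYTES = 8       # assumed word size
PAGE_BYTES = 4096    # assumed OS page size


def affected_bytes(faulty_word_addrs, granule_bytes):
    """Bytes covered by all granules that contain at least one faulty word."""
    granules = {addr // granule_bytes for addr in faulty_word_addrs}
    return len(granules) * granule_bytes


if __name__ == "__main__":
    # Hypothetical worst case: 1000 faulty words, each on a different page.
    faults = [i * PAGE_BYTES for i in range(1000)]
    by_page = affected_bytes(faults, PAGE_BYTES)
    by_word = affected_bytes(faults, WORD_BYTES)
    print(f"page granularity: {by_page} bytes taken out of use")
    print(f"word granularity: {by_word} bytes need repair")
    print(f"ratio: {by_page // by_word}x")
```

With these assumed sizes, handling each fault at page granularity touches 512 times more memory than handling it at word granularity.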
SDT: Cutting Datacenter Tax Through Simultaneous Data-Delivery Threads
IF 1.4; Zone 3, Computer Science; Q4, Computer Science, Hardware & Architecture; Pub Date: 2025-03-11; DOI: 10.1109/LCA.2025.3549423
Amin Mamandipoor;Huy Dinh Tran;Mohammad Alian
Networking is considered a datacenter tax, and hyperscalers push hard to provide high-performance networking with minimal resource expenditure. To keep up with the ever-increasing network rates, many CPU cycles are spent on the networking tax. We make a key observation that network processing threads can be simultaneously executed on server CPUs with minimal interference with the application threads. However, utilizing simultaneous multithreading (SMT) to scale the number of network threads with the number of application threads has two drawbacks: (1) it fails to meet the strict tail latency requirements of latency-critical applications, and (2) it reduces the number of hardware threads available to application processes, thus contributing to a high datacenter network tax. In this work, we design, implement, and evaluate a chip-multiprocessor (CMP) with specialized Simultaneous Data-delivery Threads (SDT) per physical core. The key insight is that with judicious partitioning at the architectural level, SDT can safely co-run with application processes with guaranteed performance isolation. Our evaluation results, using full-system simulation, show that a 20-core CMP enhanced with SDT reduces the area and power consumption of a baseline 40-core CMP by 47.5% and 66%, respectively, while reducing network throughput by less than 10%.
IEEE Computer Architecture Letters, vol. 24, no. 1, pp. 93-96. Journal Article.
Citations: 0
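The SDT abstract reports area, power, and throughput changes separately; combining them gives a rough lower bound on efficiency. The snippet below is only arithmetic on the quoted percentages, treating "less than 10%" throughput loss as a retained throughput of at least 0.9 of the baseline. The helper name is hypothetical, and the derived ratios are not figures reported by the authors.

```python
from fractions import Fraction


def efficiency_gain(throughput_retained, cost_reduction):
    """Throughput per unit of cost (area or power), relative to the baseline."""
    return throughput_retained / (1 - cost_reduction)


if __name__ == "__main__":
    retained = Fraction(9, 10)  # throughput drops by less than 10%
    per_area = efficiency_gain(retained, Fraction(475, 1000))  # area down 47.5%
    per_watt = efficiency_gain(retained, Fraction(66, 100))    # power down 66%
    print(f"throughput per area:  more than {float(per_area):.2f}x baseline")
    print(f"throughput per watt:  more than {float(per_watt):.2f}x baseline")
```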
Data-Pattern-Driven LUT for Efficient In-Cache Computing in CNNs Acceleration
IF 1.4; Zone 3, Computer Science; Q4, Computer Science, Hardware & Architecture; Pub Date: 2025-03-05; DOI: 10.1109/LCA.2025.3548080
Zhengpan Fei;Mingchuan Lyu;Satoshi Kawakami;Koji Inoue
Lookup table (LUT)-based Processing-in-Memory (PIM) solutions perform computations by looking up precomputed results stored in LUTs, providing exceptional efficiency for complex operations such as multiplication and making them highly suitable for energy- and latency-efficient Convolutional Neural Network (CNN) inference tasks. However, naively including all possible results in the LUT demands exponential hardware resources, significantly limiting parallelism and increasing hardware area, latency, and power overhead. While decomposition and compression techniques can reduce the LUT size, they also introduce considerable memory access overhead and additional operations. To address these challenges, we conduct an extensive analysis to identify which data portions significantly impact accuracy in CNNs. Based on the insight that key data is concentrated in a small range, we propose a data-pattern-driven (DPD) optimization strategy, which approximates less critical data to drastically reduce LUT size while preserving computational efficiency with acceptable accuracy loss.
IEEE Computer Architecture Letters, vol. 24, no. 1, pp. 81-84. Journal Article.
Citations: 0
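The LUT trade-off in the data-pattern-driven abstract can be shown with a toy multiplier. A full table for two n-bit unsigned operands needs 2^(2n) entries; a table restricted to a small operand range is far smaller but must approximate values outside that range. The clamping rule used below is a stand-in chosen for this sketch, not the paper's DPD approximation, and all function names are hypothetical.

```python
def full_lut_entries(bits):
    """Entries needed to store every product of two unsigned `bits`-bit operands."""
    return 1 << (2 * bits)


def build_lut(limit):
    """Precompute products a*b for all operands in [0, limit)."""
    return [[a * b for b in range(limit)] for a in range(limit)]


def lut_multiply(lut, a, b):
    """Look up a*b; operands outside the table are clamped (assumed approximation)."""
    top = len(lut) - 1
    return lut[min(a, top)][min(b, top)]


if __name__ == "__main__":
    print(full_lut_entries(4), full_lut_entries(8))   # 256 65536
    small = build_lut(8)                # covers operands 0..7 only, 64 entries
    print(lut_multiply(small, 3, 5))    # exact: 15
    print(lut_multiply(small, 3, 12))   # approximated as 3*7 = 21 (true value 36)
```

The second lookup shows the accuracy cost: the approach only works well if most operands fall inside the covered range, which is the data-pattern observation the abstract relies on.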
DPWatch: A Framework for Hardware-Based Differential Privacy Guarantees
IF 1.4; Zone 3, Computer Science; Q4, Computer Science, Hardware & Architecture; Pub Date: 2025-03-04; DOI: 10.1109/LCA.2025.3547262
Pawan Kumar Sanjaya;Christina Giannoula;Ian Colbert;Ihab Amer;Mehdi Saeedi;Gabor Sines;Nandita Vijaykumar
Differential privacy (DP) and federated learning (FL) have emerged as important privacy-preserving approaches when using sensitive data to train machine learning models. FL ensures that raw sensitive data does not leave the users’ devices by training the model in a distributed manner. DP ensures that the model does not leak any information about an individual by clipping and adding noise to the gradients. However, real-life deployments of such algorithms assume that the third-party application implementing DP-based FL is trusted, and is thus given access to sensitive data on the data owner’s device/server. In this work, we propose DPWatch, a hardware-based framework for ML accelerators that guarantees that a third-party application cannot leak sensitive user data used for training and ensures that the gradients are appropriately noised before leaving the device. We evaluate DPWatch on two accelerators and demonstrate small area and performance overheads.
IEEE Computer Architecture Letters, vol. 24, no. 1, pp. 89-92. Journal Article.
Citations: 0
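The "clipping and adding noise to the gradients" step mentioned in the DPWatch abstract follows the general pattern of differentially private SGD. A minimal software sketch of that pattern is given below, assuming per-example L2 clipping and Gaussian noise scaled by the clipping norm; the function names and parameters are illustrative, and this is not DPWatch's hardware mechanism.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale a gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


def privatize(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm   # noise scales with the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(clipped) for v in noisy]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the demo is repeatable
    grads = [[3.0, 4.0], [0.0, 0.5]]
    print(privatize(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Clipping bounds how much any single example can move the update, and the noise is calibrated to that bound, which is why both steps must happen before the gradient leaves the device.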
A Characterization of Generative Recommendation Models: Study of Hierarchical Sequential Transduction Unit
IF 1.4; Zone 3, Computer Science; Q4, Computer Science, Hardware & Architecture; Pub Date: 2025-02-28; DOI: 10.1109/LCA.2025.3546811
Taehun Kim;Yunjae Lee;Juntaek Lim;Minsoo Rhu
Recommendation systems are crucial for personalizing user experiences on online platforms. While Deep Learning Recommendation Models (DLRMs) have been the state of the art for nearly a decade, their scalability is limited, as model quality scales poorly with compute. Recently, there have been research efforts applying the Transformer architecture to recommendation systems, and the Hierarchical Sequential Transduction Unit (HSTU), an encoder architecture, has been proposed to address scalability challenges. Although HSTU-based generative recommenders show significant potential, they have received little attention from computer architects. In this paper, we analyze the inference process of HSTU-based generative recommenders and perform an in-depth characterization of the model. Our findings indicate that the attention mechanism is a major performance bottleneck. We further discuss promising research directions and optimization strategies that can potentially enhance the efficiency of HSTU models.
IEEE Computer Architecture Letters, vol. 24, no. 1, pp. 85-88. Journal Article.
Citations: 0
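One common reason attention becomes a bottleneck, as the HSTU characterization abstract reports, is that its cost grows quadratically with sequence length while the projection layers grow only linearly. The operation-count sketch below assumes a generic Transformer-style layer with four dense projections and counts a multiply-add as 2 FLOPs; it is not a model of HSTU's exact layer structure, and the function names are made up for illustration.

```python
def attention_flops(seq_len, head_dim, num_heads):
    """FLOPs for score computation (Q @ K^T) plus the weighted sum (scores @ V)."""
    per_matmul = 2 * seq_len * seq_len * head_dim
    return num_heads * 2 * per_matmul


def projection_flops(seq_len, model_dim):
    """FLOPs for four (seq_len x model_dim) @ (model_dim x model_dim) projections."""
    return 4 * (2 * seq_len * model_dim * model_dim)


if __name__ == "__main__":
    heads, head_dim = 8, 64
    model_dim = heads * head_dim  # 512
    for n in (1024, 2048, 4096):
        a = attention_flops(n, head_dim, heads)
        p = projection_flops(n, model_dim)
        print(f"seq_len {n}: attention/projection FLOP ratio = {a / p:.1f}")
```

Under these assumptions the two costs are equal at a sequence length of twice the model dimension, and attention's share doubles each time the sequence length doubles.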
Thor: A Non-Speculative Value Dependent Timing Side Channel Attack Exploiting Intel AMX
IF 1.4; Zone 3, Computer Science; Q4, Computer Science, Hardware & Architecture; Pub Date: 2025-02-27; DOI: 10.1109/LCA.2025.3544989
Farshad Dizani;Azam Ghanbari;Joshua Kalyanapu;Darsh Asher;Samira Mirbagher Ajorpaz
The rise of on-chip accelerators signifies a major shift in computing, driven by the growing demands of artificial intelligence (AI) and specialized applications. These accelerators have gained popularity due to their ability to substantially boost performance, cut energy usage, lower total cost of ownership (TCO), and promote sustainability. Intel's Advanced Matrix Extensions (AMX) is one such on-chip accelerator, specifically designed for handling tasks involving large matrix multiplications commonly used in machine learning (ML) models, image processing, and other computation-heavy operations. In this paper, we introduce a novel value-dependent timing side-channel vulnerability in Intel AMX. By exploiting this weakness, we demonstrate a software-based, value-dependent timing side-channel attack capable of inferring the sparsity of neural network weights without requiring any knowledge of the confidence score, privileged access, or physical proximity. Our attack method can fully recover the sparsity of weights assigned to 64 input elements within 50 minutes, which is 631% faster than the maximum leakage rate achieved in the Hertzbleed attack.
IEEE Computer Architecture Letters, vol. 24, no. 1, pp. 69-72. Journal Article.
Citations: 0
A Data Prefetcher-Based 1000-Core RISC-V Processor for Efficient Processing of Graph Neural Networks
IF 1.4; Zone 3, Computer Science; Q4, Computer Science, Hardware & Architecture; Pub Date: 2025-02-26; DOI: 10.1109/LCA.2025.3545799
Omer Khan
Graph-based neural networks have seen tremendous adoption to perform complex predictive analytics on massive real-world graphs. The trend in hardware acceleration has identified significant challenges with harnessing graph locality and workload imbalance due to ultra-sparse and irregular matrix computations at a massively parallel scale. State-of-the-art hardware accelerators utilize massive multithreading and asynchronous execution in GPUs to achieve parallel performance at high power consumption. This paper aims to bridge the power-performance gap using the energy-efficiency-centric RISC-V ecosystem. A 1000-core RISC-V processor is proposed to unlock massive parallelism in the graph-based matrix operators and to provide a low-latency data access paradigm in hardware, achieving robust power-performance scaling. Each core implements a single-threaded pipeline with a novel graph-aware data prefetcher at the 1000-core scale to deliver an average 20× performance-per-watt advantage over a state-of-the-art NVIDIA GPU.
IEEE Computer Architecture Letters, vol. 24, no. 1, pp. 73-76. Journal Article.
Citations: 0
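The irregular accesses that motivate a graph-aware prefetcher in the 1000-core RISC-V abstract come from indirect indexing: the feature row to load is only known after reading the neighbor index. The sketch below models this in software over a CSR graph and records which feature rows a look-ahead scheme would request. The look-ahead `distance` policy and all names are assumptions for illustration; the abstract does not describe how the paper's prefetcher works.

```python
def aggregate_with_prefetch_trace(row_ptr, col_idx, features, distance):
    """Sum neighbor features per node (CSR layout) and log look-ahead prefetch targets.

    row_ptr  : CSR row pointers, length num_nodes + 1
    col_idx  : CSR column indices (neighbor ids)
    features : one feature vector per node
    distance : how many edges ahead of the current edge to prefetch (assumed policy)
    """
    num_nodes = len(row_ptr) - 1
    dim = len(features[0])
    out = [[0.0] * dim for _ in range(num_nodes)]
    prefetches = []
    for node in range(num_nodes):
        for e in range(row_ptr[node], row_ptr[node + 1]):
            ahead = e + distance
            if ahead < len(col_idx):
                # The neighbor id read here tells us which feature row is needed soon.
                prefetches.append(col_idx[ahead])
            nbr = col_idx[e]
            for k in range(dim):
                out[node][k] += features[nbr][k]
    return out, prefetches


if __name__ == "__main__":
    row_ptr = [0, 2, 3, 4]          # node 0 -> {1, 2}, node 1 -> {0}, node 2 -> {0}
    col_idx = [1, 2, 0, 0]
    features = [[1.0], [2.0], [4.0]]
    out, trace = aggregate_with_prefetch_trace(row_ptr, col_idx, features, distance=1)
    print(out)    # [[6.0], [1.0], [1.0]]
    print(trace)  # [2, 0, 0]
```

The trace shows that the rows to fetch follow the edge list rather than any regular stride, which is why a conventional stride prefetcher helps little on this access pattern.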