
Latest Publications in IEEE Computer Architecture Letters

Optically Connected Multi-Stack HBM Modules for Large Language Model Training and Inference
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-02-18 · DOI: 10.1109/LCA.2025.3540058
Yanghui Ou;Hengrui Zhang;Austin Rovinski;David Wentzlaff;Christopher Batten
Large language models (LLMs) have grown exponentially in size, presenting significant challenges to traditional memory architectures. Current high bandwidth memory (HBM) systems are constrained by chiplet I/O bandwidth and the limited number of HBM stacks that can be integrated due to packaging constraints. In this letter, we propose a novel memory system architecture that leverages silicon photonic interconnects to increase memory capacity and bandwidth for compute devices. By introducing optically connected multi-stack HBM modules, we extend the HBM memory system off the compute chip, significantly increasing the number of HBM stacks. Our evaluations show that this architecture can improve training efficiency for a trillion-parameter model by 1.4× compared to a modeled A100 baseline, while also enhancing inference performance by 4.2× if the L2 is modified to provide sufficient bandwidth.
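As a rough illustration of the capacity gap that motivates moving HBM off the compute chip, the sketch below sizes a trillion-parameter training job against a single package's worth of stacks. All constants are assumptions for illustration (roughly 16 bytes per parameter for mixed-precision Adam-style training, 24 GB per HBM stack, 6 on-package stacks), not figures from the letter.

```python
# Back-of-the-envelope sizing: why a trillion-parameter model outgrows the
# handful of HBM stacks that fit on a single compute package.
PARAMS = 1e12                      # trillion-parameter model
BYTES_PER_PARAM_TRAINING = 16      # assumed: FP16 weights/grads + FP32 optimizer state
STACK_CAPACITY_GB = 24             # assumed capacity of one HBM stack
ON_PACKAGE_STACKS = 6              # assumed number of co-packaged stacks

total_gb = PARAMS * BYTES_PER_PARAM_TRAINING / 1e9
stacks_needed = total_gb / STACK_CAPACITY_GB
on_package_gb = ON_PACKAGE_STACKS * STACK_CAPACITY_GB

print(f"training footprint : {total_gb / 1e3:.1f} TB")
print(f"stacks required    : {stacks_needed:.0f}")
print(f"on-package capacity: {on_package_gb} GB "
      f"({on_package_gb / total_gb * 100:.2f}% of the footprint)")
```

Under these assumptions a single package covers well under 1% of the training footprint, which is the shortfall the optically connected multi-stack modules are meant to close.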
{"title":"Optically Connected Multi-Stack HBM Modules for Large Language Model Training and Inference","authors":"Yanghui Ou;Hengrui Zhang;Austin Rovinski;David Wentzlaff;Christopher Batten","doi":"10.1109/LCA.2025.3540058","DOIUrl":"https://doi.org/10.1109/LCA.2025.3540058","url":null,"abstract":"Large language models (LLMs) have grown exponentially in size, presenting significant challenges to traditional memory architectures. Current high bandwidth memory (HBM) systems are constrained by chiplet I/O bandwidth and the limited number of HBM stacks that can be integrated due to packaging constraints. In this letter, we propose a novel memory system architecture that leverages silicon photonic interconnects to increase memory capacity and bandwidth for compute devices. By introducing optically connected multi-stack HBM modules, we extend the HBM memory system off the compute chip, significantly increasing the number of HBM stacks. Our evaluations show that this architecture can improve training efficiency for a trillion-parameter model by 1.4× compared to a modeled A100 baseline, while also enhancing inference performance by 4.2× if the L2 is modified to provide sufficient bandwidth.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"49-52"},"PeriodicalIF":1.4,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143496530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Comprehensive Design Space Exploration for Graph Neural Network Aggregation on GPUs
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-02-06 · DOI: 10.1109/LCA.2025.3539371
Hyunwoo Nam;Jay Hwan Lee;Shinhyung Yang;Yeonsoo Kim;Jiun Jeong;Jeonggeun Kim;Bernd Burgstaller
Graph neural networks (GNNs) have become the state-of-the-art technology for extracting and predicting data representations on graphs. With increasing demand to accelerate GNN computations, the GPU has become the dominant platform for GNN training and inference. GNNs consist of a compute-bound combination phase and a memory-bound aggregation phase. The memory access patterns of the aggregation phase remain a major performance bottleneck on GPUs, despite recent microarchitectural enhancements. Although GNN characterizations have been conducted to investigate this bottleneck, they did not reveal the impact of architectural modifications. However, a comprehensive understanding of improvements from such modifications is imperative to devise GPU optimizations for the aggregation phase. In this letter, we explore the GPU design space for aggregation by assessing the performance improvement potential of a series of architectural modifications. We find that the low locality of aggregation deteriorates performance with increased thread-level parallelism, and a significant enhancement follows memory access optimizations, which remain effective even with software optimization. Our analysis provides insights for hardware optimizations to significantly improve GNN aggregation on GPUs.
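The memory-bound aggregation phase profiled here reduces to an irregular gather-and-reduce over each vertex's neighbor list. A minimal CSR-based sketch (plain NumPy, not the authors' code) makes the access pattern explicit: feature rows are fetched in neighbor order, so locality depends entirely on graph structure.

```python
import numpy as np

def aggregate_sum(indptr, indices, features):
    """Sum-aggregate neighbor features for every vertex of a CSR graph.

    indptr, indices : CSR adjacency (indptr has num_vertices + 1 entries)
    features        : (num_vertices, feat_dim) dense feature matrix

    The gather `features[indices[start:end]]` touches rows in neighbor order,
    which is the irregular, cache-unfriendly traffic that dominates the
    aggregation phase on GPUs.
    """
    num_vertices = len(indptr) - 1
    out = np.zeros_like(features)
    for v in range(num_vertices):
        start, end = indptr[v], indptr[v + 1]
        out[v] = features[indices[start:end]].sum(axis=0)
    return out

# Tiny example graph with edges 0-1, 0-2, 1-2 stored as directed pairs.
indptr = np.array([0, 2, 4, 6])
indices = np.array([1, 2, 0, 2, 0, 1])
features = np.arange(6, dtype=np.float32).reshape(3, 2)
print(aggregate_sum(indptr, indices, features))
```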
{"title":"Comprehensive Design Space Exploration for Graph Neural Network Aggregation on GPUs","authors":"Hyunwoo Nam;Jay Hwan Lee;Shinhyung Yang;Yeonsoo Kim;Jiun Jeong;Jeonggeun Kim;Bernd Burgstaller","doi":"10.1109/LCA.2025.3539371","DOIUrl":"https://doi.org/10.1109/LCA.2025.3539371","url":null,"abstract":"Graph neural networks (GNNs) have become the state-of-the-art technology for extracting and predicting data representations on graphs. With increasing demand to accelerate GNN computations, the GPU has become the dominant platform for GNN training and inference. GNNs consist of a compute-bound combination phase and a memory-bound aggregation phase. The memory access patterns of the aggregation phase remain a major performance bottleneck on GPUs, despite recent microarchitectural enhancements. Although GNN characterizations have been conducted to investigate this bottleneck, they did not reveal the impact of architectural modifications. However, a comprehensive understanding of improvements from such modifications is imperative to devise GPU optimizations for the aggregation phase. In this letter, we explore the GPU design space for aggregation by assessing the performance improvement potential of a series of architectural modifications. We find that the low locality of aggregation deteriorates performance with increased thread-level parallelism, and a significant enhancement follows memory access optimizations, which remain effective even with software optimization. Our analysis provides insights for hardware optimizations to significantly improve GNN aggregation on GPUs.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"45-48"},"PeriodicalIF":1.4,"publicationDate":"2025-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143480835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Editorial: A Letter From the Editor-in-Chief of IEEE Computer Architecture Letters
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-01-28 · DOI: 10.1109/LCA.2025.3528276
Sudhanva Gurumurthi;Mattan Erez
{"title":"Editorial: A Letter From the Editor-in-Chief of IEEE Computer Architecture Letters","authors":"Sudhanva Gurumurthi;Mattan Erez","doi":"10.1109/LCA.2025.3528276","DOIUrl":"https://doi.org/10.1109/LCA.2025.3528276","url":null,"abstract":"","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"iii-iv"},"PeriodicalIF":1.4,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10856691","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143105557","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
RoPIM: A Processing-in-Memory Architecture for Accelerating Rotary Positional Embedding in Transformer Models
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-01-28 · DOI: 10.1109/LCA.2025.3535470
Yunhyeong Jeon;Minwoo Jang;Hwanjun Lee;Yeji Jung;Jin Jung;Jonggeon Lee;Jinin So;Daehoon Kim
The emergence of attention-based Transformer models, such as GPT, BERT, and LLaMA, has revolutionized Natural Language Processing (NLP) by significantly improving performance across a wide range of applications. A critical factor driving these improvements is the use of positional embeddings, which are crucial for capturing the contextual relationships between tokens in a sequence. However, current positional embedding methods face challenges, particularly in managing performance overhead for long sequences and effectively capturing relationships between adjacent tokens. In response, Rotary Positional Embedding (RoPE) has emerged as a method that effectively embeds positional information with high accuracy and without necessitating model retraining even with long sequences. Despite its effectiveness, RoPE introduces a considerable performance bottleneck during inference. We observe that RoPE accounts for 61% of GPU execution time due to extensive data movement and execution dependencies. In this paper, we introduce RoPIM, a Processing-In-Memory (PIM) architecture designed to efficiently accelerate RoPE operations in Transformer models. RoPIM achieves this by utilizing a bank-level accelerator that reduces off-chip data movement through in-accelerator support for multiply-addition operations and minimizes operational dependencies via parallel data rearrangement. Additionally, RoPIM proposes an optimized data mapping strategy that leverages both bank-level and row-level mappings to enable parallel execution, eliminate bank-to-bank communication, and reduce DRAM activations. Our experimental results show that RoPIM achieves up to a 307.9× performance improvement and 914.1× energy savings compared to conventional systems.
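For context on the operation being offloaded: RoPE rotates each even/odd feature pair of the query and key vectors by a position-dependent angle, so the whole transform is element-wise multiply-adds with almost no data reuse. The NumPy sketch below is the standard RoPE formulation, not RoPIM's bank-level mapping.

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary positional embedding to x of shape (seq_len, dim).

    Each even/odd feature pair (x[2i], x[2i+1]) at position m is rotated by
    the angle m * theta_i, where theta_i = base**(-2i/dim). The operation is
    a pure element-wise multiply-add over the whole tensor, which is why it
    is dominated by data movement rather than arithmetic.
    """
    seq_len, dim = x.shape
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)  # (dim/2,)
    angles = pos * inv_freq                           # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

print(rope(np.random.randn(4, 8)).shape)  # (4, 8)
```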
{"title":"RoPIM: A Processing-in-Memory Architecture for Accelerating Rotary Positional Embedding in Transformer Models","authors":"Yunhyeong Jeon;Minwoo Jang;Hwanjun Lee;Yeji Jung;Jin Jung;Jonggeon Lee;Jinin So;Daehoon Kim","doi":"10.1109/LCA.2025.3535470","DOIUrl":"https://doi.org/10.1109/LCA.2025.3535470","url":null,"abstract":"The emergence of attention-based Transformer models, such as GPT, BERT, and LLaMA, has revolutionized Natural Language Processing (NLP) by significantly improving performance across a wide range of applications. A critical factor driving these improvements is the use of positional embeddings, which are crucial for capturing the contextual relationships between tokens in a sequence. However, current positional embedding methods face challenges, particularly in managing performance overhead for long sequences and effectively capturing relationships between adjacent tokens. In response, Rotary Positional Embedding (RoPE) has emerged as a method that effectively embeds positional information with high accuracy and without necessitating model retraining even with long sequences. Despite its effectiveness, RoPE introduces a considerable performance bottleneck during inference. We observe that RoPE accounts for 61% of GPU execution time due to extensive data movement and execution dependencies. In this paper, we introduce <monospace>RoPIM</monospace>, a Processing-In-Memory (PIM) architecture designed to efficiently accelerate RoPE operations in Transformer models. <monospace>RoPIM</monospace> achieves this by utilizing a bank-level accelerator that reduces off-chip data movement through in-accelerator support for multiply-addition operations and minimizes operational dependencies via parallel data rearrangement. Additionally, <monospace>RoPIM</monospace> proposes an optimized data mapping strategy that leverages both bank-level and row-level mappings to enable parallel execution, eliminate bank-to-bank communication, and reduce DRAM activations. Our experimental results show that <monospace>RoPIM</monospace> achieves up to a 307.9× performance improvement and 914.1× energy savings compared to conventional systems.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"41-44"},"PeriodicalIF":1.4,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143455148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
GPU-Centric Memory Tiering for LLM Serving With NVIDIA Grace Hopper Superchip
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-01-23 · DOI: 10.1109/LCA.2025.3533588
Woohyung Choi;Jinwoo Jeong;Hanhwi Jang;Jeongseob Ahn
This study investigates the performance of serving large language models (LLMs) with a focus on the high-bandwidth interconnect between GPU and CPU using a real NVIDIA Grace Hopper Superchip. This architecture features a GPU-centric memory tiering system, comprising a performance tier with GPU memory and a capacity tier with host memory. We revisit a conventional pipelined execution for LLM inference, utilizing host memory connected via NVLink alongside GPU memory. For the Llama-3.1 8B base (FP16) model, such a GPU-centric tiered memory system meets the target latency requirements for both prefill and decoding while improving throughput compared to the in-memory case, where all model weights are maintained in GPU memory. However, even with NVLink-connected CPU memory, meeting latency constraints for large models like the 70B and 405B FP16 models remains challenging. To address this, we explore the efficacy of model quantization (e.g., AWQ) along with the pipelined execution. Our evaluation reveals that the model quantization makes the pipelined execution a viable solution for serving large models. For the Llama-3.1 70B and 405B AWQ models, we show that the pipelined execution achieves 1.6× and 2.9× throughput improvement, respectively, compared to the in-memory only case, while meeting the latency constraint.
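The pipelined execution revisited here overlaps the NVLink transfer of the next layer's weights from the host-memory capacity tier with the computation of the current layer on the GPU. A simple analytic sketch under assumed per-layer compute and transfer times (illustrative values only, not measurements from the letter) shows why steady-state cost approaches the slower of the two stages rather than their sum.

```python
def serial_time(layers, t_compute, t_transfer):
    """Fetch weights, then compute, layer by layer (no overlap)."""
    return layers * (t_compute + t_transfer)

def pipelined_time(layers, t_compute, t_transfer):
    """Prefetch layer i+1's weights while computing layer i.

    Steady-state cost per layer is max(compute, transfer); only the first
    transfer is exposed.
    """
    return t_transfer + layers * max(t_compute, t_transfer)

# Illustrative numbers: 80 decoder layers, 1.0 ms of compute per layer,
# 0.8 ms to pull that layer's weights over NVLink from host memory.
layers, t_c, t_x = 80, 1.0, 0.8
print(f"serial   : {serial_time(layers, t_c, t_x):.1f} ms")
print(f"pipelined: {pipelined_time(layers, t_c, t_x):.1f} ms")
```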
{"title":"GPU-Centric Memory Tiering for LLM Serving With NVIDIA Grace Hopper Superchip","authors":"Woohyung Choi;Jinwoo Jeong;Hanhwi Jang;Jeongseob Ahn","doi":"10.1109/LCA.2025.3533588","DOIUrl":"https://doi.org/10.1109/LCA.2025.3533588","url":null,"abstract":"This study investigates the performance of serving large language models (LLMs) with a focus on the high-bandwidth interconnect between GPU and CPU using a real NVIDIA Grace Hopper Superchip. This architecture features a GPU-centric memory tiering system, comprising a performance tier with GPU memory and a capacity tier with host memory. We revisit a conventional pipelined execution for LLM inference, utilizing host memory connected via NVLink alongside GPU memory. For the Llama-3.1 8B base (FP16) model, such a GPU-centric tiered memory system meets the target latency requirements for both prefill and decoding while improving throughput compared to the in-memory case, where all model weights are maintained in GPU memory. However, even with NVLink-connected CPU memory, meeting latency constraints for large models like the 70B and 405B FP16 models remains challenging. To address this, we explore the efficacy of model quantization (e.g., AWQ) along with the pipelined execution. Our evaluation reveals that the model quantization makes the pipelined execution a viable solution for serving large models. For the Llama-3.1 70B and 405B AWQ models, we show that the pipelined execution achieves 1.6× and 2.9× throughput improvement, respectively, compared to the in-memory only case, while meeting the latency constraint.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"33-36"},"PeriodicalIF":1.4,"publicationDate":"2025-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143388529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
2024 Reviewers List
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-01-22 · DOI: 10.1109/LCA.2025.3528619
{"title":"2024 Reviewers List","authors":"","doi":"10.1109/LCA.2025.3528619","DOIUrl":"https://doi.org/10.1109/LCA.2025.3528619","url":null,"abstract":"","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"i-ii"},"PeriodicalIF":1.4,"publicationDate":"2025-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10849623","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
SPAM: Streamlined Prefetcher-Aware Multi-Threaded Cache Covert-Channel Attack
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-01-14 · DOI: 10.1109/LCA.2025.3529213
E. Kritheesh;Biswabandan Panda
Last-level cache (LLC) covert-channels exploit the cache timing differences to transmit information. In recent works, the attacks rely on a single sender and a single receiver. Streamline is the state-of-the-art cache covert channel attack that uses a shared array of addresses mapped to the payload bits, allowing parallelization of the encoding and decoding of bits. As multi-core systems are ubiquitous, multiple senders and receivers can be used to create a high bandwidth cache covert channel. However, this is not the case, and the bandwidth per thread is limited by various factors. We extend Streamline to a multi-threaded Streamline, where the senders buffer a few thousand bits at the LLC for the receivers to decode. We observe that these buffered bits are prone to eviction by the co-running processes before they are decoded. We propose SPAM, a multi-threaded covert-channel at the LLC. SPAM shows that fewer but faster senders must encode for more receivers to reduce this time frame. This ensures resilience to noise coming from cache activities of co-running applications. SPAM uses two different access patterns for the sender(s) and the receiver(s). The sender access pattern of the addresses is modified to leverage the hardware prefetchers to accelerate the loads while encoding. The receiver access pattern circumvents the hardware prefetchers for accurate load latency measurements. We demonstrate SPAM on a six-core (12-threaded) system, achieving a bit-rate of 12.21 MB/s at an error rate of 9.02% which is an improvement of over 70% over the state-of-the-art multi-threaded Streamline for comparable error rates when 50% of the co-running threads stress the cache system.
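At the receiver, decoding reduces to classifying each probe's load latency against a hit/miss threshold: addresses the sender brought into the LLC return quickly and decode as 1, evicted ones decode as 0. The sketch below illustrates only that thresholding step over already-measured latencies (illustrative numbers, not SPAM's measurement loop, which must also sidestep the hardware prefetchers).

```python
def decode_bits(latencies_cycles, threshold=90):
    """Map measured load latencies to payload bits: a fast load (LLC hit)
    decodes as 1, a slow load (miss served from DRAM) decodes as 0.
    The threshold sits between the two latency populations and is
    calibrated per machine."""
    return [1 if lat < threshold else 0 for lat in latencies_cycles]

# Illustrative latencies in cycles: three LLC hits, one DRAM miss, one hit.
print(decode_bits([45, 52, 48, 260, 50]))   # -> [1, 1, 1, 0, 1]
```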
{"title":"SPAM: Streamlined Prefetcher-Aware Multi-Threaded Cache Covert-Channel Attack","authors":"E. Kritheesh;Biswabandan Panda","doi":"10.1109/LCA.2025.3529213","DOIUrl":"https://doi.org/10.1109/LCA.2025.3529213","url":null,"abstract":"Last-level cache (LLC) covert-channels exploit the cache timing differences to transmit information. In recent works, the attacks rely on a single sender and a single receiver. Streamline is the state-of-the-art cache covert channel attack that uses a shared array of addresses mapped to the payload bits, allowing parallelization of the encoding and decoding of bits. As multi-core systems are ubiquitous, multiple senders and receivers can be used to create a high bandwidth cache covert channel. However, this is not the case, and the bandwidth per thread is limited by various factors. We extend Streamline to a multi-threaded Streamline, where the senders buffer a few thousand bits at the LLC for the receivers to decode. We observe that these buffered bits are prone to eviction by the co-running processes before they are decoded. We propose SPAM, a multi-threaded covert-channel at the LLC. SPAM shows that fewer but faster senders must encode for more receivers to reduce this time frame. This ensures resilience to noise coming from cache activities of co-running applications. SPAM uses two different access patterns for the sender(s) and the receiver(s). The sender access pattern of the addresses is modified to leverage the hardware prefetchers to accelerate the loads while encoding. The receiver access pattern circumvents the hardware prefetchers for accurate load latency measurements. We demonstrate SPAM on a six-core (12-threaded) system, achieving a bit-rate of 12.21 MB/s at an error rate of 9.02% which is an improvement of over 70% over the state-of-the-art multi-threaded Streamline for comparable error rates when 50% of the co-running threads stress the cache system.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"25-28"},"PeriodicalIF":1.4,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143105562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Accelerating Page Migrations in Operating Systems With Intel DSA
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-01-14 · DOI: 10.1109/LCA.2025.3530093
Jongho Baik;Jonghyeon Kim;Chang Hyun Park;Jeongseob Ahn
Modern server-class CPUs are introducing special-purpose accelerators on the same chip to improve performance and efficiency for data-intensive applications. This paper presents a case for accelerating data migrations in operating systems with the Data Streaming Accelerator (DSA), a new feature by Intel. To the best of our knowledge, this is the first study that exploits a hardware-assisted data migration scheme in the operating system. We identify which Linux kernel components can benefit from the hardware acceleration, particularly focusing on the kernel subsystems that rely on the migrate_pages() kernel function. As the hardware accelerator is not suitable for transferring a small amount of data due to the HW setup overhead, this preliminary study concentrates on the design and implementation of accelerating migrate_pages() with DSA. We prototype a DSA-enabled Linux kernel and evaluate its effectiveness through two benchmarks demonstrating real-world page compaction (kcompactd) and promotion (kdamond) scenarios. In both cases, our prototype demonstrates improved throughput in page migration, benefiting both the kernel subsystem and applications.
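The observation that DSA does not pay off for small transfers can be captured with a break-even model: offloading wins only once the bandwidth advantage amortizes the fixed descriptor-submission and completion overhead. The sketch below uses assumed numbers (setup cost and copy bandwidths are illustrative, not measured values from the paper).

```python
PAGE_BYTES   = 4096
CPU_GBPS     = 10.0    # assumed memcpy-style copy bandwidth on one core
DSA_GBPS     = 30.0    # assumed DSA copy bandwidth
DSA_SETUP_US = 2.0     # assumed fixed cost to build/submit/poll a descriptor

def cpu_copy_us(pages):
    return pages * PAGE_BYTES / (CPU_GBPS * 1e3)   # 1 GB/s = 1e3 bytes/us

def dsa_copy_us(pages):
    return DSA_SETUP_US + pages * PAGE_BYTES / (DSA_GBPS * 1e3)

# Smallest migration batch for which offloading to DSA is faster than the CPU.
batch = next(n for n in range(1, 10_000) if dsa_copy_us(n) < cpu_copy_us(n))
print(f"DSA offload breaks even at a batch of {batch} pages "
      f"({batch * PAGE_BYTES // 1024} KiB)")
```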
{"title":"Accelerating Page Migrations in Operating Systems With Intel DSA","authors":"Jongho Baik;Jonghyeon Kim;Chang Hyun Park;Jeongseob Ahn","doi":"10.1109/LCA.2025.3530093","DOIUrl":"https://doi.org/10.1109/LCA.2025.3530093","url":null,"abstract":"Modern server-class CPUs are introducing special-purpose accelerators on the same chip to improve performance and efficiency for data-intensive applications. This paper presents a case for accelerating data migrations in operating systems with the Data Streaming Accelerator (DSA), a new feature by Intel. To the best of our knowledge, this is the first study that exploits a hardware-assisted data migration scheme in the operating system. We identify which Linux kernel components can benefit from the hardware acceleration, particularly focusing on the kernel subsystems that rely on the <monospace>migrate_pages()</monospace> kernel function. As the hardware accelerator is not suitable for transferring a small amount of data due to the HW setup overhead, this preliminary study concentrates on the design and implementation of accelerating <monospace>migrate_pages()</monospace> with DSA. We prototype a DSA-enabled Linux kernel and evaluate its effectiveness through two benchmarks demonstrating real-world page compaction (<monospace>kcompactd</monospace>) and promotion (<monospace>kdamond</monospace>) scenarios. In both cases, our prototype demonstrates improved throughput in page migration, benefiting both the kernel subsystem and applications.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"37-40"},"PeriodicalIF":1.4,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143388528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Cooperative Memory Deduplication With Intel Data Streaming Accelerator
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-01-09 · DOI: 10.1109/LCA.2025.3527458
Houxiang Ji;Minho Kim;Seonmu Oh;Daehoon Kim;Nam Sung Kim
Memory deduplication plays a critical role in reducing memory consumption and the total cost of ownership (TCO) in hyperscalers, particularly as the advent of large language models imposes unprecedented demands on memory resources. However, conventional CPU-based memory deduplication can interfere with co-running applications, significantly impacting the performance of time-sensitive workloads. Intel introduced the on-chip Data Streaming Accelerator (DSA), providing high-performance data movement and transformation capabilities, including comparison and checksum calculation, which are heavily utilized in the deduplication. In this work, we enhance a widely-used kernel-space memory deduplication feature, Kernel Samepage Merging (ksm), by selectively offloading these operations to the DSA. Our evaluation demonstrates that CPU-based ksm can lead to 5.0–10.9× increase in the tail latency of co-running applications while DSA-based ksm limits the latency increase to just 1.6× while achieving comparable memory savings.
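Conceptually, the work being offloaded is the scan loop of same-page merging: checksum each candidate page, and only byte-compare pages whose checksums collide before merging them. A minimal user-space sketch of that flow (pure Python with CRC32 as a stand-in checksum, not the ksm implementation or the DSA API) is shown below.

```python
import zlib
from collections import defaultdict

PAGE_SIZE = 4096

def dedup_pages(pages):
    """Return (unique_pages, merged_count) for a list of page-sized byte
    strings. Checksum calculation and full byte comparison are the two
    operations a DSA-assisted ksm would offload."""
    by_checksum = defaultdict(list)   # checksum -> indices into `unique`
    unique, merged = [], 0
    for page in pages:
        crc = zlib.crc32(page)
        for idx in by_checksum[crc]:
            if unique[idx] == page:   # byte-wise compare rules out collisions
                merged += 1
                break
        else:
            by_checksum[crc].append(len(unique))
            unique.append(page)
    return unique, merged

zeros = bytes(PAGE_SIZE)
pages = [zeros, b"A" * PAGE_SIZE, zeros, zeros]
uniq, merged = dedup_pages(pages)
print(f"{len(pages)} pages -> {len(uniq)} unique, {merged} merged")
```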
{"title":"Cooperative Memory Deduplication With Intel Data Streaming Accelerator","authors":"Houxiang Ji;Minho Kim;Seonmu Oh;Daehoon Kim;Nam Sung Kim","doi":"10.1109/LCA.2025.3527458","DOIUrl":"https://doi.org/10.1109/LCA.2025.3527458","url":null,"abstract":"Memory deduplication plays a critical role in reducing memory consumption and the total cost of ownership (TCO) in hyperscalers, particularly as the advent of large language models imposes unprecedented demands on memory resources. However, conventional CPU-based memory deduplication can interfere with co-running applications, significantly impacting the performance of time-sensitive workloads. Intel introduced the <italic>on-chip</i> Data Streaming Accelerator (DSA), providing high-performance data movement and transformation capabilities, including comparison and checksum calculation, which are heavily utilized in the deduplication. In this work, we enhance a widely-used kernel-space memory deduplication feature, Kernel Samepage Merging (<monospace>ksm</monospace>), by selectively offloading these operations to the DSA. Our evaluation demonstrates that CPU-based <monospace>ksm</monospace> can lead to 5.0–10.9× increase in the tail latency of co-running applications while DSA-based <monospace>ksm</monospace> limits the latency increase to just 1.6× while achieving comparable memory savings.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"29-32"},"PeriodicalIF":1.4,"publicationDate":"2025-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143105563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
High-Performance Winograd Based Accelerator Architecture for Convolutional Neural Network
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-01-08 · DOI: 10.1109/LCA.2025.3525970
Vardhana M;Rohan Pinto
Convolutional Neural Networks are deployed mostly on GPUs or CPUs. However, due to the increasing complexity of architecture and growing performance requirements, these platforms may not be suitable for deploying inference engines. ASIC and FPGA implementations are appearing as superior alternatives to software-based solutions for achieving the required performance. In this article, an efficient architecture for accelerating convolution using the Winograd transform is proposed and implemented on FPGA. The proposed accelerator consumes 38% less resources as compared with conventional GEMM-based implementation. Analysis results indicate that our accelerator can achieve 3.5 TOP/s, 1.28 TOP/s, and 1.42 TOP/s for VGG16, ResNet18, and MobileNetV2 CNNs, respectively, at 250 MHz. The proposed accelerator demonstrates the best energy efficiency as compared with prior arts.
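For readers unfamiliar with the transform, the minimal 1D case F(2,3) already shows the trade the accelerator exploits: two outputs of a 3-tap convolution computed with 4 multiplications instead of 6, at the cost of small input, filter, and output transforms. The sketch below is the textbook formulation, not the 2D tiling or datapath used in the letter.

```python
import numpy as np

# Winograd F(2,3): y = A_T @ ((G @ g) * (B_T @ d)) for a 3-tap filter g
# applied to a 4-element input tile d, producing 2 outputs with 4 multiplies.
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)
G   = np.array([[1.0, 0.0, 0.0],
                [0.5, 0.5, 0.5],
                [0.5, -0.5, 0.5],
                [0.0, 0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    return A_T @ ((G @ g) * (B_T @ d))   # only 4 element-wise multiplies

d = np.array([1.0, 2.0, 3.0, 4.0])       # input tile
g = np.array([0.5, 0.25, -1.0])          # filter taps
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
assert np.allclose(winograd_f23(d, g), direct)
print(winograd_f23(d, g))
```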
{"title":"High-Performance Winograd Based Accelerator Architecture for Convolutional Neural Network","authors":"Vardhana M;Rohan Pinto","doi":"10.1109/LCA.2025.3525970","DOIUrl":"https://doi.org/10.1109/LCA.2025.3525970","url":null,"abstract":"Convolutional Neural Networks are deployed mostly on GPUs or CPUs. However, due to the increasing complexity of architecture and growing performance requirements, these platforms may not be suitable for deploying inference engines. ASIC and FPGA implementations are appearing as superior alternatives to software-based solutions for achieving the required performance. In this article, an efficient architecture for accelerating convolution using the Winograd transform is proposed and implemented on FPGA. The proposed accelerator consumes 38% less resources as compared with conventional GEMM-based implementation. Analysis results indicate that our accelerator can achieve 3.5 TOP/s, 1.28 TOP/s, and 1.42 TOP/s for VGG16, ResNet18, and MobileNetV2 CNNs, respectively, at 250 MHz. The proposed accelerator demonstrates the best energy efficiency as compared with prior arts.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"21-24"},"PeriodicalIF":1.4,"publicationDate":"2025-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0