Latest Publications from IEEE Computer Architecture Letters

Editorial: A Letter From the Editor-in-Chief of IEEE Computer Architecture Letters
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 Computer Science, Hardware & Architecture | Pub Date: 2025-01-28 | DOI: 10.1109/LCA.2025.3528276
Sudhanva Gurumurthi;Mattan Erez
{"title":"Editorial: A Letter From the Editor-in-Chief of IEEE Computer Architecture Letters","authors":"Sudhanva Gurumurthi;Mattan Erez","doi":"10.1109/LCA.2025.3528276","DOIUrl":"https://doi.org/10.1109/LCA.2025.3528276","url":null,"abstract":"","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"iii-iv"},"PeriodicalIF":1.4,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10856691","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143105557","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Hardware-Accelerated Kernel-Space Memory Compression Using Intel QAT
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 Computer Science, Hardware & Architecture | Pub Date: 2025-01-28 | DOI: 10.1109/LCA.2025.3534831
Qirong Xia;Houxiang Ji;Yang Zhou;Nam Sung Kim
Data compression has been widely used by datacenters to decrease the consumption of not only memory and storage capacity but also interconnect bandwidth. Nonetheless, the CPU cycles consumed for data compression notably contribute to the overall datacenter taxes. To provide a cost-efficient data compression capability for datacenters, Intel has introduced QuickAssist Technology (QAT), a PCIe-attached data-compression accelerator. In this work, we first comprehensively evaluate the compression/decompression performance of the latest on-chip QAT accelerator and then compare it with that of the previous-generation off-chip QAT accelerator. Subsequently, as a compelling application for QAT, we take a Linux kernel memory-optimization feature, the compressed cache for swap pages (zswap), re-implement it to use QAT efficiently, and then compare the performance of QAT-based zswap with that of CPU-based zswap. Our evaluation shows that deploying CPU-based zswap increases the tail latency of a co-running latency-sensitive application, Redis, by 3.2-12.1×, while QAT-based zswap does not notably increase the tail latency compared to running without zswap.
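For a sense of where the CPU cost of software compression comes from, the sketch below compresses 4 KiB page-sized buffers with zlib on the host cores; the qat_compress stub is a purely hypothetical stand-in for an asynchronous submission to an offload engine such as QAT, not Intel's actual API.

```python
# Minimal sketch (not the authors' code): the inline CPU cost of compressing
# 4 KiB pages, the work that an accelerator-backed zswap would move off the cores.
import os
import time
import zlib

PAGE_SIZE = 4096
# Mildly compressible synthetic pages: a 2 KiB random block repeated twice.
pages = [os.urandom(PAGE_SIZE // 2) * 2 for _ in range(2048)]

def cpu_compress(page: bytes) -> bytes:
    # Inline compression burns CPU cycles on the swap-out path.
    return zlib.compress(page, 1)

def qat_compress(page: bytes) -> bytes:
    # Hypothetical offload: the CPU would only build a descriptor and later
    # reap a completion; zlib is reused here only to keep the sketch runnable.
    return zlib.compress(page, 1)

start = time.perf_counter()
compressed = [cpu_compress(p) for p in pages]
elapsed = time.perf_counter() - start
ratio = sum(len(c) for c in compressed) / (len(pages) * PAGE_SIZE)
print(f"CPU path: {elapsed * 1e3:.1f} ms for {len(pages)} pages, ratio {ratio:.2f}")
```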
Citations: 0
Toward Scalable RDMA Through Resource Prefetching
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 Computer Science, Hardware & Architecture | Pub Date: 2025-01-27 | DOI: 10.1109/LCA.2025.3534188
Zhenlong Ma;Ning Kang;Fan Yang;Chongyang Hong;Jing Xu;Guojun Yuan;Peiheng Zhang;Zhan Wang;Ninghui Sun
RDMA networks are being widely deployed in data centers, high-performance computing, and AI clusters. By offloading the network protocol stack to hardware, RDMA bypasses the operating system kernel, thereby enabling high performance and low CPU overhead. However, protocol processing demands substantial communication resources, and because hardware resources are limited, commercial NICs (Network Interface Cards) experience a significant number of cache misses in large-scale connection scenarios. This results in performance degradation, indicating that RDMA lacks scalability. In this paper, we first analyze the characteristics of resource access in RDMA. Based on these characteristics, we propose a resource access prediction and prefetching mechanism in hardware, which preemptively fetches the resources required by the protocol processing pipeline into the on-chip cache. This mechanism increases the NIC's cache hit ratio. Evaluation results demonstrate that our approach improves throughput by 125% and reduces latency by 17.9% under large-scale communication scenarios.
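The prefetching idea can be illustrated with a toy model (all names, sizes, and the predictor are assumptions, not the authors' design): the NIC's on-chip context memory is treated as an LRU cache of connection contexts, and a simple last-transition predictor prefetches the context it expects to be needed next.

```python
# Toy model: NIC context cache as an LRU, plus a "predict the next QP from the
# last observed transition" prefetcher, run on a synthetic trace with repeating
# phases. Shows why prefetching lifts the hit ratio when the working set
# exceeds on-chip capacity; all parameters are illustrative.
import random
from collections import OrderedDict

CACHE_ENTRIES = 256
NUM_QPS = 4096

def run(trace, prefetch):
    cache, last, pred, hits = OrderedDict(), None, {}, 0
    def touch(qp):
        cache[qp] = True
        cache.move_to_end(qp)
        while len(cache) > CACHE_ENTRIES:
            cache.popitem(last=False)      # evict LRU entry
    for qp in trace:
        if qp in cache:
            hits += 1
        touch(qp)
        if prefetch and last is not None:
            pred[last] = qp                # remember the observed transition
            nxt = pred.get(qp)
            if nxt is not None:
                touch(nxt)                 # prefetch the predicted next context
        last = qp
    return hits / len(trace)

random.seed(0)
phases = [random.sample(range(NUM_QPS), 512) for _ in range(8)]
trace = [qp for ph in phases for _ in range(20) for qp in ph]  # repeating phases
print("hit ratio, no prefetch:", round(run(trace, False), 3))
print("hit ratio, prefetch   :", round(run(trace, True), 3))
```

On this thrashing trace, plain LRU almost never hits, while the predictor turns most accesses into hits; that is the same effect the letter pursues in hardware.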
Citations: 0
GPU-Centric Memory Tiering for LLM Serving With NVIDIA Grace Hopper Superchip
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 Computer Science, Hardware & Architecture | Pub Date: 2025-01-23 | DOI: 10.1109/LCA.2025.3533588
Woohyung Choi;Jinwoo Jeong;Hanhwi Jang;Jeongseob Ahn
This study investigates the performance of serving large language models (LLMs) with a focus on the high-bandwidth interconnect between GPU and CPU using a real NVIDIA Grace Hopper Superchip. This architecture features a GPU-centric memory tiering system, comprising a performance tier with GPU memory and a capacity tier with host memory. We revisit a conventional pipelined execution for LLM inference, utilizing host memory connected via NVLink alongside GPU memory. For the Llama-3.1 8B base (FP16) model, such a GPU-centric tiered memory system meets the target latency requirements for both prefill and decoding while improving throughput compared to the in-memory case, where all model weights are maintained in GPU memory. However, even with NVLink-connected CPU memory, meeting latency constraints for large models like the 70B and 405B FP16 models remains challenging. To address this, we explore the efficacy of model quantization (e.g., AWQ) along with the pipelined execution. Our evaluation reveals that the model quantization makes the pipelined execution a viable solution for serving large models. For the Llama-3.1 70B and 405B AWQ models, we show that the pipelined execution achieves 1.6× and 2.9× throughput improvement, respectively, compared to the in-memory only case, while meeting the latency constraint.
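The benefit of overlapping weight fetch with compute can be seen with a back-of-envelope pipeline model; every number below (layer count, link bandwidth, per-layer compute time) is an assumed round figure, not a measurement from the letter.

```python
# Back-of-envelope model (assumed numbers): layer-wise pipelining of weight
# transfer from host memory over an NVLink-C2C-class link, overlapped with GPU
# compute. In steady state each step is bounded by the slower stage.
LAYERS = 80                          # e.g., a 70B-class model (assumption)
BYTES_PER_LAYER = 70e9 * 2 / LAYERS  # FP16 weights, split evenly (assumption)
LINK_BW = 450e9                      # bytes/s, assumed link bandwidth
COMPUTE_PER_LAYER = 2.5e-3           # seconds of GPU work per layer (assumption)

transfer_per_layer = BYTES_PER_LAYER / LINK_BW

serial_step = LAYERS * (transfer_per_layer + COMPUTE_PER_LAYER)
pipelined_step = LAYERS * max(transfer_per_layer, COMPUTE_PER_LAYER)

print(f"transfer/layer : {transfer_per_layer * 1e3:.2f} ms")
print(f"serial step    : {serial_step * 1e3:.1f} ms")
print(f"pipelined step : {pipelined_step * 1e3:.1f} ms")
print(f"speedup        : {serial_step / pipelined_step:.2f}x")
```

Under these assumptions the pipelined step is transfer-bound; shrinking BYTES_PER_LAYER by roughly 4× (as a 4-bit quantization like AWQ would) flips the bound to compute, which is the intuition behind combining quantization with pipelined execution.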
Citations: 0
2024 Reviewers List
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 Computer Science, Hardware & Architecture | Pub Date: 2025-01-22 | DOI: 10.1109/LCA.2025.3528619
{"title":"2024 Reviewers List","authors":"","doi":"10.1109/LCA.2025.3528619","DOIUrl":"https://doi.org/10.1109/LCA.2025.3528619","url":null,"abstract":"","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"i-ii"},"PeriodicalIF":1.4,"publicationDate":"2025-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10849623","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Accelerating Page Migrations in Operating Systems With Intel DSA
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 Computer Science, Hardware & Architecture | Pub Date: 2025-01-14 | DOI: 10.1109/LCA.2025.3530093
Jongho Baik;Jonghyeon Kim;Chang Hyun Park;Jeongseob Ahn
Modern server-class CPUs are introducing special-purpose accelerators on the same chip to improve performance and efficiency for data-intensive applications. This paper presents a case for accelerating data migrations in operating systems with the Data Streaming Accelerator (DSA), a new feature by Intel. To the best of our knowledge, this is the first study that exploits a hardware-assisted data migration scheme in the operating system. We identify which Linux kernel components can benefit from the hardware acceleration, particularly focusing on the kernel subsystems that rely on the migrate_pages() kernel function. As the hardware accelerator is not suitable for transferring a small amount of data due to the HW setup overhead, this preliminary study concentrates on the design and implementation of accelerating migrate_pages() with DSA. We prototype a DSA-enabled Linux kernel and evaluate its effectiveness through two benchmarks demonstrating real-world page compaction (kcompactd) and promotion (kdamond) scenarios. In both cases, our prototype demonstrates improved throughput in page migration, benefiting both the kernel subsystem and applications.
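The observation that the accelerator is not worth invoking for small transfers reduces to a fixed setup cost versus a per-page saving; a toy break-even calculation with assumed latencies follows.

```python
# Toy break-even model (all latencies are assumptions, not measured values):
# a CPU copy has no setup cost, while a DSA-style offload pays a fixed
# descriptor submission/completion overhead but moves each page faster and
# off the CPU.
PAGE = 4096
CPU_NS_PER_PAGE = 600   # assumed memcpy cost per 4 KiB page
DSA_SETUP_NS = 3000     # assumed submission + completion overhead per batch
DSA_NS_PER_PAGE = 250   # assumed device copy cost per 4 KiB page

def cpu_ns(pages: int) -> float:
    return pages * CPU_NS_PER_PAGE

def dsa_ns(pages: int) -> float:
    return DSA_SETUP_NS + pages * DSA_NS_PER_PAGE

for pages in (1, 4, 8, 16, 64, 512):
    better = "DSA" if dsa_ns(pages) < cpu_ns(pages) else "CPU"
    print(f"{pages:4d} pages: cpu {cpu_ns(pages)/1e3:6.1f} us, "
          f"dsa {dsa_ns(pages)/1e3:6.1f} us -> {better}")
```

With these placeholder numbers the crossover sits around ten pages, which is consistent with targeting batch-heavy migration paths such as kcompactd and kdamond-driven promotion rather than single-page moves.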
Citations: 0
SPAM: Streamlined Prefetcher-Aware Multi-Threaded Cache Covert-Channel Attack
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 Computer Science, Hardware & Architecture | Pub Date: 2025-01-14 | DOI: 10.1109/LCA.2025.3529213
E. Kritheesh;Biswabandan Panda
Last-level cache (LLC) covert channels exploit cache timing differences to transmit information. In recent work, the attacks rely on a single sender and a single receiver. Streamline is the state-of-the-art cache covert-channel attack that uses a shared array of addresses mapped to the payload bits, allowing parallelization of the encoding and decoding of bits. As multi-core systems are ubiquitous, multiple senders and receivers could in principle be used to create a high-bandwidth cache covert channel. In practice, however, the aggregate bandwidth does not scale accordingly, and the bandwidth per thread is limited by various factors. We extend Streamline to a multi-threaded Streamline, where the senders buffer a few thousand bits at the LLC for the receivers to decode. We observe that these buffered bits are prone to eviction by the co-running processes before they are decoded. We propose SPAM, a multi-threaded covert channel at the LLC. SPAM shows that fewer but faster senders must encode for more receivers in order to shrink this buffering window. This ensures resilience to noise from the cache activity of co-running applications. SPAM uses two different access patterns for the sender(s) and the receiver(s). The sender access pattern is modified to leverage the hardware prefetchers to accelerate loads while encoding, whereas the receiver access pattern circumvents the hardware prefetchers for accurate load-latency measurements. We demonstrate SPAM on a six-core (12-thread) system, achieving a bit rate of 12.21 MB/s at an error rate of 9.02%, an improvement of over 70% over the state-of-the-art multi-threaded Streamline at comparable error rates when 50% of the co-running threads stress the cache system.
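The underlying encode/decode principle (shared addresses as bit positions, load latency as the signal) can be shown with a toy simulation; the synthetic latencies and the set standing in for the LLC are assumptions, and none of SPAM's prefetcher-aware or multi-threaded machinery is modeled here.

```python
# Toy simulation of a shared-array cache covert channel: the "cache" is a
# Python set and latencies are synthetic constants. This demonstrates only the
# encode/decode logic, not a working attack.
import random

HIT_NS, MISS_NS, THRESHOLD_NS = 40, 200, 100

def sender_encode(payload_bits, shared_array, cache):
    for bit, addr in zip(payload_bits, shared_array):
        if bit:                       # touching the address caches it
            cache.add(addr)

def receiver_decode(shared_array, cache):
    bits = []
    for addr in shared_array:
        latency = HIT_NS if addr in cache else MISS_NS   # timed reload
        bits.append(1 if latency < THRESHOLD_NS else 0)
    return bits

random.seed(1)
payload = [random.randint(0, 1) for _ in range(64)]
shared = [0x1000 + 64 * i for i in range(len(payload))]  # one cache line per bit
cache = set()
sender_encode(payload, shared, cache)
assert receiver_decode(shared, cache) == payload
print("decoded", len(payload), "bits without error (idealized, no eviction noise)")
```

The letter's contribution lies precisely in what this toy omits: evictions by co-running processes and the interaction of the access patterns with hardware prefetchers.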
Citations: 0
Cooperative Memory Deduplication With Intel Data Streaming Accelerator
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 Computer Science, Hardware & Architecture | Pub Date: 2025-01-09 | DOI: 10.1109/LCA.2025.3527458
Houxiang Ji;Minho Kim;Seonmu Oh;Daehoon Kim;Nam Sung Kim
Memory deduplication plays a critical role in reducing memory consumption and the total cost of ownership (TCO) in hyperscalers, particularly as the advent of large language models imposes unprecedented demands on memory resources. However, conventional CPU-based memory deduplication can interfere with co-running applications, significantly impacting the performance of time-sensitive workloads. Intel introduced the on-chip Data Streaming Accelerator (DSA), providing high-performance data movement and transformation capabilities, including comparison and checksum calculation, which are heavily utilized in the deduplication. In this work, we enhance a widely-used kernel-space memory deduplication feature, Kernel Samepage Merging (ksm), by selectively offloading these operations to the DSA. Our evaluation demonstrates that CPU-based ksm can lead to 5.0–10.9× increase in the tail latency of co-running applications while DSA-based ksm limits the latency increase to just 1.6× while achieving comparable memory savings.
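The byte-level work that ksm spends most of its cycles on, and that DSA can absorb, is checksumming candidate pages and fully comparing them before a merge. A minimal user-space sketch of that merge decision follows; the real ksm scanner, its stable/unstable trees, and the DSA submission path are not modeled.

```python
# Minimal sketch of same-page merging logic: hash pages to find merge
# candidates, confirm with a full byte comparison (the checksum/compare work
# that could be offloaded), and count how much memory deduplication saves.
import hashlib
import os

PAGE = 4096

def dedup(pages):
    stable = {}        # digest -> representative page contents
    merged = 0
    for p in pages:
        digest = hashlib.sha1(p).digest()      # cheap candidate filter
        rep = stable.get(digest)
        if rep is not None and rep == p:       # full compare before merging
            merged += 1                        # this page would map to rep
        else:
            stable[digest] = p
    return merged

zero = bytes(PAGE)
pages = [zero] * 300 + [os.urandom(PAGE) for _ in range(100)] + [zero] * 100
saved = dedup(pages)
print(f"merged {saved} of {len(pages)} pages ({saved * PAGE // 1024} KiB saved)")
```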
Citations: 0
High-Performance Winograd Based Accelerator Architecture for Convolutional Neural Network
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 Computer Science, Hardware & Architecture | Pub Date: 2025-01-08 | DOI: 10.1109/LCA.2025.3525970
Vardhana M;Rohan Pinto
Convolutional Neural Networks are deployed mostly on GPUs or CPUs. However, due to the increasing complexity of architecture and growing performance requirements, these platforms may not be suitable for deploying inference engines. ASIC and FPGA implementations are appearing as superior alternatives to software-based solutions for achieving the required performance. In this article, an efficient architecture for accelerating convolution using the Winograd transform is proposed and implemented on FPGA. The proposed accelerator consumes 38% less resources as compared with conventional GEMM-based implementation. Analysis results indicate that our accelerator can achieve 3.5 TOP/s, 1.28 TOP/s, and 1.42 TOP/s for VGG16, ResNet18, and MobileNetV2 CNNs, respectively, at 250 MHz. The proposed accelerator demonstrates the best energy efficiency as compared with prior arts.
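For reference, the Winograd F(2×2, 3×3) algorithm that such accelerators typically build around computes a 2×2 output tile from a 4×4 input tile with 16 element-wise multiplies instead of 36. The standard transform matrices and a numpy check against direct convolution are shown below; this is a numerical sketch, not the proposed hardware datapath.

```python
# Winograd F(2x2, 3x3): Y = A^T [ (G g G^T) * (B^T d B) ] A, with the standard
# transform matrices. Verified against direct valid (cross-correlation style)
# convolution on one tile.
import numpy as np

B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

def winograd_f2x2_3x3(d, g):
    U = G @ g @ G.T          # transformed 3x3 filter -> 4x4
    V = B_T @ d @ B_T.T      # transformed 4x4 input tile -> 4x4
    M = U * V                # 16 element-wise multiplies
    return A_T @ M @ A_T.T   # 2x2 output tile

def direct(d, g):
    return np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(2)]
                     for i in range(2)])

rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4))   # input tile
g = rng.standard_normal((3, 3))   # filter
assert np.allclose(winograd_f2x2_3x3(d, g), direct(d, g))
print(winograd_f2x2_3x3(d, g))
```

In an accelerator, U is typically precomputed offline per filter, so only the input and output transforms plus the element-wise multiplies remain on the datapath.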
Citations: 0
PINSim: A Processing In- and Near-Sensor Simulator to Model Intelligent Vision Sensors
IF 1.4 | CAS Tier 3 (Computer Science) | Q4 Computer Science, Hardware & Architecture | Pub Date: 2024-12-25 | DOI: 10.1109/LCA.2024.3522777
Sepehr Tabrizchi;Mehrdad Morsali;David Pan;Shaahin Angizi;Arman Roohi
This letter introduces PINSim, a user-friendly and flexible framework for simulating emerging smart vision sensors in the early design stages. PINSim enables the realization of integrated sensing and processing near and in the sensor, effectively addressing challenges such as data movement and power-hungry analog-to-digital converters. The framework offers a flexible interface and a wide range of design options for customizing the efficiency and accuracy of processing-near/in-sensor-based accelerators using a hierarchical structure. Its organization spans from the device level up to the algorithm level. PINSim realizes instruction-accurate evaluation of circuit-level performance metrics and achieves over a 25,000× speed-up compared to SPICE simulation, with an average error rate below 4.1%. Furthermore, it supports both multilayer perceptron (MLP) and convolutional neural network (CNN) models, with limitations determined by IoT budget constraints. By facilitating the exploration and optimization of various design parameters, PINSim empowers researchers and engineers to develop energy-efficient and high-performance smart vision sensors for a wide range of applications.
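The hierarchical roll-up such a simulator performs can be approximated with simple arithmetic: device-level per-MAC and per-ADC-conversion energies are composed upward to a small network and compared against shipping raw pixels off the sensor. All energy numbers below are placeholders, not PINSim's calibrated device models.

```python
# Hierarchical energy roll-up sketch (placeholder numbers, not PINSim's
# calibrated models): compose device-level costs up to a small in-sensor MLP
# and compare against digitizing and transmitting every raw pixel.
E_MAC_PJ = 0.05         # assumed analog in-sensor MAC energy (pJ)
E_ADC_PJ = 2.0          # assumed energy per ADC conversion (pJ)
E_TX_PJ_PER_BIT = 10.0  # assumed off-chip transmission energy (pJ/bit)

PIXELS = 32 * 32
LAYERS = [(PIXELS, 128), (128, 10)]   # small MLP running on the sensor

def in_sensor_energy_pj():
    macs = sum(fan_in * fan_out for fan_in, fan_out in LAYERS)
    adc_conversions = LAYERS[-1][1]   # digitize only the 10 network outputs
    return macs * E_MAC_PJ + adc_conversions * E_ADC_PJ

def raw_readout_energy_pj(bits_per_pixel=8):
    # Digitize and transmit every pixel for off-sensor processing.
    return PIXELS * (E_ADC_PJ + bits_per_pixel * E_TX_PJ_PER_BIT)

print(f"in-sensor MLP : {in_sensor_energy_pj() / 1e3:8.1f} nJ")
print(f"raw readout   : {raw_readout_energy_pj() / 1e3:8.1f} nJ")
```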
Citations: 0