
IEEE Computer Architecture Letters: Latest Articles

Architecting Compatible PIM Protocol for CPU-PIM Collaboration
IF 1.4; Zone 3 (Computer Science); Q4 (Computer Science, Hardware & Architecture); Pub Date: 2024-07-24; DOI: 10.1109/LCA.2024.3432936
Seunghyuk Yu;Hyeonu Kim;Kyoungho Jeun;Sunyoung Hwang;Eojin Lee
Processing in Memory (PIM) technology is gaining traction with the introduction of several prototype products. However, the interfaces of existing PIM devices hinder CPU performance excessively by delaying normal memory requests for long periods during PIM operations. In this paper, we propose a new PIM command and protocol designed for compatibility across various PIM devices and host processors, focusing on DRAM standards with limited command space. Our proposed command, PIM-ACT, activates multiple banks simultaneously while assigning the specific PIM operation. It closely follows the functionality of the ACT command for straightforward control by the memory controller. We also explore memory scheduling policies that balance the latency of conventional memory requests with the throughput of PIM workloads. Our evaluation demonstrates the effectiveness of our approach in optimizing both PIM and conventional workload performance.
Vol. 23, No. 2, pp. 183-186
Citations: 0
A Quantitative Analysis of State Space Model-Based Large Language Model: Study of Hungry Hungry Hippos
IF 1.4; Zone 3 (Computer Science); Q4 (Computer Science, Hardware & Architecture); Pub Date: 2024-07-03; DOI: 10.1109/LCA.2024.3422492
Dongho Yoon;Taehun Kim;Jae W. Lee;Minsoo Rhu
As the need for processing long contexts in large language models (LLMs) increases, attention-based LLMs face significant challenges due to their high computation and memory requirements. To overcome this challenge, there have been several recent works that seek to alleviate attention's system-level bottlenecks. An approach that has been receiving a lot of attention lately is state space models (SSMs), thanks to their ability to substantially reduce computational complexity and memory footprint. Despite the excitement around SSMs, there is a lack of in-depth characterization and analysis of this important model architecture. In this paper, we delve into a representative SSM named Hungry Hungry Hippos (H3), examining its advantages as well as its current limitations. We also discuss future research directions on improving the efficiency of SSMs via hardware architectural support.
Vol. 23, No. 2, pp. 154-157
Citations: 0
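The H3 abstract above credits state space models with lower computational complexity and memory footprint than attention. A minimal sketch of why, assuming a generic diagonal linear SSM rather than the actual H3 layer: the model keeps a fixed-size state and updates it once per token, so the cost grows linearly with sequence length. The function name `ssm_scan` and the parameters `a`, `b`, `c` are illustrative assumptions, not taken from the paper.

```python
from typing import List, Sequence


def ssm_scan(
    u: Sequence[float],
    a: Sequence[float],
    b: Sequence[float],
    c: Sequence[float],
) -> List[float]:
    """Run a diagonal linear state space model over the input sequence u.

    Per state channel i:  h[i] <- a[i] * h[i] + b[i] * u_t
    Output per step:      y_t  =  sum_i c[i] * h[i]

    Work is O(len(u) * len(a)) and the state size is fixed, in contrast to
    attention, whose pairwise scores grow quadratically with sequence length.
    """
    h = [0.0] * len(a)  # hidden state, one value per channel
    ys: List[float] = []
    for u_t in u:
        h = [a_i * h_i + b_i * u_t for a_i, h_i, b_i in zip(a, h, b)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys
```

Feeding an impulse such as `ssm_scan([1.0, 0.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[1.0])` shows the state decaying geometrically, which is the kind of long-range memory a single channel provides.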
Empirical Architectural Analysis on Performance Scalability of Petascale All-Flash Storage Systems
IF 1.4; Zone 3 (Computer Science); Q4 (Computer Science, Hardware & Architecture); Pub Date: 2024-06-25; DOI: 10.1109/LCA.2024.3418874
Mohammadamin Ajdari;Behrang Montazerzohour;Kimia Abdi;Hossein Asadi
In this paper, we first analyze a real storage system consisting of 72 SSDs utilizing either Hardware RAID (HW-RAID) or Software RAID (SW-RAID), and show that SW-RAID is up to 7× faster. We then reveal that with an increasing number of SSDs, the limited I/O parallelism in SAS controllers and multi-enclosure handshaking overheads cause a significant performance drop, reducing the total I/O Per Second (IOPS) of a 144-SSD system to less than that of a single SSD. Second, we disclose the most important architectural parameters that affect a large-scale storage system. Third, we propose a framework that models a large-scale storage system and estimates the system IOPS and system resource usage for various architectures. We verify our framework against a real system and show its high accuracy. Lastly, we analyze a use case of a 240-SSD system and reveal how our framework guides architects in storage system scaling.
Vol. 23, No. 2, pp. 158-161
Citations: 0
Accelerating Programmable Bootstrapping Targeting Contemporary GPU Microarchitecture
IF 1.4; Zone 3 (Computer Science); Q4 (Computer Science, Hardware & Architecture); Pub Date: 2024-06-24; DOI: 10.1109/LCA.2024.3418448
Hyesung Ji;Sangpyo Kim;Jaewan Choi;Jung Ho Ahn
Fully homomorphic encryption (FHE) enables computation on encrypted data without privacy leakage, among which GSW-based schemes are notable for supporting the evaluation of arbitrary univariate functions using programmable bootstrapping (PBS). Despite their wide applicability, their computational complexity in a single PBS impedes widespread adoption. However, at the application level, there are enough independent PBSs to achieve high data-level parallelism, making them suitable for running on GPUs known for their high parallel computing capability. On contemporary GPUs, peak integer performance has steadily increased, and the sizes of L2 cache and shared memory have also grown rapidly since the Volta architecture. Prior attempts to accelerate PBS on GPUs have fallen short due to their outdated implementations that cannot leverage recent GPU advances. In this paper, we introduce a GPU implementation that supports the latest PBS algorithm and incorporates GPU-trend-aware optimizations. Our implementation achieves a 10.8× performance improvement over the state-of-the-art (SOTA) GPU implementations on RTX 4090 and even outperforms the SOTA ASIC implementation.
Vol. 23, No. 2, pp. 207-210; PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10570278
Citations: 0
TeleVM: A Lightweight Virtual Machine for RISC-V Architecture
IF 2.3; Zone 3 (Computer Science); Q4 (Computer Science, Hardware & Architecture); Pub Date: 2024-04-30; DOI: 10.1109/LCA.2024.3394835
Tianzheng Li;Enfang Cui;Yuting Wu;Qian Wei;Yue Gao
Serverless computing has become an important paradigm in cloud computing due to its advantages such as fast large-scale deployment and pay-as-you-go charging model. Due to shared infrastructure and multi-tenant environments, serverless applications have high security requirements. Traditional virtual machines and containers cannot fully meet the requirements of serverless applications. Therefore, lightweight virtual machine technology has emerged, which can reduce overhead and boot time while ensuring security. In this letter, we propose TeleVM, a lightweight virtual machine for RISC-V architecture. TeleVM can achieve strong isolation through the hypervisor extension of RISC-V. Compared with traditional virtual machines, TeleVM only implements a small number of IO devices and functions, which can effectively reduce memory overhead and boot time. We compared TeleVM and QEMU+KVM through experiments. Compared to QEMU+KVM, the boot time and memory overhead of TeleVM have decreased by 74% and 90% respectively. This work further improves the cloud computing software ecosystem of RISC-V architecture and promotes the use of RISC-V architecture in cloud computing scenarios.
Vol. 23, No. 1, pp. 121-124
Citations: 0
Analysis of Data Transfer Bottlenecks in Commercial PIM Systems: A Study With UPMEM-PIM
IF 1.4; Zone 3 (Computer Science); Q4 (Computer Science, Hardware & Architecture); Pub Date: 2024-04-12; DOI: 10.1109/LCA.2024.3387472
Dongjae Lee;Bongjoon Hyun;Taehun Kim;Minsoo Rhu
Due to emerging workloads that require high memory bandwidth, Processing-in-Memory (PIM) has gained significant attention, leading to the introduction of several industrial PIM products that are integrated with conventional computing systems. This letter characterizes the data transfer overheads between conventional DRAM address space and PIM address space within a PIM-integrated system using the commercialized PIM device made by UPMEM. Our findings highlight the need for optimization in PIM-integrated systems to address these overheads, offering critical insights for future PIM technologies.
Vol. 23, No. 2, pp. 179-182
Citations: 0
GATe: Streamlining Memory Access and Communication to Accelerate Graph Attention Network With Near-Memory Processing
IF 2.3; Zone 3 (Computer Science); Q4 (Computer Science, Hardware & Architecture); Pub Date: 2024-04-10; DOI: 10.1109/LCA.2024.3386734
Shiyan Yi;Yudi Qiu;Lingfei Lu;Guohao Xu;Yong Gong;Xiaoyang Zeng;Yibo Fan
Graph Attention Network (GAT) has gained widespread adoption thanks to its exceptional performance. The critical components of a GAT model involve aggregation and attention, which cause numerous main-memory accesses. Recently, much research has proposed near-memory processing (NMP) architectures to accelerate aggregation. However, graph attention requires additional operations distinct from aggregation, making previous NMP architectures less suitable for supporting GAT. In this paper, we propose GATe, a practical and efficient GAT accelerator with NMP architecture. To the best of our knowledge, this is the first work that accelerates both attention and aggregation computation on DIMM. In the attention and aggregation phases, we unify feature vector access to reduce repetitive memory accesses and refine the computation flow to reduce communication. Furthermore, we introduce a novel sharding method that enhances data reusability. Experiments show that our work achieves substantial speedups of up to 6.77× and 2.46×, respectively, compared to state-of-the-art NMP works GNNear and GraNDe.
Vol. 23, No. 1, pp. 87-90
Citations: 0
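The GATe abstract above distinguishes an attention phase from an aggregation phase and notes that both touch the same feature vectors. The sketch below is a plain software reference for one single-head GAT layer in the standard formulation (score, softmax over neighbors, weighted sum). It is not the GATe dataflow. It assumes the features `z` have already been multiplied by the layer weight matrix, that the attention vector is split into hypothetical halves `a_src` and `a_dst`, and that every node has at least one neighbor (for example a self-loop).

```python
import math
from typing import Dict, List, Sequence


def leaky_relu(x: float, slope: float = 0.2) -> float:
    return x if x > 0 else slope * x


def dot(p: Sequence[float], q: Sequence[float]) -> float:
    return sum(p_k * q_k for p_k, q_k in zip(p, q))


def gat_layer(
    z: List[List[float]],
    neighbors: Dict[int, List[int]],
    a_src: Sequence[float],
    a_dst: Sequence[float],
) -> List[List[float]]:
    """Single-head GAT layer over pre-transformed node features z."""
    out: List[List[float]] = []
    for i, z_i in enumerate(z):
        nbrs = neighbors[i]  # assumed non-empty
        # Attention phase: one score per neighbor, reading z[j] once.
        scores = [leaky_relu(dot(a_src, z_i) + dot(a_dst, z[j])) for j in nbrs]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        alphas = [w / total for w in weights]
        # Aggregation phase: reads the same z[j] vectors a second time.
        out.append(
            [sum(alpha * z[j][d] for alpha, j in zip(alphas, nbrs))
             for d in range(len(z_i))]
        )
    return out
```

Each neighbor vector `z[j]` is read in both phases, which is the kind of repeated access the abstract says GATe unifies.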
An Area Efficient Architecture of a Novel Chaotic System for High Randomness Security in e-Health
IF 2.3; Zone 3 (Computer Science); Q4 (Computer Science, Hardware & Architecture); Pub Date: 2024-04-10; DOI: 10.1109/LCA.2024.3387352
Kyriaki Tsantikidou;Nicolas Sklavos
An e-Health application must be carefully designed, as a malicious attack has ethical and legal consequences. While common cryptography protocols enhance security, they also add high computation overhead. In this letter, an area efficient architecture of a novel chaotic system for high randomness security is proposed. It consists of the chaotic logistic map and a novel component that efficiently combines it with a block cipher's key generation function. The proposed architecture operates as both a key scheduling/management scheme and a stream cipher. All operations are implemented in an FPGA with appropriate resource utilization techniques. The proposed architecture reduces area consumption by at least 41.5% compared to published cryptography architectures and achieves a 5.7% increase in throughput-to-area efficiency compared to published chaotic designs. Finally, it passes all NIST randomness tests, exhibits the avalanche effect and produces the highest number of random bits with a single seed compared to other published security systems.
Vol. 23, No. 1, pp. 104-107
Citations: 0
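The abstract above builds on the chaotic logistic map and uses the result as a stream cipher. As a toy illustration of that idea only, the sketch below iterates the logistic map x ← r·x·(1 − x), thresholds each value to get one bit, and XORs the resulting keystream with the data. The function names, `r = 3.99`, the burn-in length and the 0.5 threshold are all assumptions. The paper's component that combines the map with a block cipher's key generation is not reproduced, and this floating-point version should not be treated as secure.

```python
def logistic_keystream(seed: float, n_bytes: int, r: float = 3.99,
                       burn_in: int = 100) -> bytes:
    """Derive n_bytes of keystream from the logistic map x <- r * x * (1 - x)."""
    if not 0.0 < seed < 1.0:
        raise ValueError("seed must lie strictly between 0 and 1")
    x = seed
    for _ in range(burn_in):  # discard the initial transient
        x = r * x * (1.0 - x)
    out = bytearray()
    for _ in range(n_bytes):
        byte = 0
        for _ in range(8):
            x = r * x * (1.0 - x)
            byte = (byte << 1) | (1 if x >= 0.5 else 0)  # one bit per iteration
        out.append(byte)
    return bytes(out)


def xor_cipher(data: bytes, seed: float) -> bytes:
    """Encrypt or decrypt by XOR with the keystream (the operation is symmetric)."""
    keystream = logistic_keystream(seed, len(data))
    return bytes(d ^ k for d, k in zip(data, keystream))
```

Applying `xor_cipher` twice with the same seed returns the original bytes, which is the defining property of an XOR stream cipher.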
The Importance of Generalizability in Machine Learning for Systems
IF 2.3; Zone 3 (Computer Science); Q4 (Computer Science, Hardware & Architecture); Pub Date: 2024-04-02; DOI: 10.1109/LCA.2024.3384449
Varun Gohil;Sundar Dev;Gaurang Upasani;David Lo;Parthasarathy Ranganathan;Christina Delimitrou
Using machine learning (ML) to tackle computer systems tasks is gaining popularity. One of the shortcomings of such ML-based approaches is the inability of models to generalize to out-of-distribution data, i.e., data whose distribution differs from that of the training dataset. We showcase that this issue exists in cloud environments by analyzing various ML models used to improve resource balance in Google's fleet. We discuss the trade-offs associated with different techniques used to detect out-of-distribution data. Finally, we propose and demonstrate the efficacy of using Bayesian models to detect the model's confidence in its output when used to improve cloud server resource balance.
Vol. 23, No. 1, pp. 95-98
Citations: 0
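The abstract above proposes Bayesian models as a way to gauge a model's confidence on out-of-distribution inputs. A minimal sketch of the underlying intuition, assuming textbook Bayesian linear regression with features [1, x], a Gaussian prior of precision `alpha` and known noise precision `beta` (none of which come from the paper): the predictive variance grows as a query moves away from the training inputs, so a large variance can serve as a low-confidence flag.

```python
from typing import List, Sequence


def fit_posterior_cov(xs: Sequence[float], alpha: float = 1.0,
                      beta: float = 25.0) -> List[List[float]]:
    """Posterior covariance of the weights for features phi(x) = [1, x].

    The posterior precision is A = alpha * I + beta * sum(phi phi^T);
    the 2x2 matrix is inverted in closed form.
    """
    a00 = alpha + beta * len(xs)
    a01 = beta * sum(xs)
    a11 = alpha + beta * sum(x * x for x in xs)
    det = a00 * a11 - a01 * a01
    return [[a11 / det, -a01 / det], [-a01 / det, a00 / det]]


def predictive_variance(x: float, cov: List[List[float]],
                        beta: float = 25.0) -> float:
    """Noise variance plus the weight uncertainty projected onto phi(x)."""
    phi = [1.0, x]
    quad = sum(phi[i] * cov[i][j] * phi[j] for i in range(2) for j in range(2))
    return 1.0 / beta + quad
```

With training inputs in [0, 1], a query at x = 10 yields a much larger variance than a query at x = 0.5. A deployment could compare that variance to a threshold before trusting the prediction, though how the threshold is chosen is left open here.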
MajorK: Majority Based kmer Matching in Commodity DRAM
IF 2.3; Zone 3 (Computer Science); Q4 (Computer Science, Hardware & Architecture); Pub Date: 2024-04-02; DOI: 10.1109/LCA.2024.3384259
Z. Jahshan;L. Yavits
Fast parallel search capabilities on large datasets are required across multiple application domains. One such domain is genome analysis, which requires high-performance kmer matching in large genome databases. Recently proposed solutions implemented kmer matching in DRAM, utilizing its sheer capacity and parallelism. However, their operation is essentially bit-serial, which ultimately limits the performance, especially when matching long strings, as customary in genome analysis pipelines. The proposed solution, MajorK, enables bit-parallel majority based kmer matching in an unmodified commodity DRAM. MajorK employs multiple DRAM row activation, where the search patterns (query kmers) are coded into DRAM addresses. We evaluate MajorK on viral genome kmer matching and show that it can achieve up to 2.7× higher performance while providing a better matching accuracy compared to state-of-the-art DRAM based kmer matching accelerators.
多个应用领域都需要对大型数据集进行快速并行搜索。基因组分析就是这样一个领域,它需要在大型基因组数据库中进行高性能 kmer 匹配。最近提出的解决方案在 DRAM 中实现了 kmer 匹配,充分利用了 DRAM 的容量和并行性。然而,它们的操作本质上是比特串行的,最终限制了性能,尤其是在匹配长字符串时,这在基因组分析流水线中很常见。建议的解决方案 MajorK 可以在未修改的商品 DRAM 中实现基于比特并行多数的 kmer 匹配。MajorK 采用多 DRAM 行激活,将搜索模式(查询 kmers)编码到 DRAM 地址中。我们在病毒基因组kmer匹配上对MajorK进行了评估,结果表明,与基于DRAM的最先进的kmer匹配加速器相比,MajorK可以实现高达2.7倍的性能提升,同时提供更好的匹配精度。
IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 83-86. Journal Article.
Citations: 0