IEEE Computer Architecture Letters: Latest Articles

Efficient Implementation of Knuth Yao Sampler on Reconfigurable Hardware
IF 1.4 | Zone 3, Computer Science | Q4, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-09-03 | DOI: 10.1109/LCA.2024.3454490
Paresh Baidya;Rourab Paul;Swagata Mandal;Sumit Kumar Debnath
Lattice-based cryptography offers a promising alternative to traditional cryptographic schemes due to its resistance to quantum attacks. Discrete Gaussian sampling plays a crucial role in lattice-based cryptographic algorithms such as Ring Learning with Errors (R-LWE), where it generates the coefficients of the polynomials. The Knuth Yao Sampler is a widely used discrete Gaussian sampling technique in lattice-based cryptography. However, lattice-based cryptography involves resource-intensive, complex computation. Owing to its inherent parallelism and in-field programmability, Field Programmable Gate Array (FPGA) based reconfigurable hardware can be a good platform for implementing lattice-based cryptographic algorithms. In this work, an efficient implementation of the Knuth Yao Sampler on reconfigurable hardware is proposed that not only reduces resource utilization but also speeds up the sampling operation. The proposed method reduces the lookup table (LUT) requirement by almost 29% and increases the speed by almost 17 times compared to the method of Sinha Roy et al. (2014).
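For readers unfamiliar with the sampler named in this abstract, the following is a minimal software sketch of the classic column-wise Knuth-Yao random walk. It is not the hardware design proposed in the paper. The function name `knuth_yao_sample`, the toy probability matrix `PMAT` (probabilities 1/2, 1/4, 1/4 rather than a discrete Gaussian), and the `random_bits` helper are all assumptions made for illustration.

```python
import random

# Toy probability matrix (an assumption, not a Gaussian table):
# PMAT[row][col] is the col-th fractional binary digit of P(sample == row).
# P(0) = 0.10b = 1/2, P(1) = 0.01b = 1/4, P(2) = 0.01b = 1/4.
PMAT = [
    [1, 0],
    [0, 1],
    [0, 1],
]


def knuth_yao_sample(pmat, bits):
    """Walk the implicit DDG tree one column (tree level) at a time.

    `bits` is any iterable of 0/1 values; one bit is consumed per column.
    Returns the sampled row index, or None if the matrix columns run out
    (possible only when the stored probabilities are truncated).
    """
    bits = iter(bits)
    n_rows = len(pmat)
    n_cols = len(pmat[0])
    d = 0  # distance counter between the visited node and the terminal nodes
    for col in range(n_cols):
        d = 2 * d + next(bits)
        # Scan the column from the last row up to row 0.
        for row in range(n_rows - 1, -1, -1):
            d -= pmat[row][col]
            if d == -1:
                return row  # a terminal node was hit
    return None


def random_bits():
    """Endless stream of uniformly random bits (not cryptographically secure)."""
    while True:
        yield random.getrandbits(1)
```

Calling `knuth_yao_sample(PMAT, random_bits())` returns 0 about half the time and 1 or 2 about a quarter of the time each. A real sampler would use a much larger matrix holding the binary expansion of the Gaussian probabilities and a cryptographically secure bit source.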
Citations: 0
SmartQuant: CXL-Based AI Model Store in Support of Runtime Configurable Weight Quantization
IF 1.4 | Zone 3, Computer Science | Q4, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-09-02 | DOI: 10.1109/LCA.2024.3452699
Rui Xie;Asad Ul Haq;Linsen Ma;Krystal Sun;Sanchari Sen;Swagath Venkataramani;Liu Liu;Tong Zhang
Recent studies have revealed that, during inference on generative AI models such as transformers, the importance of different weights varies substantially with context. This suggests a promising opportunity to adaptively configure weight quantization to improve generative AI inference efficiency. Although configurable weight quantization can readily leverage the hardware support for variable-precision arithmetic in modern GPUs and AI accelerators, little prior research has studied how one could exploit variable weight quantization to proportionally improve AI model memory access speed and energy efficiency. Motivated by the rapidly maturing CXL ecosystem, this work develops a CXL-based design solution to fill this gap. The key is to allow CXL memory controllers to play an active role in supporting and exploiting runtime configurable weight quantization. Using the transformer as a representative generative AI model, we carried out experiments that demonstrate the effectiveness of the proposed design solution.
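The abstract does not describe how SmartQuant lays out or serves weights, so the sketch below only illustrates the general idea it motivates: if a weight stored as an 8-bit code can be served at a lower precision chosen at runtime, the bytes moved can shrink in proportion to the bit width. The truncation rule and the names `requantize` and `bytes_to_fetch` are assumptions for illustration.

```python
def requantize(code8, bits):
    """Keep only the `bits` most significant bits of an unsigned 8-bit code.

    This is plain truncation, one simple way to derive a lower-precision
    code from a stored higher-precision one (an assumed scheme).
    """
    if not 0 <= code8 <= 0xFF:
        raise ValueError("code8 must fit in 8 bits")
    if not 1 <= bits <= 8:
        raise ValueError("bits must be in 1..8")
    return code8 >> (8 - bits)


def bytes_to_fetch(num_weights, bits):
    """Bytes needed to transfer `num_weights` densely packed `bits`-wide codes."""
    return (num_weights * bits + 7) // 8
```

Under these assumptions, `bytes_to_fetch(1024, 4)` is half of `bytes_to_fetch(1024, 8)`, which is the kind of proportional saving in memory traffic the abstract refers to.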
Citations: 0
Proactive Embedding on Cold Data for Deep Learning Recommendation Model Training
IF 1.4 | Zone 3, Computer Science | Q4, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-08-28 | DOI: 10.1109/LCA.2024.3445948
Haeyoon Cho;Hyojun Son;Jungmin Choi;Byungil Koh;Minho Ha;John Kim
Deep learning recommendation models (DLRMs) are an important class of deep learning networks that are commonly used in many applications. DLRM presents unique challenges, especially for scale-out training, since it not only has compute- and memory-intensive components but the communication between the multiple GPUs is also on the critical path. In this work, we show how cold data in DLRM embedding tables can be exploited to enable proactive embedding. In particular, proactive embedding allows embedding table accesses to be done in advance, reducing the impact of memory access latency by overlapping the embedding access with communication. Our analysis of proactive embedding demonstrates that it can improve overall training performance by 46%.
Citations: 0
Octopus: A Cycle-Accurate Cache System Simulator
IF 1.4 | Zone 3, Computer Science | Q4, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-08-12 | DOI: 10.1109/LCA.2024.3441941
Mohamed Hossam;Salah Hessien;Mohamed Hassan
This paper introduces Octopus, an open-source cycle-accurate cache system simulator with flexible interconnect models. Octopus meticulously simulates various cache system and interconnect components, including controllers, data arrays, coherence protocols, and arbiters. Being cycle-accurate enables Octopus to precisely model the behavior of target systems while monitoring every memory request cycle by cycle. The design approach of Octopus distinguishes it from existing cache memory simulators: it does not enforce a fixed memory system architecture but instead offers flexibility in configuring component connections and parameters, enabling simulation of diverse memory architectures. Moreover, the simulator provides two modes of operation, standalone and full-system simulation, which together offer the best of both worlds: fast simulations and high accuracy.
Citations: 0
Cycle-Oriented Dynamic Approximation: Architectural Framework to Meet Performance Requirements
IF 1.4 | Zone 3, Computer Science | Q4, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-08-06 | DOI: 10.1109/LCA.2024.3439318
Yuya Degawa;Shota Suzuki;Junichiro Kadomoto;Hidetsugu Irie;Shuichi Sakai
Approximate computing achieves shorter execution times and reduced energy consumption in areas where precise computation written in a program is not essential to meet a goal. When applying the approximations, it is vital to satisfy the required quality-of-service (QoS) (execution time) and quality-of-results (QoR) (output accuracy). Existing methods have difficulty in maintaining a constant QoS or impose a burden on programmers. In this study, we propose the Cycle-oriented Dynamic Approximation (CODAX) algorithms and processor architecture that minimize the burden on the programmer and maintain the execution time close to the required QoS while providing the user with an option to satisfy their QoR requirement. CODAX operates based on a threshold that indicates the maximum number of cycles available for one loop iteration. The threshold automatically increases or decreases at runtime to bring the total number of elapsed cycles close to the required QoS. Furthermore, CODAX allows the user to change the threshold to indirectly guarantee the required QoR. Our simulation revealed that CODAX brought the actual number of executed cycles close to the expected number for four workloads.
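The threshold mechanism described in this abstract can be pictured with a small software model. The abstract states only that the per-iteration cycle threshold rises or falls at runtime so that total elapsed cycles approach the required QoS; the specific update rule below (split the remaining cycle budget evenly over the remaining iterations) and the assumption that an iteration exceeding the threshold is cut short at exactly the threshold are illustrative guesses, as is the name `run_with_cycle_budget`.

```python
def run_with_cycle_budget(exact_costs, total_budget):
    """Simulate a loop whose iterations are capped by an adaptive threshold.

    exact_costs:  cycles each iteration would take without approximation.
    total_budget: target total cycle count for the whole loop (the QoS).
    Returns (elapsed_cycles, thresholds_used).
    """
    elapsed = 0
    thresholds = []
    n = len(exact_costs)
    for i, cost in enumerate(exact_costs):
        remaining_iters = n - i
        # Assumed rule: share the remaining budget evenly, at least 1 cycle.
        threshold = max(1, (total_budget - elapsed) // remaining_iters)
        thresholds.append(threshold)
        # An iteration that would overrun is approximated (cut at threshold).
        elapsed += min(cost, threshold)
    return elapsed, thresholds
```

With `exact_costs = [10, 30, 10, 30]` and a budget of 60 cycles, cheap iterations leave slack, so the threshold grows from 15 to 24 and the loop finishes in exactly 60 cycles instead of the 80 the exact computation would need.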
Citations: 0
LTE: Lightweight and Time-Efficient Hardware Encoder for Post-Quantum Scheme HQC
IF 1.4 | Zone 3, Computer Science | Q4, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-07-30 | DOI: 10.1109/LCA.2024.3435495
Yazheng Tu;Pengzhou He;Chip-Hong Chang;Jiafeng Xie
Post-quantum cryptography (PQC) has gained increasing attention across the hardware research community, especially after the National Institute of Standards and Technology (NIST) started the PQC standardization process. There are, however, very few hardware implementations reported for Hamming Quasi-Cyclic (HQC), which is one of the NIST fourth-round PQC candidates. As encoding is an important step in code-based public key encryption schemes, this paper presents a Lightweight and Time-Efficient (LTE) hardware encoder for HQC. Our proposed design features a streamlined data flow setup to manage the iterative computations between the Reed-Solomon encoder and the Reed-Muller encoder, and a detailed analysis to obtain an optimized Galois field multiplier. The proposed LTE encoder is also implemented on an FPGA platform to demonstrate its area-time efficiency. Our evaluation shows that the proposed hardware implementation of the HQC encoder outperforms the most recently reported state-of-the-art hardware implementation, with 34.5%, 26.7%, and 35.2% reductions in area-delay product (ADP) for hqc-128, hqc-192, and hqc-256, respectively.
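As background for the Galois field multiplier mentioned in this abstract, here is a plain shift-and-add multiplication in GF(2^8), the kind of operation a Reed-Solomon encoder performs for each coefficient. It is a software reference sketch, not the optimized hardware multiplier from the paper. The reduction polynomial `0x11D` (x^8 + x^4 + x^3 + x^2 + 1) is assumed here and should be checked against the HQC specification before any real use.

```python
def gf256_mul(a, b, poly=0x11D):
    """Multiply two GF(2^8) elements given as integers in 0..255.

    Addition in the field is XOR. The product is built bit by bit from `b`,
    and `a` is reduced by `poly` whenever it overflows 8 bits.
    """
    if not (0 <= a <= 0xFF and 0 <= b <= 0xFF):
        raise ValueError("operands must be in 0..255")
    result = 0
    while b:
        if b & 1:
            result ^= a      # add the current multiple of a
        a <<= 1              # multiply a by x
        if a & 0x100:
            a ^= poly        # reduce modulo the field polynomial
        b >>= 1
    return result
```

For example, `gf256_mul(0x02, 0x80)` computes x * x^7 = x^8, which reduces to x^4 + x^3 + x^2 + 1, that is `0x1D`, under the assumed polynomial. Hardware designs typically unroll this loop into XOR and AND networks, which is where area and delay optimizations apply.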
Citations: 0
Architecting Compatible PIM Protocol for CPU-PIM Collaboration
IF 1.4 | Zone 3, Computer Science | Q4, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-07-24 | DOI: 10.1109/LCA.2024.3432936
Seunghyuk Yu;Hyeonu Kim;Kyoungho Jeun;Sunyoung Hwang;Eojin Lee
Processing in Memory (PIM) technology is gaining traction with the introduction of several prototype products. However, the interfaces of existing PIM devices hinder CPU performance excessively by delaying normal memory requests for long periods during PIM operations. In this paper, we propose a new PIM command and protocol designed for compatibility across various PIM devices and host processors, focusing on DRAM standards with limited command space. Our proposed command, PIM-ACT, activates multiple banks simultaneously while assigning the specific PIM operation. It closely follows the functionality of the ACT command for straightforward control by the memory controller. We also explore memory scheduling policies that balance the latency of conventional memory requests with the throughput of PIM workloads. Our evaluation demonstrates the effectiveness of our approach in optimizing both PIM and conventional workload performance.
Citations: 0
A Quantitative Analysis of State Space Model-Based Large Language Model: Study of Hungry Hungry Hippos
IF 1.4 | Zone 3, Computer Science | Q4, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-07-03 | DOI: 10.1109/LCA.2024.3422492
Dongho Yoon;Taehun Kim;Jae W. Lee;Minsoo Rhu
As the need for processing long contexts in large language models (LLMs) increases, attention-based LLMs face significant challenges due to their high computation and memory requirements. To overcome this challenge, several recent works seek to alleviate attention's system-level bottlenecks. An approach that has been receiving a lot of attention lately is state space models (SSMs), thanks to their ability to substantially reduce computational complexity and memory footprint. Despite the excitement around SSMs, there is a lack of in-depth characterization and analysis of this important model architecture. In this paper, we delve into a representative SSM named Hungry Hungry Hippos (H3), examining its advantages as well as its current limitations. We also discuss future research directions on improving the efficiency of SSMs via hardware architectural support.
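To make the complexity claim concrete, the sketch below runs a scalar linear state space recurrence, h_t = a * h_(t-1) + b * u_t with output y_t = c * h_t. It is a generic textbook SSM, not the H3 layer studied in the paper, which is more elaborate. The function name `ssm_scan` and the scalar parameters are assumptions for illustration.

```python
def ssm_scan(inputs, a, b, c):
    """Run a scalar linear state space model over a sequence.

    State update: h_t = a * h_{t-1} + b * u_t
    Output:       y_t = c * h_t
    Each step touches only the fixed-size state `h`, so the work per token
    and the memory carried between tokens do not grow with context length.
    """
    h = 0.0
    outputs = []
    for u in inputs:
        h = a * h + b * u
        outputs.append(c * h)
    return outputs
```

Feeding an impulse `[1.0, 0.0, 0.0]` with `a = 0.5`, `b = c = 1.0` yields `[1.0, 0.5, 0.25]`: the state decays geometrically while staying a single number. By contrast, attention at inference time keeps keys and values for every earlier token, so its per-token cost grows with the context.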
Citations: 0
Empirical Architectural Analysis on Performance Scalability of Petascale All-Flash Storage Systems
IF 1.4 | Zone 3, Computer Science | Q4, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-06-25 | DOI: 10.1109/LCA.2024.3418874
Mohammadamin Ajdari;Behrang Montazerzohour;Kimia Abdi;Hossein Asadi
In this paper, we first analyze a real storage system consisting of 72 SSDs utilizing either Hardware RAID (HW-RAID) or Software RAID (SW-RAID), and show that SW-RAID is up to 7× faster. We then reveal that with an increasing number of SSDs, the limited I/O parallelism in SAS controllers and multi-enclosure handshaking overheads cause a significant performance drop, reducing the total I/O Per Second (IOPS) of a 144-SSD system to less than that of a single SSD. Second, we disclose the most important architectural parameters that affect a large-scale storage system. Third, we propose a framework that models a large-scale storage system and estimates the system IOPS and system resource usage for various architectures. We verify our framework against a real system and show its high accuracy. Lastly, we analyze a use case of a 240-SSD system and reveal how our framework guides architects in storage system scaling.
Citations: 0
Accelerating Programmable Bootstrapping Targeting Contemporary GPU Microarchitecture
IF 1.4 | Zone 3, Computer Science | Q4, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-06-24 | DOI: 10.1109/LCA.2024.3418448
Hyesung Ji;Sangpyo Kim;Jaewan Choi;Jung Ho Ahn
Fully homomorphic encryption (FHE) enables computation on encrypted data without privacy leakage. Among FHE schemes, GSW-based schemes are notable for supporting the evaluation of arbitrary univariate functions using programmable bootstrapping (PBS). Despite their wide applicability, the computational complexity of a single PBS impedes widespread adoption. However, at the application level, there is a sufficient number of independent PBSs to achieve high data-level parallelism, making them suitable for running on GPUs, which are known for their high parallel computing capability. On contemporary GPUs, peak integer performance has steadily increased, and the sizes of L2 cache and shared memory have also grown rapidly since the Volta architecture. Prior attempts to accelerate PBS on GPUs have fallen short because their outdated implementations cannot leverage recent GPU advances. In this paper, we introduce a GPU implementation that supports the latest PBS algorithm and incorporates GPU-trend-aware optimizations. Our implementation achieves a 10.8× performance improvement over the state-of-the-art (SOTA) GPU implementations on RTX 4090 and even outperforms the SOTA ASIC implementation.
Open access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10570278
Citations: 0