
Latest Publications in IEEE Computer Architecture Letters

TeleVM: A Lightweight Virtual Machine for RISC-V Architecture
IF 2.3 CAS Tier 3 (Computer Science) Pub Date: 2024-04-30 DOI: 10.1109/LCA.2024.3394835
Tianzheng Li;Enfang Cui;Yuting Wu;Qian Wei;Yue Gao
Serverless computing has become an important paradigm in cloud computing due to advantages such as fast large-scale deployment and a pay-as-you-go charging model. Because of shared infrastructure and multi-tenant environments, serverless applications have high security requirements, which traditional virtual machines and containers cannot fully meet. Lightweight virtual machine technology has therefore emerged, reducing overhead and boot time while ensuring security. In this letter, we propose TeleVM, a lightweight virtual machine for the RISC-V architecture. TeleVM achieves strong isolation through the hypervisor extension of RISC-V. Compared with traditional virtual machines, TeleVM implements only a small number of IO devices and functions, which effectively reduces memory overhead and boot time. We compared TeleVM and QEMU+KVM through experiments: relative to QEMU+KVM, the boot time and memory overhead of TeleVM are reduced by 74% and 90%, respectively. This work further improves the cloud computing software ecosystem of the RISC-V architecture and promotes the use of RISC-V in cloud computing scenarios.
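To put the reported savings in perspective, here is a minimal Python sketch that applies the 74% boot-time and 90% memory-overhead reductions to a hypothetical QEMU+KVM baseline; the baseline figures below are placeholders for illustration, not measurements from the letter.

```python
# Hypothetical illustration of the reported relative savings; the QEMU+KVM
# baseline numbers below are assumed placeholders, not figures from the letter.
baseline = {"boot_time_ms": 1000.0, "memory_overhead_mb": 128.0}   # assumed QEMU+KVM figures
reduction = {"boot_time_ms": 0.74, "memory_overhead_mb": 0.90}     # reductions reported for TeleVM

for metric, base in baseline.items():
    televm = base * (1.0 - reduction[metric])
    print(f"{metric}: baseline {base:.1f} -> TeleVM {televm:.1f} "
          f"({reduction[metric]:.0%} lower)")
```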
Citations: 0
Analysis of Data Transfer Bottlenecks in Commercial PIM Systems: A Study With UPMEM-PIM
IF 1.4 CAS Tier 3 (Computer Science) Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2024-04-12 DOI: 10.1109/LCA.2024.3387472
Dongjae Lee;Bongjoon Hyun;Taehun Kim;Minsoo Rhu
Due to emerging workloads that require high memory bandwidth, Processing-in-Memory (PIM) has gained significant attention and has led to the introduction of several industrial PIM products that are integrated with conventional computing systems. This letter characterizes the data transfer overheads between the conventional DRAM address space and the PIM address space within a PIM-integrated system, using the commercialized PIM device made by UPMEM. Our findings highlight the need for optimization in PIM-integrated systems to address these overheads, offering critical insights for future PIM technologies.
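To illustrate why such transfer overheads matter, here is a minimal analytical sketch of a PIM offload whose total time is copy-in plus kernel plus copy-out; the bandwidth and kernel-throughput numbers are assumptions for illustration only, not the UPMEM measurements reported in the letter.

```python
# Simple breakdown of a PIM offload: total time = copy-in + kernel + copy-out.
# Bandwidth and kernel-throughput numbers are assumptions for illustration only;
# they are not the UPMEM measurements reported in the letter.

def offload_time_s(data_bytes, host_to_pim_gbps, pim_to_host_gbps, kernel_gbps):
    copy_in = data_bytes / (host_to_pim_gbps * 1e9)
    compute = data_bytes / (kernel_gbps * 1e9)
    copy_out = data_bytes / (pim_to_host_gbps * 1e9)
    return copy_in, compute, copy_out

data = 1 << 30  # 1 GiB working set
copy_in, compute, copy_out = offload_time_s(data, host_to_pim_gbps=8.0,
                                            pim_to_host_gbps=5.0, kernel_gbps=50.0)
total = copy_in + compute + copy_out
print(f"copy-in  {copy_in * 1e3:7.1f} ms ({copy_in / total:5.1%})")
print(f"kernel   {compute * 1e3:7.1f} ms ({compute / total:5.1%})")
print(f"copy-out {copy_out * 1e3:7.1f} ms ({copy_out / total:5.1%})")
```

Even with a fast PIM kernel, the two host-mediated copies dominate the end-to-end time under these assumed numbers, which is the bottleneck the letter quantifies on real hardware.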
Citations: 0
GATe: Streamlining Memory Access and Communication to Accelerate Graph Attention Network With Near-Memory Processing
IF 2.3 CAS Tier 3 (Computer Science) Pub Date: 2024-04-10 DOI: 10.1109/LCA.2024.3386734
Shiyan Yi;Yudi Qiu;Lingfei Lu;Guohao Xu;Yong Gong;Xiaoyang Zeng;Yibo Fan
Graph Attention Network (GAT) has gained widespread adoption thanks to its exceptional performance. The critical components of a GAT model are aggregation and attention, which cause numerous main-memory accesses. Recently, much research has proposed near-memory processing (NMP) architectures to accelerate aggregation. However, graph attention requires additional operations distinct from aggregation, making previous NMP architectures less suitable for supporting GAT. In this paper, we propose GATe, a practical and efficient GAT accelerator with an NMP architecture. To the best of our knowledge, this is the first work that accelerates both attention and aggregation computation on DIMMs. In the attention and aggregation phases, we unify feature vector access to reduce repetitive memory accesses and refine the computation flow to reduce communication. Furthermore, we introduce a novel sharding method that enhances data reusability. Experiments show that our work achieves substantial speedups of up to 6.77× and 2.46×, respectively, compared to the state-of-the-art NMP works GNNear and GraNDe.
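For readers unfamiliar with GAT, here is a minimal NumPy sketch of a single-head GAT layer that makes the two phases explicit: both the attention coefficients and the aggregation read the same transformed feature vectors, which is the access pattern GATe unifies. This follows the standard GAT formulation, not the accelerator's internal dataflow.

```python
import numpy as np

def gat_layer(features, adj, W, a_src, a_dst, alpha=0.2):
    """One single-head GAT layer: attention phase, then aggregation phase.

    features: (N, F) node features, adj: (N, N) 0/1 adjacency with self loops,
    W: (F, F') weights, a_src/a_dst: (F',) attention vectors.
    """
    h = features @ W                                 # transformed features, read by BOTH phases
    # attention phase: e_ij = LeakyReLU(a_src . h_i + a_dst . h_j)
    e = (h @ a_src)[:, None] + (h @ a_dst)[None, :]
    e = np.where(e > 0, e, alpha * e)                # LeakyReLU
    e = np.where(adj > 0, e, -np.inf)                # mask non-edges
    att = np.exp(e - e.max(axis=1, keepdims=True))
    att = att / att.sum(axis=1, keepdims=True)       # row-wise softmax over neighbors
    # aggregation phase: h_i' = sum_j att_ij * h_j
    return att @ h

rng = np.random.default_rng(0)
N, F, Fp = 6, 8, 4
adj = (rng.random((N, N)) < 0.4).astype(float)
np.fill_diagonal(adj, 1.0)                           # self loops keep every row non-empty
out = gat_layer(rng.standard_normal((N, F)), adj,
                rng.standard_normal((F, Fp)),
                rng.standard_normal(Fp), rng.standard_normal(Fp))
print(out.shape)  # (6, 4)
```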
Citations: 0
An Area Efficient Architecture of a Novel Chaotic System for High Randomness Security in e-Health
IF 2.3 CAS Tier 3 (Computer Science) Pub Date: 2024-04-10 DOI: 10.1109/LCA.2024.3387352
Kyriaki Tsantikidou;Nicolas Sklavos
An e-Health application must be carefully designed, as a malicious attack has ethical and legal consequences. While common cryptography protocols enhance security, they also add high computation overhead. In this letter, an area-efficient architecture of a novel chaotic system for high-randomness security is proposed. It consists of the chaotic logistic map and a novel component that efficiently combines it with a block cipher's key generation function. The proposed architecture operates as both a key scheduling/management scheme and a stream cipher. All operations are implemented in an FPGA with appropriate resource utilization techniques. The proposed architecture achieves at least 41.5% smaller area consumption than published cryptography architectures and a 5.7% increase in throughput-to-area efficiency compared to published chaotic designs. Finally, it passes all NIST randomness tests, exhibits the avalanche effect, and produces the highest number of random bits from a single seed compared to other published security systems.
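A minimal software sketch of the underlying primitive follows: the chaotic logistic map used as a keystream generator. The actual design is a fixed-point FPGA datapath combined with a block cipher's key-generation function, which this floating-point sketch does not model; the seed, map parameter, and threshold bit extraction are illustrative assumptions.

```python
def logistic_keystream(seed, r=3.99, n_bits=64, burn_in=100):
    """Generate a keystream by iterating x_{k+1} = r * x_k * (1 - x_k)
    and emitting one bit per iteration (1 if x > 0.5)."""
    x = seed
    for _ in range(burn_in):          # discard the transient so output depends strongly on the seed
        x = r * x * (1.0 - x)
    bits = 0
    for _ in range(n_bits):
        x = r * x * (1.0 - x)
        bits = (bits << 1) | (1 if x > 0.5 else 0)
    return bits

ks = logistic_keystream(seed=0.123456789)
plaintext = 0xDEADBEEFCAFEBABE
ciphertext = plaintext ^ ks           # stream-cipher style XOR
assert ciphertext ^ ks == plaintext   # decryption with the same keystream recovers the plaintext
print(hex(ks), hex(ciphertext))
```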
Citations: 0
The Importance of Generalizability in Machine Learning for Systems
IF 2.3 CAS Tier 3 (Computer Science) Pub Date: 2024-04-02 DOI: 10.1109/LCA.2024.3384449
Varun Gohil;Sundar Dev;Gaurang Upasani;David Lo;Parthasarathy Ranganathan;Christina Delimitrou
Using machine learning (ML) to tackle computer systems tasks is gaining popularity. One of the shortcomings of such ML-based approaches is the inability of models to generalize to out-of-distribution data, i.e., data whose distribution differs from that of the training dataset. We showcase that this issue exists in cloud environments by analyzing various ML models used to improve resource balance in Google's fleet. We discuss the trade-offs associated with different techniques used to detect out-of-distribution data. Finally, we propose and demonstrate the efficacy of using Bayesian models to detect the model's confidence in its output when used to improve cloud server resource balance.
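A minimal sketch of the general idea of using predictive uncertainty to flag out-of-distribution inputs follows, here with a bootstrap ensemble whose disagreement serves as a confidence proxy (a common stand-in for a fully Bayesian model); the data, model, and threshold are synthetic assumptions, not Google's fleet models or data.

```python
import numpy as np

rng = np.random.default_rng(1)

# In-distribution training data: x in [0, 1].
x_train = rng.uniform(0.0, 1.0, size=200)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(200)

# Bootstrap ensemble of small polynomial regressors.
ensemble = []
for _ in range(10):
    idx = rng.integers(0, len(x_train), len(x_train))   # bootstrap resample
    ensemble.append(np.polyfit(x_train[idx], y_train[idx], 3))

def predict_with_confidence(x):
    preds = np.array([np.polyval(w, x) for w in ensemble])
    return preds.mean(axis=0), preds.std(axis=0)         # std = disagreement = uncertainty proxy

for x in (0.5, 2.5):                                     # in-distribution vs. out-of-distribution input
    mean, std = predict_with_confidence(np.array([x]))
    flag = "possible OOD" if std[0] > 0.2 else "in-distribution"   # 0.2 is an assumed threshold
    print(f"x={x:3.1f}  pred={mean[0]:+.2f}  uncertainty={std[0]:.3f}  -> {flag}")
```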
Citations: 0
MajorK: Majority Based kmer Matching in Commodity DRAM
IF 2.3 CAS Tier 3 (Computer Science) Pub Date: 2024-04-02 DOI: 10.1109/LCA.2024.3384259
Z. Jahshan;L. Yavits
Fast parallel search capabilities on large datasets are required across multiple application domains. One such domain is genome analysis, which requires high-performance kmer matching in large genome databases. Recently proposed solutions implemented kmer matching in DRAM, utilizing its sheer capacity and parallelism. However, their operation is essentially bit-serial, which ultimately limits the performance, especially when matching long strings, as customary in genome analysis pipelines. The proposed solution, MajorK, enables bit-parallel majority based kmer matching in an unmodified commodity DRAM. MajorK employs multiple DRAM row activation, where the search patterns (query kmers) are coded into DRAM addresses. We evaluate MajorK on viral genome kmer matching and show that it can achieve up to 2.7× higher performance while providing a better matching accuracy compared to state-of-the-art DRAM based kmer matching accelerators.
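A small functional sketch of the majority primitive this relies on: activating several DRAM rows at once resolves each bit position to its majority value, and fixing one operand row to all-zeros or all-ones turns majority into AND or OR, from which a kmer comparison can be built. The 2-bit base encoding is an assumption of this sketch; DRAM timing and MajorK's actual address coding are not modeled.

```python
import numpy as np

def majority(rows):
    """Bitwise majority across an odd number of simultaneously 'activated' rows."""
    rows = np.asarray(rows, dtype=np.uint8)
    return (rows.sum(axis=0) > rows.shape[0] // 2).astype(np.uint8)

# 2-bit-per-base encoding of kmers into bit rows (an assumption of this sketch).
ENC = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}

def encode(kmer):
    return np.array([b for base in kmer for b in ENC[base]], dtype=np.uint8)

query = encode("ACGTACGT")
ref_match = encode("ACGTACGT")
ref_miss = encode("ACGTTCGT")

# With majority as the only primitive, AND/OR emerge by adding constant rows.
ones, zeros = np.ones_like(query), np.zeros_like(query)
print("AND via MAJ:", np.array_equal(majority([query, ref_match, zeros]), query & ref_match))
print("OR  via MAJ:", np.array_equal(majority([query, ref_miss, ones]), query | ref_miss))

# A match check built on those primitives: all bit positions agree <=> no XOR bits set.
def matches(q, r):
    xor = (q | r) & ~(q & r) & 1          # XOR composed from OR, AND, and NOT
    return not xor.any()

print("exact match:", matches(query, ref_match), matches(query, ref_miss))
```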
Citations: 0
SLO-Aware GPU DVFS for Energy-Efficient LLM Inference Serving
IF 1.4 CAS Tier 3 (Computer Science) Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2024-03-28 DOI: 10.1109/LCA.2024.3406038
Andreas Kosmas Kakolyris;Dimosthenis Masouros;Sotirios Xydis;Dimitrios Soudris
The increasing popularity of LLM-based chatbots, combined with their reliance on power-hungry GPU infrastructure, forms a critical challenge for providers: minimizing energy consumption under Service-Level Objectives (SLOs) that ensure optimal user experience. Traditional energy optimization methods fall short for LLM inference due to the autoregressive architecture of LLMs, which renders these methods incapable of meeting a predefined SLO without energy overprovisioning. This autoregressive nature, however, allows for iteration-level adjustments, enabling continuous fine-tuning of the system throughout the inference process. In this letter, we propose a solution based on iteration-level GPU Dynamic Voltage Frequency Scaling (DVFS), aiming to reduce the energy impact of LLM serving, an approach with the potential for more than 22.8% and up to 45.5% energy gains when tested in real-world situations under varying SLO constraints. Our approach works on top of existing LLM hosting services, requires minimal profiling, and needs no intervention in the inference serving system.
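A minimal sketch of the core policy idea follows: at each iteration, pick the lowest-power GPU frequency whose predicted per-token latency still meets the SLO, trading excess latency slack for energy. The frequency/latency/power table is an assumed illustration, not profiled GPU data or the letter's actual controller.

```python
# Hypothetical per-token latency and power at each GPU frequency (illustrative numbers only).
FREQ_TABLE = [  # (freq_mhz, latency_ms_per_token, power_w)
    (1410, 18.0, 300.0),
    (1200, 21.0, 240.0),
    (1005, 25.0, 190.0),
    (810,  31.0, 150.0),
]

def pick_frequency(slo_ms_per_token):
    """Lowest-power setting that still meets the per-token latency SLO."""
    feasible = [row for row in FREQ_TABLE if row[1] <= slo_ms_per_token]
    if not feasible:
        return FREQ_TABLE[0]           # SLO too tight for any setting: fall back to max frequency
    return min(feasible, key=lambda row: row[2])

baseline_energy_mj = FREQ_TABLE[0][1] * FREQ_TABLE[0][2]     # max-frequency energy per token (mJ)
for slo in (20.0, 26.0, 35.0):
    freq, lat, power = pick_frequency(slo)
    energy_mj = lat * power
    print(f"SLO {slo:4.1f} ms/token -> {freq} MHz, {lat} ms, {power} W, "
          f"energy/token {energy_mj / 1000:.1f} J vs {baseline_energy_mj / 1000:.1f} J at max freq")
```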
Citations: 0
Dramaton: A Near-DRAM Accelerator for Large Number Theoretic Transforms
IF 2.3 CAS Tier 3 (Computer Science) Pub Date: 2024-03-27 DOI: 10.1109/LCA.2024.3381452
Yongmo Park;Subhankar Pal;Aporva Amarnath;Karthik Swaminathan;Wei D. Lu;Alper Buyuktosunoglu;Pradip Bose
With the rising popularity of post-quantum cryptographic schemes, realizing practical implementations for real-world applications is still a major challenge. A major bottleneck in such schemes is the fetching and processing of large polynomials in the Number Theoretic Transform (NTT), which makes non-von Neumann paradigms, such as near-memory processing, a viable option. We therefore propose a novel near-DRAM NTT accelerator design, called Dramaton. Additionally, we introduce a conflict-free mapping algorithm that enables Dramaton to process large NTTs with minimal hardware overhead using a fixed-permutation network. Dramaton achieves a 5–207× speedup in latency over the state of the art and a 97× improvement in EDP over a recent near-memory NTT accelerator.
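For reference, here is a compact software NTT, the iterative transform over Z_q that such accelerators compute, checked against a naive O(n²) transform for a small NTT-friendly prime. It illustrates the operation being accelerated, not Dramaton's near-DRAM mapping or its fixed-permutation network.

```python
def ntt(a, q, root):
    """Iterative NTT of a (length a power of two) over Z_q,
    where root is a primitive len(a)-th root of unity mod q."""
    a = list(a)
    n = len(a)
    # bit-reversal permutation
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # butterfly stages
    length = 2
    while length <= n:
        w_len = pow(root, n // length, q)
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + length // 2):
                u, v = a[k], a[k + length // 2] * w % q
                a[k], a[k + length // 2] = (u + v) % q, (u - v) % q
                w = w * w_len % q
        length <<= 1
    return a

q = 257                            # NTT-friendly prime: 257 = 2^8 + 1, so 16 divides q - 1
n = 16
root = pow(3, (q - 1) // n, q)     # 3 is a primitive root mod 257, so this has order 16
x = list(range(n))
out = ntt(x, q, root)
naive = [sum(x[i] * pow(root, i * k, q) for i in range(n)) % q for k in range(n)]
print(out == naive)                # True: matches the naive O(n^2) transform
```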
Citations: 0
Characterizing Machine Learning-Based Runtime Prefetcher Selection
IF 1.4 CAS Tier 3 (Computer Science) Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2024-03-27 DOI: 10.1109/LCA.2024.3404887
Erika S. Alcorta;Mahesh Madhav;Richard Afoakwa;Scott Tetrick;Neeraja J. Yadwadkar;Andreas Gerstlauer
Modern computer designs support composite prefetching, where multiple prefetcher components are used to target different memory access patterns. However, multiple prefetchers competing for resources can sometimes hurt performance, especially in many-core systems where cache and other resources are limited. Recent work has proposed mitigating this issue by selectively enabling and disabling prefetcher components at runtime. Formulating the problem with machine learning (ML) methods is promising, but efficient and effective solutions in terms of cost and performance are not well understood. This work studies fundamental characteristics of the composite prefetcher selection problem through the lens of ML to inform future prefetcher selection designs. We show that prefetcher decisions do not have significant temporal dependencies, that a phase-based rather than sample-based definition of ground truth yields patterns that are easier to learn, and that prefetcher selection can be formulated as a workload-agnostic problem requiring little to no training at runtime.
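A minimal sketch of the phase-based formulation argued for here: aggregate hardware-counter samples into phases, label each phase with its best-performing prefetcher configuration, and use a stateless, workload-agnostic classifier at runtime. The features, labels, and nearest-centroid model are synthetic assumptions, not the letter's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-sample hardware-counter features: [miss rate, stride regularity].
def make_samples(n, miss, stride):
    return np.column_stack([rng.normal(miss, 0.03, n), rng.normal(stride, 0.05, n)])

# Phase-based ground truth: each phase (not each sample) is labeled with the best
# prefetcher configuration observed for it (labels here are synthetic assumptions).
phases = [
    (make_samples(50, miss=0.30, stride=0.9), "stride_prefetcher"),
    (make_samples(50, miss=0.25, stride=0.2), "spatial_prefetcher"),
    (make_samples(50, miss=0.05, stride=0.5), "prefetchers_off"),
]

# Workload-agnostic, stateless model: nearest class centroid on phase-averaged features.
centroids = {label: feats.mean(axis=0) for feats, label in phases}

def select_prefetcher(window):
    """Pick a prefetcher for a new phase from the average of its counter samples."""
    feat = window.mean(axis=0)
    return min(centroids, key=lambda lbl: np.linalg.norm(feat - centroids[lbl]))

new_phase = make_samples(20, miss=0.28, stride=0.85)   # unseen, stride-friendly phase
print(select_prefetcher(new_phase))                    # -> stride_prefetcher
```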
Citations: 0
Exploiting Intel Advanced Matrix Extensions (AMX) for Large Language Model Inference
IF 2.3 CAS Tier 3 (Computer Science) Pub Date: 2024-03-24 DOI: 10.1109/LCA.2024.3397747
Hyungyo Kim;Gaohan Ye;Nachuan Wang;Amir Yazdanbakhsh;Nam Sung Kim
The ever-increasing number of parameters in Large Language Models (LLMs) demands many expensive GPUs for both inference and training. This is because even a high-end GPU such as the NVIDIA A100 can store only a subset of the parameters due to its limited memory capacity. To reduce the number of required GPUs, especially for inference, we may exploit the large memory capacity of the (host) CPU to store not only all the model parameters but also the intermediate outputs, which also require a substantial memory capacity. However, this necessitates frequent data transfers between the CPU and GPU over the slow PCIe interface, creating a bottleneck that hinders the accomplishment of both low latency and high throughput in inference. To address such a challenge, we first propose CPU-GPU cooperative computing that exploits the Advanced Matrix Extensions (AMX) capability of the latest Intel CPU, codenamed Sapphire Rapids (SPR). Second, we propose an adaptive model partitioning policy that determines the layers of a given LLM to be run on the CPU and GPU, respectively, based on their memory capacity requirement and arithmetic intensity. As the CPU executes the layers with large memory capacity requirements but low arithmetic intensity, the amount of data transferred through the PCIe interface is significantly reduced, thereby improving LLM inference performance. Our evaluation demonstrates that CPU-GPU cooperative computing, based on this policy, delivers 12.1× lower latency and 5.4× higher throughput than GPU-only computing for OPT-30B inference when both CPU-GPU and GPU-only computing store the model in CPU memory.
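A minimal sketch of an adaptive partitioning policy in the spirit described above: layers with large memory footprints but low arithmetic intensity go to the CPU (where AMX supplies matrix throughput and memory is plentiful), and the rest go to the GPU while its memory lasts. The layer table and thresholds are illustrative assumptions, not the authors' policy parameters or OPT-30B measurements.

```python
# Each layer: (name, memory footprint in GiB, arithmetic intensity in FLOPs/byte).
# Numbers are illustrative assumptions, not OPT-30B measurements.
LAYERS = [
    ("embedding",       4.0,  2.0),
    ("attention_kv",   12.0,  1.5),   # large KV-cache, few FLOPs per byte: CPU-friendly
    ("attention_proj",  1.5, 60.0),
    ("ffn",             3.0, 90.0),
    ("lm_head",         2.0, 40.0),
]

def partition(layers, gpu_mem_gib=40.0, intensity_cutoff=10.0):
    """Assign low-arithmetic-intensity layers to the CPU; pack the rest onto the GPU
    while its memory lasts, spilling any remainder back to the CPU."""
    cpu, gpu, gpu_used = [], [], 0.0
    for name, mem, intensity in sorted(layers, key=lambda l: -l[2]):
        if intensity >= intensity_cutoff and gpu_used + mem <= gpu_mem_gib:
            gpu.append(name)
            gpu_used += mem
        else:
            cpu.append(name)   # low intensity or GPU memory exhausted: run with AMX on the CPU
    return cpu, gpu

cpu_layers, gpu_layers = partition(LAYERS)
print("CPU (AMX):", cpu_layers)
print("GPU      :", gpu_layers)
```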
Citations: 0