IEEE Computer Architecture Letters最新文献

英文中文

SmartQuant: CXL-Based AI Model Store in Support of Runtime Configurable Weight Quantization SmartQuant：基于 CXL 的人工智能模型存储，支持运行时可配置的权重量化

IF 1.4 3区计算机科学 Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Computer Architecture Letters

Pub Date : 2024-09-02 DOI: 10.1109/LCA.2024.3452699

Rui Xie;Asad Ul Haq;Linsen Ma;Krystal Sun;Sanchari Sen;Swagath Venkataramani;Liu Liu;Tong Zhang

Recent studies have revealed that, during the inference on generative AI models such as transformer, the importance of different weights exhibits substantial context-dependent variations. This naturally manifests a promising potential of adaptively configuring weight quantization to improve the generative AI inference efficiency. Although configurable weight quantization can readily leverage the hardware support of variable-precision arithmetics in modern GPU and AI accelerators, little prior research has studied how one could exploit variable weight quantization to proportionally improve the AI model memory access speed and energy efficiency. Motivated by the rapidly maturing CXL ecosystem, this work develops a CXL-based design solution to fill this gap. The key is to allow CXL memory controllers play an active role in supporting and exploiting runtime configurable weight quantization. Using transformer as a representative generative AI model, we carried out experiments that well demonstrate the effectiveness of the proposed design solution.

最近的研究发现，在生成式人工智能模型（如变压器）的推理过程中，不同权重的重要性会表现出很大的上下文依赖性变化。这自然体现了自适应配置权重量化以提高生成式人工智能推理效率的巨大潜力。虽然可配置的权重量化可以轻松利用现代 GPU 和人工智能加速器对可变精度算术的硬件支持，但之前的研究很少涉及如何利用可变权重量化成比例地提高人工智能模型的内存访问速度和能效。在快速成熟的 CXL 生态系统的推动下，这项工作开发了一种基于 CXL 的设计解决方案，以填补这一空白。关键是让 CXL 内存控制器在支持和利用运行时可配置权重量化方面发挥积极作用。我们使用变压器作为具有代表性的生成式人工智能模型进行了实验，很好地证明了所提设计方案的有效性。

引用次数: 0

Proactive Embedding on Cold Data for Deep Learning Recommendation Model Training 在冷数据上主动嵌入，用于深度学习推荐模型训练

IF 1.4 3区计算机科学 Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Computer Architecture Letters

Pub Date : 2024-08-28 DOI: 10.1109/LCA.2024.3445948

Haeyoon Cho;Hyojun Son;Jungmin Choi;Byungil Koh;Minho Ha;John Kim

Deep learning recommendation model (DLRM) is an important class of deep learning networks that are commonly used in many applications. DRLM presents unique challenges, especially for scale-out training since it not only has compute and memory-intensive components but the communication between the multiple GPUs is also on the critical path. In this work, we propose how cold data in DLRM embedding tables can be exploited to propose proactive embedding. In particular, proactive embedding allows embedding table accesses to be done in advance to reduce the impact of the memory access latency by overlapping the embedding access with communication. Our analysis of proactive embedding demonstrates that it can improve overall training performance by 46%.

深度学习推荐模型（DLRM）是一类重要的深度学习网络，常用于许多应用中。DRLM 带来了独特的挑战，尤其是在扩展训练方面，因为它不仅有计算和内存密集型组件，而且多个 GPU 之间的通信也是关键路径。在这项工作中，我们提出了如何利用 DRLRM 嵌入表中的冷数据来实现主动嵌入。特别是，主动嵌入允许提前访问嵌入表，通过将嵌入访问与通信重叠来减少内存访问延迟的影响。我们对主动嵌入的分析表明，它能将整体训练性能提高 46%。

引用次数: 0

Octopus: A Cycle-Accurate Cache System Simulator 章鱼：周期精确的高速缓存系统模拟器

IF 1.4 3区计算机科学 Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Computer Architecture Letters

Pub Date : 2024-08-12 DOI: 10.1109/LCA.2024.3441941

Mohamed Hossam;Salah Hessien;Mohamed Hassan

This paper introduces Octopus¹, an open-source cycle-accurate cache system simulator with flexible interconnect models. Octopus meticulously simulates various cache system and interconnect components, including controllers, data arrays, coherence protocols, and arbiters. Being cycle-accurate enables Octopus to precisely model the behavior of target systems, while monitoring every memory request cycle by cycle. The design approach of Octopus distinguishes it from existing cache memory simulators, as it does not enforce a fixed memory system architecture but instead offers flexibility in configuring component connections and parameters, enabling simulation of diverse memory architectures. Moreover, the simulator provides two dual modes of operation, standalone and full-system simulation, which attains the best of both worlds benefits: fast simulations and high accuracy.

本文介绍了具有灵活互连模型的开源周期精确高速缓存系统模拟器 Octopus1。Octopus 可细致模拟各种高速缓存系统和互连组件，包括控制器、数据阵列、一致性协议和仲裁器。周期精确性使 Octopus 能够精确模拟目标系统的行为，同时逐周期监控每个内存请求。Octopus 的设计方法有别于现有的高速缓冲存储器模拟器，因为它不强制执行固定的内存系统架构，而是灵活配置组件连接和参数，从而能够模拟各种内存架构。此外，该模拟器还提供了独立和全系统模拟两种双重操作模式，从而实现了两全其美的效果：快速模拟和高精度。

引用次数: 0

Cycle-Oriented Dynamic Approximation: Architectural Framework to Meet Performance Requirements 面向周期的动态逼近：满足性能要求的架构框架

IF 1.4 3区计算机科学 Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Computer Architecture Letters

Pub Date : 2024-08-06 DOI: 10.1109/LCA.2024.3439318

Yuya Degawa;Shota Suzuki;Junichiro Kadomoto;Hidetsugu Irie;Shuichi Sakai

Approximate computing achieves shorter execution times and reduced energy consumption in areas where precise computation written in a program is not essential to meet a goal. When applying the approximations, it is vital to satisfy the required quality-of-service (QoS) (execution time) and quality-of-results (QoR) (output accuracy). Existing methods have difficulty in maintaining a constant QoS or impose a burden on programmers. In this study, we propose the Cycle-oriented Dynamic Approximation (CODAX) algorithms and processor architecture that minimize the burden on the programmer and maintain the execution time close to the required QoS while providing the user with an option to satisfy their QoR requirement. CODAX operates based on a threshold that indicates the maximum number of cycles available for one loop iteration. The threshold automatically increases or decreases at runtime to bring the total number of elapsed cycles close to the required QoS. Furthermore, CODAX allows the user to change the threshold to indirectly guarantee the required QoR. Our simulation revealed that CODAX brought the actual number of executed cycles close to the expected number for four workloads.

近似计算可以缩短执行时间，降低能耗，在这些领域中，程序中的精确计算对于实现目标并非必不可少。在应用近似计算时，满足所需的服务质量（QoS）（执行时间）和结果质量（QoR）（输出精确度）至关重要。现有的方法难以保持稳定的 QoS，或给程序员带来负担。在本研究中，我们提出了面向循环的动态逼近 (CODAX) 算法和处理器架构，可最大限度地减轻程序员的负担，并将执行时间保持在所需的 QoS 附近，同时为用户提供满足 QoR 要求的选择。CODAX 基于一个阈值运行，该阈值表示一个循环迭代可用的最大周期数。该阈值会在运行时自动增加或减少，以使循环总次数接近所需的 QoS。此外，CODAX 还允许用户更改阈值，以间接保证所需的 QoR。我们的模拟显示，CODAX 使四种工作负载的实际执行周期数接近预期数。

{"title":"Cycle-Oriented Dynamic Approximation: Architectural Framework to Meet Performance Requirements","authors":"Yuya Degawa;Shota Suzuki;Junichiro Kadomoto;Hidetsugu Irie;Shuichi Sakai","doi":"10.1109/LCA.2024.3439318","DOIUrl":"10.1109/LCA.2024.3439318","url":null,"abstract":"Approximate computing achieves shorter execution times and reduced energy consumption in areas where precise computation written in a program is not essential to meet a goal. When applying the approximations, it is vital to satisfy the required quality-of-service (QoS) (execution time) and quality-of-results (QoR) (output accuracy). Existing methods have difficulty in maintaining a constant QoS or impose a burden on programmers. In this study, we propose the Cycle-oriented Dynamic Approximation (CODAX) algorithms and processor architecture that minimize the burden on the programmer and maintain the execution time close to the required QoS while providing the user with an option to satisfy their QoR requirement. CODAX operates based on a threshold that indicates the maximum number of cycles available for one loop iteration. The threshold automatically increases or decreases at runtime to bring the total number of elapsed cycles close to the required QoS. Furthermore, CODAX allows the user to change the threshold to indirectly guarantee the required QoR. Our simulation revealed that CODAX brought the actual number of executed cycles close to the expected number for four workloads.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"211-214"},"PeriodicalIF":1.4,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141938366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

LTE: Lightweight and Time-Efficient Hardware Encoder for Post-Quantum Scheme HQC LTE：用于后量子方案 HQC 的轻量级省时硬件编码器

IF 1.4 3区计算机科学 Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Computer Architecture Letters

Pub Date : 2024-07-30 DOI: 10.1109/LCA.2024.3435495

Yazheng Tu;Pengzhou He;Chip-Hong Chang;Jiafeng Xie

Post-quantum cryptography (PQC) has gained increasing attention across the hardware research community, especially after the National Institute of Standards and Technology (NIST) started the PQC standardization process. There are, however, very few hardware implementations reported for the Hamming Quasi-Cyclic (HQC), which is one of the NIST fourth-round PQC candidates. As encoding is an important step in code-based public key encryption scheme, this paper presents a Lightweight and Time-Efficient (LTE) hardware encoder for HQC. Our proposed design features a streamlined data flow setup to manage the iterative computations between the Reed-Solomon encoder and the Reed-Muller encoder, and a detailed analysis to obtain an optimized Galois field multiplier. The proposed LTE encoder is also implemented on an FPGA platform to demonstrate its area-time efficiency. Our evaluation shows that the proposed hardware implementation of HQC encoder outperforms the most recently reported state-of-the-art hardware implementation with 34.5%, 26.7%, and 35.2% reduction in area-delay product (ADP) for hqc-128, hqc-192, and hqc-256, respectively.

后量子密码学（PQC）越来越受到硬件研究界的关注，尤其是在美国国家标准与技术研究院（NIST）启动 PQC 标准化进程之后。然而，作为 NIST 第四轮 PQC 候选方案之一的 Hamming Quasi-Cyclic (HQC) 的硬件实现却鲜有报道。由于编码是基于代码的公开密钥加密方案的重要步骤，本文提出了一种轻量级、省时（LTE）的 HQC 硬件编码器。我们提出的设计采用精简的数据流设置来管理里德-所罗门编码器和里德-穆勒编码器之间的迭代计算，并通过详细分析获得优化的伽罗瓦场乘法器。我们还在 FPGA 平台上实现了拟议的 LTE 编码器，以展示其面积-时间效率。我们的评估结果表明，所提出的 HQC 编码器硬件实现优于最新报道的最先进硬件实现，在 hqc-128、hqc-192 和 hqc-256 的面积-延迟积 (ADP) 方面分别减少了 34.5%、26.7% 和 35.2%。

{"title":"LTE: Lightweight and Time-Efficient Hardware Encoder for Post-Quantum Scheme HQC","authors":"Yazheng Tu;Pengzhou He;Chip-Hong Chang;Jiafeng Xie","doi":"10.1109/LCA.2024.3435495","DOIUrl":"10.1109/LCA.2024.3435495","url":null,"abstract":"Post-quantum cryptography (PQC) has gained increasing attention across the hardware research community, especially after the National Institute of Standards and Technology (NIST) started the PQC standardization process. There are, however, very few hardware implementations reported for the Hamming Quasi-Cyclic (HQC), which is one of the NIST fourth-round PQC candidates. As encoding is an important step in code-based public key encryption scheme, this paper presents a \u0000<bold>L</b>\u0000ightweight and \u0000<bold>T</b>\u0000ime-\u0000<bold>E</b>\u0000fficient (LTE) hardware encoder for HQC. Our proposed design features a streamlined data flow setup to manage the iterative computations between the Reed-Solomon encoder and the Reed-Muller encoder, and a detailed analysis to obtain an optimized Galois field multiplier. The proposed LTE encoder is also implemented on an FPGA platform to demonstrate its area-time efficiency. Our evaluation shows that the proposed hardware implementation of HQC encoder outperforms the most recently reported state-of-the-art hardware implementation with 34.5%, 26.7%, and 35.2% reduction in area-delay product (ADP) for hqc-128, hqc-192, and hqc-256, respectively.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"187-190"},"PeriodicalIF":1.4,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141870587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Architecting Compatible PIM Protocol for CPU-PIM Collaboration 为 CPU-PIM 协作构建兼容的 PIM 协议

IF 1.4 3区计算机科学 Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Computer Architecture Letters

Pub Date : 2024-07-24 DOI: 10.1109/LCA.2024.3432936

Seunghyuk Yu;Hyeonu Kim;Kyoungho Jeun;Sunyoung Hwang;Eojin Lee

Processing in Memory (PIM) technology is gaining traction with the introduction of several prototype products. However, the interfaces of existing PIM devices hinder CPU performance excessively by delaying normal memory requests for long periods during PIM operations. In this paper, we propose a new PIM command and protocol designed for compatibility across various PIM devices and host processors, focusing on DRAM standards with limited command space. Our proposed command, PIM-ACT, activates multiple banks simultaneously with assigning the specific PIM operation. It closely follows the functionality of the ACT command for straightforward control by the memory controller. We also explore memory scheduling policies that balance the latency of conventional memory requests with the throughput of PIM workloads. Our evaluation demonstrates the effectiveness of our approach in optimizing both PIM and conventional workload performance.

随着一些原型产品的推出，内存处理（PIM）技术正日益受到重视。然而，现有 PIM 设备的接口会在 PIM 操作期间长时间延迟正常的内存请求，从而严重影响 CPU 性能。在本文中，我们提出了一种新的 PIM 命令和协议，旨在兼容各种 PIM 设备和主机处理器，重点关注命令空间有限的 DRAM 标准。我们提出的 PIM-ACT 命令可同时激活多个存储体，并分配特定的 PIM 操作。它与 ACT 命令的功能密切相关，可由内存控制器直接控制。我们还探索了内存调度策略，以平衡传统内存请求的延迟和 PIM 工作负载的吞吐量。我们的评估证明了我们的方法在优化 PIM 和传统工作负载性能方面的有效性。

引用次数: 0

A Quantitative Analysis of State Space Model-Based Large Language Model: Study of Hungry Hungry Hippos 基于状态空间模型的大型语言模型定量分析：饥饿的河马》研究

IF 1.4 3区计算机科学 Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Computer Architecture Letters

Pub Date : 2024-07-03 DOI: 10.1109/LCA.2024.3422492

Dongho Yoon;Taehun Kim;Jae W. Lee;Minsoo Rhu

As the need for processing long contexts in large language models (LLMs) increases, attention-based LLMs face significant challenges due to their high computation and memory requirements. To overcome this challenge, there have been several recent works that seek to alleviate attention's system-level bottlenecks. An approach that has been receiving a lot of attraction lately is state space models (SSMs) thanks to their ability to substantially reduce computational complexity and memory footprint. Despite the excitement around SSMs, there is a lack of an in-depth characterization and analysis on this important model architecture. In this paper, we delve into a representative SSM named Hungry Hungry Hippos (H3), examining its advantages as well as its current limitations. We also discuss future research directions on improving the efficiency of SSMs via hardware architectural support.

随着在大型语言模型（LLM）中处理长语境的需求不断增加，基于注意力的 LLM 因其对计算和内存的高要求而面临巨大挑战。为了克服这一挑战，最近有几项研究试图缓解注意力的系统级瓶颈。状态空间模型（SSM）是近来备受关注的一种方法，因为它能大大降低计算复杂度和内存占用。尽管 SSM 备受关注，但对这种重要的模型架构却缺乏深入的描述和分析。在本文中，我们将深入研究一种具有代表性的 SSM，名为 "饥饿的河马"（Hungry Hungry Hippos，H3），研究它的优势以及目前的局限性。我们还讨论了通过硬件架构支持提高 SSM 效率的未来研究方向。

引用次数: 0

Empirical Architectural Analysis on Performance Scalability of Petascale All-Flash Storage Systems 有关 Petascale 全闪存存储系统性能可扩展性的经验架构分析

IF 1.4 3区计算机科学 Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Computer Architecture Letters

Pub Date : 2024-06-25 DOI: 10.1109/LCA.2024.3418874

Mohammadamin Ajdari;Behrang Montazerzohour;Kimia Abdi;Hossein Asadi

In this paper, we first analyze a real storage system consisting of 72 SSDs utilizing either Hardware RAID (HW-RAID) or Software RAID (SW-RAID), and show that SW-RAID is up to 7× faster. We then reveal that with an increasing number of SSDs, the limited I/O parallelism in SAS controllers and multi-enclosure handshaking overheads cause a significant performance drop, minimizing the total I/O Per Second (IOPS) of a 144-SSD system to less than a single SSD. Second, we disclose the most important architectural parameters that affect a large-scale storage system. Third, we propose a framework that models a large-scale storage system and estimates the system IOPS and system resource usage for various architectures. We verify our framework against a real system and show its high accuracy. Lastly, we analyze a use case of a 240-SSD system and reveal how our framework guides architects in storage system scaling.

在本文中，我们首先分析了一个由 72 个固态硬盘组成的真实存储系统，该系统采用硬件 RAID（HW-RAID）或软件 RAID（SW-RAID），结果表明 SW-RAID 的速度最高可达 7 倍。然后，我们揭示了随着固态硬盘数量的增加，SAS 控制器中有限的 I/O 并行性和多机箱握手开销会导致性能大幅下降，从而将 144-SSD 系统的总每秒 I/O (IOPS) 降低到单个固态硬盘的水平。其次，我们揭示了影响大规模存储系统的最重要的架构参数。第三，我们提出了一个框架，该框架可对大规模存储系统进行建模，并估算各种架构下的系统 IOPS 和系统资源使用量。我们用实际系统验证了我们的框架，并展示了其高准确性。最后，我们分析了 240-SSD 系统的使用案例，揭示了我们的框架如何指导架构师进行存储系统扩展。

引用次数: 0

Accelerating Programmable Bootstrapping Targeting Contemporary GPU Microarchitecture 加速以当代 GPU 微体系结构为目标的可编程引导

IF 1.4 3区计算机科学 Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Computer Architecture Letters

Pub Date : 2024-06-24 DOI: 10.1109/LCA.2024.3418448

Hyesung Ji;Sangpyo Kim;Jaewan Choi;Jung Ho Ahn

Fully homomorphic encryption (FHE) enables computation on encrypted data without privacy leakage, among which GSW-based schemes are notable for supporting the evaluation of arbitrary univariate functions using programmable bootstrapping (PBS). Despite their wide applicability, their computational complexity in a single PBS impedes widespread adoption. However, at the application level, there are enough number of independent PBSs to achieve high data-level parallelism, making them suitable for running on GPUs known for their high parallel computing capability. On contemporary GPUs, peak integer performance has steadily increased, and the sizes of L2 cache and shared memory have also grown rapidly since the Volta architecture. Prior attempts to accelerate PBS on GPUs have fallen short due to their outdated implementations that cannot leverage recent GPU advances. In this paper, we introduce a GPU implementation that supports the latest PBS algorithm and incorporates GPU-trend-aware optimizations. Our implementation achieves a 10.8× performance improvement over the state-of-the-art (SOTA) GPU implementations on RTX 4090 and even outperforms the SOTA ASIC implementation.

全同态加密（FHE）可在不泄露隐私的情况下对加密数据进行计算，其中基于 GSW 的方案因支持使用可编程引导（PBS）对任意单变量函数进行评估而备受瞩目。尽管它们具有广泛的适用性，但单个 PBS 的计算复杂性阻碍了它们的广泛应用。然而，在应用层面上，有足够数量的独立 PBS 可以实现数据级的高度并行性，使它们适合在以高并行计算能力著称的 GPU 上运行。在当代 GPU 上，整数峰值性能稳步提升，二级缓存和共享内存的大小自 Volta 架构以来也迅速增长。之前在 GPU 上加速 PBS 的尝试都因其过时的实现而失败，无法充分利用 GPU 的最新进展。在本文中，我们介绍了一种支持最新 PBS 算法的 GPU 实现，并结合了 GPU 趋势感知优化。与 RTX 4090 上最先进的（SOTA）GPU 实现相比，我们的实现提高了 10.8 倍的性能，甚至优于 SOTA ASIC 实现。

{"title":"Accelerating Programmable Bootstrapping Targeting Contemporary GPU Microarchitecture","authors":"Hyesung Ji;Sangpyo Kim;Jaewan Choi;Jung Ho Ahn","doi":"10.1109/LCA.2024.3418448","DOIUrl":"10.1109/LCA.2024.3418448","url":null,"abstract":"Fully homomorphic encryption (FHE) enables computation on encrypted data without privacy leakage, among which GSW-based schemes are notable for supporting the evaluation of arbitrary univariate functions using programmable bootstrapping (PBS). Despite their wide applicability, their computational complexity in a single PBS impedes widespread adoption. However, at the application level, there are enough number of independent PBSs to achieve high data-level parallelism, making them suitable for running on GPUs known for their high parallel computing capability. On contemporary GPUs, peak integer performance has steadily increased, and the sizes of L2 cache and shared memory have also grown rapidly since the Volta architecture. Prior attempts to accelerate PBS on GPUs have fallen short due to their outdated implementations that cannot leverage recent GPU advances. In this paper, we introduce a GPU implementation that supports the latest PBS algorithm and incorporates GPU-trend-aware optimizations. Our implementation achieves a 10.8× performance improvement over the state-of-the-art (SOTA) GPU implementations on RTX 4090 and even outperforms the SOTA ASIC implementation.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"207-210"},"PeriodicalIF":1.4,"publicationDate":"2024-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10570278","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141532506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

TeleVM: A Lightweight Virtual Machine for RISC-V Architecture TeleVM：适用于 RISC-V 架构的轻量级虚拟机

IF 2.3 3区计算机科学 Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Computer Architecture Letters

Pub Date : 2024-04-30 DOI: 10.1109/LCA.2024.3394835

Tianzheng Li;Enfang Cui;Yuting Wu;Qian Wei;Yue Gao

Serverless computing has become an important paradigm in cloud computing due to its advantages such as fast large-scale deployment and pay-as-you-go charging model. Due to shared infrastructure and multi-tenant environments, serverless applications have high security requirements. Traditional virtual machines and containers cannot fully meet the requirements of serverless applications. Therefore, lightweight virtual machine technology has emerged, which can reduce overhead and boot time while ensuring security. In this letter, we propose TeleVM, a lightweight virtual machine for RISC-V architecture. TeleVM can achieve strong isolation through the hypervisor extension of RISC-V. Compared with traditional virtual machines, TeleVM only implements a small number of IO devices and functions, which can effectively reduce memory overhead and boot time. We compared TeleVM and QEMU+KVM through experiments. Compared to QEMU+KVM, the boot time and memory overhead of TeleVM have decreased by 74% and 90% respectively. This work further improves the cloud computing software ecosystem of RISC-V architecture and promotes the use of RISC-V architecture in cloud computing scenarios.

无服务器计算具有快速大规模部署和现收现付收费模式等优势，已成为云计算的重要范式。由于共享基础设施和多租户环境，无服务器应用程序对安全性有很高的要求。传统的虚拟机和容器无法完全满足无服务器应用程序的要求。因此，轻量级虚拟机技术应运而生，它既能减少开销和启动时间，又能确保安全。在这封信中，我们提出了针对 RISC-V 架构的轻量级虚拟机 TeleVM。TeleVM 可通过对 RISC-V 的管理程序扩展实现强隔离。与传统虚拟机相比，TeleVM 只实现了少量的 IO 设备和功能，可以有效减少内存开销和启动时间。我们通过实验比较了TeleVM和QEMU+KVM。与QEMU+KVM相比，TeleVM的启动时间和内存开销分别减少了74%和90%。这项工作进一步完善了 RISC-V 架构的云计算软件生态系统，促进了 RISC-V 架构在云计算场景中的应用。

{"title":"TeleVM: A Lightweight Virtual Machine for RISC-V Architecture","authors":"Tianzheng Li;Enfang Cui;Yuting Wu;Qian Wei;Yue Gao","doi":"10.1109/LCA.2024.3394835","DOIUrl":"10.1109/LCA.2024.3394835","url":null,"abstract":"Serverless computing has become an important paradigm in cloud computing due to its advantages such as fast large-scale deployment and pay-as-you-go charging model. Due to shared infrastructure and multi-tenant environments, serverless applications have high security requirements. Traditional virtual machines and containers cannot fully meet the requirements of serverless applications. Therefore, lightweight virtual machine technology has emerged, which can reduce overhead and boot time while ensuring security. In this letter, we propose TeleVM, a lightweight virtual machine for RISC-V architecture. TeleVM can achieve strong isolation through the hypervisor extension of RISC-V. Compared with traditional virtual machines, TeleVM only implements a small number of IO devices and functions, which can effectively reduce memory overhead and boot time. We compared TeleVM and QEMU+KVM through experiments. Compared to QEMU+KVM, the boot time and memory overhead of TeleVM have decreased by 74% and 90% respectively. This work further improves the cloud computing software ecosystem of RISC-V architecture and promotes the use of RISC-V architecture in cloud computing scenarios.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"121-124"},"PeriodicalIF":2.3,"publicationDate":"2024-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140830077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

首页上一页

下一页尾页

类型

全部化学•材料生命科学医学物理工程技术环境•农林材料科学地球科学法学管理学化学环境科学与生态学计算机科学教育学经济学农林科学人文科学生物学数学物理与天体物理心理学综合性期刊其他工业工程理学历史学农学文学信息工程

数据库

全部 ACS Publications Elsevier ieeexplore Springer The Royal Society of Chemistry Wiley

期刊

IEEE Computer Architecture Letters

全部 Acc. Chem. Res. ACS Applied Bio Materials ACS Appl. Electron. Mater. ACS Appl. Energy Mater. ACS Appl. Mater. Interfaces ACS Appl. Nano Mater. ACS Appl. Polym. Mater. ACS BIOMATER-SCI ENG ACS Catal. ACS Cent. Sci. ACS Chem. Biol. ACS Chemical Health & Safety ACS Chem. Neurosci. ACS Comb. Sci. ACS Earth Space Chem. ACS Energy Lett. ACS Infect. Dis. ACS Macro Lett. ACS Mater. Lett. ACS Med. Chem. Lett. ACS Nano ACS Omega ACS Photonics ACS Sens. ACS Sustainable Chem. Eng. ACS Synth. Biol. Anal. Chem. BIOCHEMISTRY-US Bioconjugate Chem. BIOMACROMOLECULES Chem. Res. Toxicol. Chem. Rev. Chem. Mater. CRYST GROWTH DES ENERG FUEL Environ. Sci. Technol. Environ. Sci. Technol. Lett. Eur. J. Inorg. Chem. IND ENG CHEM RES Inorg. Chem. J. Agric. Food. Chem. J. Chem. Eng. Data J. Chem. Educ. J. Chem. Inf. Model. J. Chem. Theory Comput. J. Med. Chem. J. Nat. Prod. J PROTEOME RES J. Am. Chem. Soc. LANGMUIR MACROMOLECULES Mol. Pharmaceutics Nano Lett. Org. Lett. ORG PROCESS RES DEV ORGANOMETALLICS J. Org. Chem. J. Phys. Chem. J. Phys. Chem. A J. Phys. Chem. B J. Phys. Chem. C J. Phys. Chem. Lett. Analyst Anal. Methods Biomater. Sci. Catal. Sci. Technol. Chem. Commun. Chem. Soc. Rev. CHEM EDUC RES PRACT CRYSTENGCOMM Dalton Trans. Energy Environ. Sci. ENVIRON SCI-NANO ENVIRON SCI-PROC IMP ENVIRON SCI-WAT RES Faraday Discuss. Food Funct. Green Chem. Inorg. Chem. Front. Integr. Biol. J. Anal. At. Spectrom. J. Mater. Chem. A J. Mater. Chem. B J. Mater. Chem. C Lab Chip Mater. Chem. Front. Mater. Horiz. MEDCHEMCOMM Metallomics Mol. Biosyst. Mol. Syst. Des. Eng. Nanoscale Nanoscale Horiz. Nat. Prod. Rep. New J. Chem. Org. Biomol. Chem. Org. Chem. Front. PHOTOCH PHOTOBIO SCI PCCP Polym. Chem.

﹀