Jack Tigar Humphries, Neel Natu, Kostis Kaffes, Stanko Novaković, Paul Turner, Hank Levy, David Culler, Christos Kozyrakis
The end of Moore's Law and the tightening performance requirements in today's clouds make re-architecting the software stack a necessity. To address this, cloud providers and vendors offload the virtualization control plane and data plane, along with the host OS data plane, to IPUs (SmartNICs), recovering scarce host resources that are then used by applications. However, the host OS control plane--encompassing kernel thread scheduling, memory management, the network stack, file systems, and more--is left on the host CPU and degrades workload performance. This paper presents Wave, a split OS architecture that moves OS subsystem policies to the IPU while keeping OS mechanisms on the host CPU. Wave not only frees host CPU resources, but it reduces host workload interference and leverages network insights on the IPU to improve policy decisions. Wave makes OS control plane offloading practical despite high host-IPU communication latency, lack of a coherent interconnect, and operation across two system images. We present Wave's design and implementation, and implement several OS subsystems in Wave, including kernel thread scheduling, the control plane for a network stack, and memory management. We then evaluate the Wave subsystems on Stubby (scheduling and network), our GCE VM service (scheduling), and RocksDB (memory management and scheduling). We demonstrate that Wave subsystems are competitive with and often superior to on-host subsystems, saving 8 host CPUs for Stubby, 16 host CPUs for database memory management, and improving VM performance by up to 11.2%.
Wave: A Split OS Architecture for Application Engines. arXiv:2408.17351, published 2024-08-30.
Shuai Zhao, Hanzhi Xu, Nan Chen, Ruoxian Su, Wanli Chang
Fully-partitioned fixed-priority scheduling (FP-FPS) multiprocessor systems are widely found in real-time applications, where spin-based protocols are often deployed to manage the mutually exclusive access of shared resources. Unfortunately, existing approaches either enforce rigid spin priority rules for resource accessing or carry significant pessimism in the schedulability analysis, imposing substantial blocking time regardless of task execution urgency or resource over-provisioning. This paper proposes FRAP, a spin-based flexible resource accessing protocol for FP-FPS systems. A task under FRAP can spin at any priority within a range for accessing a resource, allowing flexible and fine-grained resource control with predictable worst-case behaviour. Under flexible spinning, we demonstrate that the existing analysis techniques can lead to incorrect timing bounds and present a novel MCMF (minimum cost maximum flow)-based blocking analysis, providing predictability guarantee for FRAP. A spin priority assignment is reported that fully exploits flexible spinning to reduce the blocking time of tasks with high urgency, enhancing the performance of FRAP. Experimental results show that FRAP outperforms the existing spin-based protocols in schedulability by 15.20%-32.73% on average, up to 65.85%.
FRAP: A Flexible Resource Accessing Protocol for Multiprocessor Real-Time Systems. arXiv:2408.13772, published 2024-08-25.
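The effect of flexible spinning can be illustrated with a toy model (a sketch only: the function and priority values are hypothetical, and FRAP's real guarantees come from its MCMF-based blocking analysis): on a fully-partitioned core, a task spinning at priority `p` can be delayed only by local tasks whose priority exceeds `p`, so raising the spin priority of an urgent request shrinks the set of possible preempters.

```python
# Toy model of flexible spin priorities (illustrative only; FRAP's
# actual analysis is MCMF-based and far more involved). On one core,
# a task spinning for a resource at spin priority `spin_prio` can be
# preempted only by local tasks whose priority exceeds it.

def preempters(local_prios, spin_prio):
    """Priorities of local tasks that can interrupt a spinner."""
    return sorted(q for q in local_prios if q > spin_prio)

local = [1, 3, 5, 7]   # task priorities on this core (higher = more urgent)

# Rigid rule: spin at the task's own base priority (say 3).
print(preempters(local, 3))   # [5, 7] can delay the spinner

# Flexible rule: an urgent request may spin at priority 6 instead.
print(preempters(local, 6))   # only [7] can delay it
```

Spinning higher trades interference on other tasks for less blocking on the urgent one, which is exactly the trade-off FRAP's spin priority assignment exploits.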
Datacenter applications often rely on remote procedure calls (RPCs) for fast, efficient, and secure communication. However, RPCs are slow, inefficient, and hard to use as they require expensive serialization and compression to communicate over a packetized serial network link. Compute Express Link 3.0 (CXL) offers an alternative solution, allowing applications to share data using a cache-coherent, shared-memory interface across clusters of machines. RPCool is a new framework that exploits CXL's shared memory capabilities. RPCool avoids serialization by passing pointers to data structures in shared memory. While avoiding serialization is useful, directly sharing pointer-rich data eliminates the isolation that copying data over traditional networks provides, leaving the receiver vulnerable to invalid pointers and concurrent updates to shared data by the sender. RPCool restores this safety with careful and efficient management of memory permissions. Another significant challenge with CXL shared memory capabilities is that they are unlikely to scale to an entire datacenter. RPCool addresses this by falling back to RDMA-based communication. Overall, RPCool reduces the round-trip latency by 1.93x and 7.2x compared to state-of-the-art RDMA and CXL-based RPC mechanisms, respectively. Moreover, RPCool performs either comparably or better than other RPC mechanisms across a range of workloads.
Telepathic Datacenters: Fast RPCs using Shared CXL Memory. Suyash Mahar, Ehsan Hajyjasini, Seungjin Lee, Zifeng Zhang, Mingyao Shen, Steven Swanson. arXiv:2408.11325, published 2024-08-21.
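The core idea of serialization-free RPC can be sketched using ordinary shared memory as a stand-in for CXL (a rough illustration under stated assumptions: the descriptor layout is hypothetical, and RPCool's permission management and RDMA fallback are elided). The sender places a fixed-layout struct directly in shared memory, and the "call" carries only a reference to it.

```python
# Minimal sketch of serialization-free "RPC" over shared memory,
# in the spirit of RPCool (hypothetical layout; CXL hardware,
# permission management, and the RDMA fallback are all elided).
from multiprocessing import shared_memory
import struct

# Sender: place the argument directly in shared memory and hand the
# receiver only (segment name, offset) -- no serialization step.
shm = shared_memory.SharedMemory(create=True, size=64)
payload = struct.pack("<Q", 12345)        # a fixed-layout struct
shm.buf[8:8 + len(payload)] = payload
descriptor = (shm.name, 8)                # what the "RPC" carries

# Receiver: attach the same segment and read the data in place.
name, off = descriptor
peer = shared_memory.SharedMemory(name=name)
(value,) = struct.unpack_from("<Q", peer.buf, off)
print(value)                              # 12345

peer.close()
shm.close()
shm.unlink()
```

Note what this sketch omits: because the receiver reads the sender's memory in place, it is exposed to dangling pointers and concurrent mutation, which is precisely the safety gap RPCool closes with memory-permission management.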
Ardhi Putra Pratama Hartono, Andrey Brito, Christof Fetzer
Trusted execution environments (TEEs) protect the integrity and confidentiality of running code and its associated data. Nevertheless, TEEs' integrity protection does not extend to the state saved on disk. Furthermore, modern cloud-native applications heavily rely on orchestration (e.g., through systems such as Kubernetes) and, thus, have their services frequently restarted. During restarts, attackers can revert the state of confidential services to a previous version that may aid their malicious intent. This paper presents CRISP, a rollback protection mechanism that uses an existing runtime for Intel SGX and transparently prevents rollback. Our approach can constrain the attack window to a fixed and short period or give developers the tools to avoid the vulnerability window altogether. Finally, experiments show that applying CRISP in a critical stateful cloud-native application may incur a resource increase but only a minor performance penalty.
CRISP: Confidentiality, Rollback, and Integrity Storage Protection for Confidential Cloud-Native Computing. arXiv:2408.06822, published 2024-08-13.
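One generic way to detect the rollback attack described above, which the abstract motivates but which is not necessarily CRISP's actual mechanism (CRISP builds on an existing Intel SGX runtime), is to bind each persisted state to a trusted monotonic counter. The sketch below uses a plain hash as a stand-in for an authenticated seal.

```python
# Generic rollback-detection sketch (illustrative; not CRISP's actual
# design). Each persisted state carries a version; a trusted monotonic
# counter lets a restarting service reject stale (rolled-back) state.
import hashlib, json

class MonotonicCounter:
    """Stand-in for a trusted counter the attacker cannot rewind."""
    def __init__(self):
        self.value = 0
    def increment(self):
        self.value += 1
        return self.value

def seal(state, version):
    blob = json.dumps({"state": state, "version": version})
    # A bare hash is only a placeholder: a real seal would use an
    # authenticated (keyed/MACed) construction inside the enclave.
    tag = hashlib.sha256(blob.encode()).hexdigest()
    return {"blob": blob, "tag": tag}

def unseal(sealed, counter):
    if hashlib.sha256(sealed["blob"].encode()).hexdigest() != sealed["tag"]:
        raise ValueError("state tampered")
    record = json.loads(sealed["blob"])
    if record["version"] != counter.value:
        raise ValueError("rollback detected: stale state version")
    return record["state"]

ctr = MonotonicCounter()
old = seal({"balance": 100}, ctr.increment())  # version 1
new = seal({"balance": 0}, ctr.increment())    # version 2
print(unseal(new, ctr))                        # latest state accepted
# unseal(old, ctr) would raise "rollback detected": an attacker who
# restores the old file cannot also rewind the counter.
```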
The extended Berkeley Packet Filter (eBPF) is extensively utilized for observability and performance analysis in cloud-native environments. However, deploying eBPF programs across a heterogeneous cloud environment presents challenges, including compatibility issues across different kernel versions, operating systems, runtimes, and architectures. Traditional deployment methods, such as standalone containers or tightly integrated core applications, are cumbersome and inefficient, particularly when dynamic plugin management is required. To address these challenges, we introduce Wasm-bpf, a lightweight runtime built on WebAssembly and the WebAssembly System Interface (WASI). Leveraging Wasm's platform independence and WASI's standardized system interface, with enhanced relocation support for different architectures, Wasm-bpf ensures cross-platform compatibility for eBPF programs. It simplifies deployment by integrating with container toolchains, allowing eBPF programs to be packaged as Wasm modules that can be easily managed within cloud environments. Additionally, Wasm-bpf supports dynamic plugin management in WebAssembly.
Wasm-bpf: Streamlining eBPF Deployment in Cloud Environments with WebAssembly. Yusheng Zheng, Tong Yu, Yiwei Yang, Andrew Quinn. arXiv:2408.04856, published 2024-08-09.
Guoyu Wang, Xilong Che, Haoyang Wei, Chenju Pei, Juncheng Hu
NVM is used as a new tier in the storage hierarchy, due to its byte granularity and to its speed and capacity, which fall between those of DRAM and disk. However, consistency problems emerge when we attempt to combine DRAM, NVM, and disk into an efficient whole. In this paper, we discuss the challenging consistency problems faced by heterogeneous storage systems and propose our solution to them. The discussion is based on NVPC as a case study, but the insights apply to similar heterogeneous storage systems.
Crash Consistency in DRAM-NVM-Disk Hybrid Storage System. arXiv:2408.04238, published 2024-08-08.
Cloud platforms today have been deploying hardware accelerators like neural processing units (NPUs) for powering machine learning (ML) inference services. To maximize resource utilization while ensuring reasonable quality of service, a natural approach is to virtualize NPUs for efficient resource sharing for multi-tenant ML services. However, virtualizing NPUs for modern cloud platforms is not easy. This is not only due to the lack of system abstraction support for NPU hardware, but also due to the lack of architectural and ISA support for enabling fine-grained dynamic operator scheduling for virtualized NPUs. We present TCloud, a holistic NPU virtualization framework. We investigate virtualization techniques for NPUs across the entire software and hardware stack. TCloud consists of (1) a flexible NPU abstraction called vNPU, which enables fine-grained virtualization of the heterogeneous compute units in a physical NPU (pNPU); (2) a vNPU resource allocator that enables a pay-as-you-go computing model and flexible vNPU-to-pNPU mappings for improved resource utilization and cost-effectiveness; (3) an ISA extension of modern NPU architecture for facilitating fine-grained tensor operator scheduling for multiple vNPUs. We implement TCloud based on a production-level NPU simulator. Our experiments show that TCloud improves the throughput of ML inference services by up to 1.4x and reduces the tail latency by up to 4.6x, while improving the NPU utilization by 1.2x on average, compared to state-of-the-art NPU sharing approaches.
Hardware-Assisted Virtualization of Neural Processing Units for Cloud Platforms. Yuqi Xue, Yiqi Liu, Lifeng Nai, Jian Huang. arXiv:2408.04104, published 2024-08-07.
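The vNPU-to-pNPU mapping can be pictured with a toy first-fit allocator (illustrative only: the function name, the first-fit policy, and the unit counts are assumptions, not TCloud's actual allocator): each vNPU requests some number of compute units, and several vNPUs can share one pNPU as long as capacity remains.

```python
# Toy first-fit allocator mapping vNPU requests onto physical NPUs
# (illustrative only; the policy is not TCloud's actual design).

def allocate(requests, pnpus):
    """requests: {vnpu_name: compute_units}; pnpus: [capacity, ...].
    Returns {vnpu_name: pnpu_index}, or raises if a request cannot fit."""
    free = list(pnpus)          # remaining compute units per pNPU
    placement = {}
    for vnpu, units in requests.items():
        for i, cap in enumerate(free):
            if cap >= units:    # tenant pays only for units requested
                free[i] -= units
                placement[vnpu] = i
                break
        else:
            raise RuntimeError(f"no pNPU can host {vnpu}")
    return placement

# Two small tenants share one 8-unit pNPU; a larger one gets the second.
print(allocate({"A": 2, "B": 4, "C": 6}, [8, 8]))
```

The fine-grained sharing shown here (A and B co-located on pNPU 0) is what distinguishes vNPU-style virtualization from whole-device assignment.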
Guoyu Wang, Xilong Che, Haoyang Wei, Shuo Chen, Puyi He, Juncheng Hu
Toward compatible utilization of NVM, NVM-specialized kernel file systems and NVM-based disk file system accelerators have been proposed. However, these studies focus on only one or a few characteristics of NVM, failing to exploit its full potential by placing NVM in the proper position in the whole storage stack. In this paper, we present NVPC, a transparent acceleration for existing kernel file systems built on an NVM-enhanced page cache. The acceleration covers two aspects, matching the pressing needs of existing disk file systems: sync writes and cache-missed operations. Meanwhile, the fast DRAM page cache is preserved for cache-hit operations. For sync writes, a high-performance log-based sync absorbing area redirects the data destination from the slow disk to the fast NVM, while the byte-addressable feature of NVM is used to prevent write amplification. For cache-missed operations, NVPC uses the idle space on NVM to extend the DRAM page cache, so that more and larger workloads fit in the cache. NVPC is implemented entirely as a page cache, and thus provides efficient speed-ups to disk file systems with full transparency to users and full compatibility with lower file systems. In Filebench macro-benchmarks, NVPC is up to 3.55x, 2.84x, and 2.64x faster than NOVA, Ext-4, and SPFS, respectively. In RocksDB workloads with a working set larger than DRAM, NVPC is 1.12x, 2.59x, and 2.11x faster than NOVA, Ext-4, and SPFS. Meanwhile, NVPC outperforms NOVA, Ext-4, and SPFS in 62.5% of the tested cases in our read/write/sync mixed evaluation, demonstrating that NVPC is more balanced and adaptive to complex real-world workloads. Experimental results also show that NVPC is the only method that accelerates Ext-4 in particular cases, by up to 15.19x, with no slow-down in any other use case.
NVPC: A Transparent NVM Page Cache. arXiv:2408.02911, published 2024-08-06.
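The sync-absorbing idea can be modeled in a few lines (a sketch under stated assumptions: real NVPC is an in-kernel page cache, and the class and method names here are invented for illustration): a sync write becomes durable the moment it lands in the fast NVM log, and the slow disk is updated lazily by a background flusher.

```python
# Toy model of NVPC-style sync absorption (illustrative; real NVPC
# lives in the kernel page cache, not in user-level Python).

class SyncAbsorbingCache:
    def __init__(self):
        self.nvm_log = []   # fast, byte-addressable, persistent log
        self.disk = {}      # slow backing store: {path: {offset: data}}

    def sync_write(self, path, offset, data):
        # Durability point: append to the NVM log; no disk I/O needed,
        # and byte-granular appends avoid block write amplification.
        self.nvm_log.append((path, offset, data))

    def flush(self):
        # Background: replay log entries onto the disk file system.
        for path, offset, data in self.nvm_log:
            self.disk.setdefault(path, {})[offset] = data
        self.nvm_log.clear()

cache = SyncAbsorbingCache()
cache.sync_write("/db/wal", 0, b"commit-1")  # returns at NVM speed
print(len(cache.nvm_log))                    # 1 entry absorbed
cache.flush()
print(cache.disk["/db/wal"][0])              # b'commit-1'
```

The caller's fsync latency is bounded by the log append, not by the disk, which is where the reported speed-ups on sync-heavy workloads come from.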
In the dynamic landscape of technology, the convergence of Artificial Intelligence (AI) and Operating Systems (OS) has emerged as a pivotal arena for innovation. Our exploration focuses on the symbiotic relationship between AI and OS, emphasizing how AI-driven tools enhance OS performance, security, and efficiency, while OS advancements facilitate more sophisticated AI applications. We delve into various AI techniques employed to optimize OS functionalities, including memory management, process scheduling, and intrusion detection. Simultaneously, we analyze the role of OS in providing essential services and infrastructure that enable effective AI application execution, from resource allocation to data processing. The article also addresses challenges and future directions in this domain, emphasizing the imperative of secure and efficient AI integration within OS frameworks. By examining case studies and recent developments, our review provides a comprehensive overview of the current state of AI-OS integration, underscoring its significance in shaping the next generation of computing technologies. Finally, we explore the promising prospects of Intelligent OSes, considering not only how innovative OS architectures will pave the way for groundbreaking opportunities but also how AI will significantly contribute to advancing these next-generation OSs.
Operating System And Artificial Intelligence: A Systematic Review. Yifan Zhang, Xinkui Zhao, Jianwei Yin, Lufei Zhang, Zuoning Chen. arXiv:2407.14567, published 2024-07-19.
Energy measurement of computer devices, which are widely used in the Internet of Things (IoT), is an important yet challenging task. Most of these IoT devices lack ready-to-use hardware or software for power measurement. A cost-effective solution is to use low-end consumer-grade power meters. However, these low-end power meters cannot provide accurate instantaneous power measurements. In this paper, we propose an easy-to-use approach for deriving an instantaneous software-based energy estimation model with only low-end power meters, based on data-driven analysis through machine learning. Our solution is demonstrated with a Jetson Nano board and a Ruideng UM25C USB power meter. We explore various machine learning methods combined with our smart data collection method and physical measurement. Benchmarks were used to evaluate the derived software power model for the Jetson Nano board and the Raspberry Pi. The results show that 92% accuracy can be achieved relative to long-duration physical measurement. We also develop a kernel module that collects the required utilization and frequency traces at runtime; together with the derived power model, it enables power prediction for programs running in real environments.
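The core pipeline above — collect utilization/frequency traces, pair them with readings from a cheap meter, and fit a model — can be sketched with a closed-form least-squares fit. This is a minimal hypothetical illustration on synthetic data with a single-feature linear model; the paper itself explores several ML methods and a smarter data-collection scheme. The coefficients and the assumed "true" power curve below are invented for the sketch.

```python
import random

random.seed(0)

# Synthetic traces: (cpu_utilization in [0, 1], cpu_frequency in GHz).
# In the paper's setting these would come from a kernel module's runtime traces.
samples = [(random.random(), random.choice([0.6, 1.0, 1.5])) for _ in range(200)]

# Assumed ground truth, used only to fabricate noisy meter readings (watts):
# idle power plus a dynamic term proportional to util * freq.
readings = [1.2 + 2.5 * u * f + random.gauss(0.0, 0.05) for u, f in samples]

# Single-feature linear regression P ~ a + b * (util * freq),
# fitted in closed form with ordinary least squares.
xs = [u * f for u, f in samples]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(readings) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, readings)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

def estimate_power(util, freq_ghz):
    """Instantaneous power estimate (watts) from one utilization/frequency sample."""
    return a + b * util * freq_ghz

print(f"P(util=0.5, 1.0 GHz) ~= {estimate_power(0.5, 1.0):.2f} W")
```

With enough averaged samples the fit recovers the underlying coefficients even though each individual meter reading is noisy — which is exactly why a low-end meter suffices for training such a model.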
{"title":"Data-driven Software-based Power Estimation for Embedded Devices","authors":"Haoyu Wang, Xinyi Li, Ti Zhou, Man Lin","doi":"arxiv-2407.02764","DOIUrl":"https://doi.org/arxiv-2407.02764","url":null,"abstract":"Energy measurement of computer devices, which are widely used in the Internet\u0000of Things (IoT), is an important yet challenging task. Most of these IoT\u0000devices lack ready-to-use hardware or software for power measurement. A\u0000cost-effective solution is to use low-end consumer-grade power meters. However,\u0000these low-end power meters cannot provide accurate instantaneous power\u0000measurements. In this paper, we propose an easy-to-use approach to derive an\u0000instantaneous software-based energy estimation model with only low-end power\u0000meters based on data-driven analysis through machine learning. Our solution is\u0000demonstrated with a Jetson Nano board and Ruideng UM25C USB power meter.\u0000Various machine learning methods combined with our smart data collection method\u0000and physical measurement are explored. Benchmarks were used to evaluate the\u0000derived software-power model for the Jetson Nano board and Raspberry Pi. The\u0000results show that 92% accuracy can be achieved compared to the long-duration\u0000measurement. 
A kernel module that can collect running traces of utilization and\u0000frequencies needed is developed, together with the power model derived, for\u0000power prediction for programs running in real environment.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"68 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141552683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}