Jack Tigar Humphries, Neel Natu, Kostis Kaffes, Stanko Novaković, Paul Turner, Hank Levy, David Culler, Christos Kozyrakis
The end of Moore's Law and the tightening performance requirements in today's clouds make re-architecting the software stack a necessity. To address this, cloud providers and vendors offload the virtualization control plane and data plane, along with the host OS data plane, to IPUs (SmartNICs), recovering scarce host resources that are then used by applications. However, the host OS control plane--encompassing kernel thread scheduling, memory management, the network stack, file systems, and more--is left on the host CPU and degrades workload performance. This paper presents Wave, a split OS architecture that moves OS subsystem policies to the IPU while keeping OS mechanisms on the host CPU. Wave not only frees host CPU resources, but it reduces host workload interference and leverages network insights on the IPU to improve policy decisions. Wave makes OS control plane offloading practical despite high host-IPU communication latency, lack of a coherent interconnect, and operation across two system images. We present Wave's design and implementation, and implement several OS subsystems in Wave, including kernel thread scheduling, the control plane for a network stack, and memory management. We then evaluate the Wave subsystems on Stubby (scheduling and network), our GCE VM service (scheduling), and RocksDB (memory management and scheduling). We demonstrate that Wave subsystems are competitive with and often superior to on-host subsystems, saving 8 host CPUs for Stubby, 16 host CPUs for database memory management, and improving VM performance by up to 11.2%.
Wave: A Split OS Architecture for Application Engines. arXiv:2408.17351, published 2024-08-30.
Shuai Zhao, Hanzhi Xu, Nan Chen, Ruoxian Su, Wanli Chang
Fully-partitioned fixed-priority scheduling (FP-FPS) multiprocessor systems are widely found in real-time applications, where spin-based protocols are often deployed to manage the mutually exclusive access of shared resources. Unfortunately, existing approaches either enforce rigid spin priority rules for resource accessing or carry significant pessimism in the schedulability analysis, imposing substantial blocking time regardless of task execution urgency or resource over-provisioning. This paper proposes FRAP, a spin-based flexible resource accessing protocol for FP-FPS systems. A task under FRAP can spin at any priority within a range for accessing a resource, allowing flexible and fine-grained resource control with predictable worst-case behaviour. Under flexible spinning, we demonstrate that the existing analysis techniques can lead to incorrect timing bounds and present a novel MCMF (minimum cost maximum flow)-based blocking analysis, providing predictability guarantee for FRAP. A spin priority assignment is reported that fully exploits flexible spinning to reduce the blocking time of tasks with high urgency, enhancing the performance of FRAP. Experimental results show that FRAP outperforms the existing spin-based protocols in schedulability by 15.20%-32.73% on average, up to 65.85%.
FRAP: A Flexible Resource Accessing Protocol for Multiprocessor Real-Time Systems. arXiv:2408.13772, published 2024-08-25.
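The effect of flexible spinning can be illustrated with a toy model (a sketch only: the function and priority values are hypothetical, and FRAP's real guarantees come from its MCMF-based blocking analysis): on a fully-partitioned core, a task spinning at priority `p` can be delayed only by local tasks whose priority exceeds `p`, so raising the spin priority of an urgent request shrinks the set of possible preempters.

```python
# Toy model of flexible spin priorities (illustrative only; FRAP's
# actual analysis is MCMF-based and far more involved). On one core,
# a task spinning for a resource at spin priority `spin_prio` can be
# preempted only by local tasks whose priority exceeds it.

def preempters(local_prios, spin_prio):
    """Priorities of local tasks that can interrupt a spinner."""
    return sorted(q for q in local_prios if q > spin_prio)

local = [1, 3, 5, 7]   # task priorities on this core (higher = more urgent)

# Rigid rule: spin at the task's own base priority (say 3).
print(preempters(local, 3))   # [5, 7] can delay the spinner

# Flexible rule: an urgent request may spin at priority 6 instead.
print(preempters(local, 6))   # only [7] can delay it
```

Spinning higher trades interference on other tasks for less blocking on the urgent one, which is exactly the trade-off FRAP's spin priority assignment exploits.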
Datacenter applications often rely on remote procedure calls (RPCs) for fast, efficient, and secure communication. However, RPCs are slow, inefficient, and hard to use as they require expensive serialization and compression to communicate over a packetized serial network link. Compute Express Link 3.0 (CXL) offers an alternative solution, allowing applications to share data using a cache-coherent, shared-memory interface across clusters of machines. RPCool is a new framework that exploits CXL's shared memory capabilities. RPCool avoids serialization by passing pointers to data structures in shared memory. While avoiding serialization is useful, directly sharing pointer-rich data eliminates the isolation that copying data over traditional networks provides, leaving the receiver vulnerable to invalid pointers and concurrent updates to shared data by the sender. RPCool restores this safety with careful and efficient management of memory permissions. Another significant challenge with CXL shared memory capabilities is that they are unlikely to scale to an entire datacenter. RPCool addresses this by falling back to RDMA-based communication. Overall, RPCool reduces the round-trip latency by 1.93x and 7.2x compared to state-of-the-art RDMA and CXL-based RPC mechanisms, respectively. Moreover, RPCool performs either comparably or better than other RPC mechanisms across a range of workloads.
Telepathic Datacenters: Fast RPCs using Shared CXL Memory. Suyash Mahar, Ehsan Hajyjasini, Seungjin Lee, Zifeng Zhang, Mingyao Shen, Steven Swanson. arXiv:2408.11325, published 2024-08-21.
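The core idea of serialization-free RPC can be sketched using ordinary shared memory as a stand-in for CXL (a rough illustration under stated assumptions: the descriptor layout is hypothetical, and RPCool's permission management and RDMA fallback are elided). The sender places a fixed-layout struct directly in shared memory, and the "call" carries only a reference to it.

```python
# Minimal sketch of serialization-free "RPC" over shared memory,
# in the spirit of RPCool (hypothetical layout; CXL hardware,
# permission management, and the RDMA fallback are all elided).
from multiprocessing import shared_memory
import struct

# Sender: place the argument directly in shared memory and hand the
# receiver only (segment name, offset) -- no serialization step.
shm = shared_memory.SharedMemory(create=True, size=64)
payload = struct.pack("<Q", 12345)        # a fixed-layout struct
shm.buf[8:8 + len(payload)] = payload
descriptor = (shm.name, 8)                # what the "RPC" carries

# Receiver: attach the same segment and read the data in place.
name, off = descriptor
peer = shared_memory.SharedMemory(name=name)
(value,) = struct.unpack_from("<Q", peer.buf, off)
print(value)                              # 12345

peer.close()
shm.close()
shm.unlink()
```

Note what this sketch omits: because the receiver reads the sender's memory in place, it is exposed to dangling pointers and concurrent mutation, which is precisely the safety gap RPCool closes with memory-permission management.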
Ardhi Putra Pratama Hartono, Andrey Brito, Christof Fetzer
Trusted execution environments (TEEs) protect the integrity and confidentiality of running code and its associated data. Nevertheless, TEEs' integrity protection does not extend to the state saved on disk. Furthermore, modern cloud-native applications heavily rely on orchestration (e.g., through systems such as Kubernetes) and, thus, have their services frequently restarted. During restarts, attackers can revert the state of confidential services to a previous version that may aid their malicious intent. This paper presents CRISP, a rollback protection mechanism that uses an existing runtime for Intel SGX and transparently prevents rollback. Our approach can constrain the attack window to a fixed and short period or give developers the tools to avoid the vulnerability window altogether. Finally, experiments show that applying CRISP in a critical stateful cloud-native application may incur a resource increase but only a minor performance penalty.
CRISP: Confidentiality, Rollback, and Integrity Storage Protection for Confidential Cloud-Native Computing. arXiv:2408.06822, published 2024-08-13.
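One generic way to detect the rollback attack described above, which the abstract motivates but which is not necessarily CRISP's actual mechanism (CRISP builds on an existing Intel SGX runtime), is to bind each persisted state to a trusted monotonic counter. The sketch below uses a plain hash as a stand-in for an authenticated seal.

```python
# Generic rollback-detection sketch (illustrative; not CRISP's actual
# design). Each persisted state carries a version; a trusted monotonic
# counter lets a restarting service reject stale (rolled-back) state.
import hashlib, json

class MonotonicCounter:
    """Stand-in for a trusted counter the attacker cannot rewind."""
    def __init__(self):
        self.value = 0
    def increment(self):
        self.value += 1
        return self.value

def seal(state, version):
    blob = json.dumps({"state": state, "version": version})
    # A bare hash is only a placeholder: a real seal would use an
    # authenticated (keyed/MACed) construction inside the enclave.
    tag = hashlib.sha256(blob.encode()).hexdigest()
    return {"blob": blob, "tag": tag}

def unseal(sealed, counter):
    if hashlib.sha256(sealed["blob"].encode()).hexdigest() != sealed["tag"]:
        raise ValueError("state tampered")
    record = json.loads(sealed["blob"])
    if record["version"] != counter.value:
        raise ValueError("rollback detected: stale state version")
    return record["state"]

ctr = MonotonicCounter()
old = seal({"balance": 100}, ctr.increment())  # version 1
new = seal({"balance": 0}, ctr.increment())    # version 2
print(unseal(new, ctr))                        # latest state accepted
# unseal(old, ctr) would raise "rollback detected": an attacker who
# restores the old file cannot also rewind the counter.
```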
The extended Berkeley Packet Filter (eBPF) is extensively utilized for observability and performance analysis in cloud-native environments. However, deploying eBPF programs across a heterogeneous cloud environment presents challenges, including compatibility issues across different kernel versions, operating systems, runtimes, and architectures. Traditional deployment methods, such as standalone containers or tightly integrated core applications, are cumbersome and inefficient, particularly when dynamic plugin management is required. To address these challenges, we introduce Wasm-bpf, a lightweight runtime built on WebAssembly and the WebAssembly System Interface (WASI). Leveraging Wasm's platform independence and WASI's standardized system interface, with enhanced relocation support for different architectures, Wasm-bpf ensures cross-platform compatibility for eBPF programs. It simplifies deployment by integrating with container toolchains, allowing eBPF programs to be packaged as Wasm modules that can be easily managed within cloud environments. Additionally, Wasm-bpf supports dynamic plugin management in WebAssembly.
Wasm-bpf: Streamlining eBPF Deployment in Cloud Environments with WebAssembly. Yusheng Zheng, Tong Yu, Yiwei Yang, Andrew Quinn. arXiv:2408.04856, published 2024-08-09.
Guoyu Wang, Xilong Che, Haoyang Wei, Chenju Pei, Juncheng Hu
NVM is used as a new tier in the storage hierarchy, due to its byte granularity and to its speed and capacity, which fall between those of DRAM and disk. However, consistency problems emerge when we attempt to combine DRAM, NVM, and disk into an efficient whole. In this paper, we discuss the challenging consistency problems faced by heterogeneous storage systems and propose our solution to them. The discussion is based on NVPC as a case study, but the insights apply to similar heterogeneous storage systems.
Crash Consistency in DRAM-NVM-Disk Hybrid Storage System. arXiv:2408.04238, published 2024-08-08.
Cloud platforms today have been deploying hardware accelerators like neural processing units (NPUs) for powering machine learning (ML) inference services. To maximize resource utilization while ensuring reasonable quality of service, a natural approach is to virtualize NPUs for efficient resource sharing for multi-tenant ML services. However, virtualizing NPUs for modern cloud platforms is not easy. This is not only due to the lack of system abstraction support for NPU hardware, but also due to the lack of architectural and ISA support for enabling fine-grained dynamic operator scheduling for virtualized NPUs. We present TCloud, a holistic NPU virtualization framework. We investigate virtualization techniques for NPUs across the entire software and hardware stack. TCloud consists of (1) a flexible NPU abstraction called vNPU, which enables fine-grained virtualization of the heterogeneous compute units in a physical NPU (pNPU); (2) a vNPU resource allocator that enables a pay-as-you-go computing model and flexible vNPU-to-pNPU mappings for improved resource utilization and cost-effectiveness; (3) an ISA extension of modern NPU architecture for facilitating fine-grained tensor operator scheduling for multiple vNPUs. We implement TCloud based on a production-level NPU simulator. Our experiments show that TCloud improves the throughput of ML inference services by up to 1.4x and reduces the tail latency by up to 4.6x, while improving the NPU utilization by 1.2x on average, compared to state-of-the-art NPU sharing approaches.
Hardware-Assisted Virtualization of Neural Processing Units for Cloud Platforms. Yuqi Xue, Yiqi Liu, Lifeng Nai, Jian Huang. arXiv:2408.04104, published 2024-08-07.
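The vNPU-to-pNPU mapping can be pictured with a toy first-fit allocator (illustrative only: the function name, the first-fit policy, and the unit counts are assumptions, not TCloud's actual allocator): each vNPU requests some number of compute units, and several vNPUs can share one pNPU as long as capacity remains.

```python
# Toy first-fit allocator mapping vNPU requests onto physical NPUs
# (illustrative only; the policy is not TCloud's actual design).

def allocate(requests, pnpus):
    """requests: {vnpu_name: compute_units}; pnpus: [capacity, ...].
    Returns {vnpu_name: pnpu_index}, or raises if a request cannot fit."""
    free = list(pnpus)          # remaining compute units per pNPU
    placement = {}
    for vnpu, units in requests.items():
        for i, cap in enumerate(free):
            if cap >= units:    # tenant pays only for units requested
                free[i] -= units
                placement[vnpu] = i
                break
        else:
            raise RuntimeError(f"no pNPU can host {vnpu}")
    return placement

# Two small tenants share one 8-unit pNPU; a larger one gets the second.
print(allocate({"A": 2, "B": 4, "C": 6}, [8, 8]))
```

The fine-grained sharing shown here (A and B co-located on pNPU 0) is what distinguishes vNPU-style virtualization from whole-device assignment.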
Guoyu Wang, Xilong Che, Haoyang Wei, Shuo Chen, Puyi He, Juncheng Hu
Toward compatible utilization of NVM, NVM-specialized kernel file systems and NVM-based disk file system accelerators have been proposed. However, these studies focus on only one or a few characteristics of NVM, failing to exploit its full potential by placing NVM in the proper position in the whole storage stack. In this paper, we present NVPC, a transparent acceleration for existing kernel file systems built on an NVM-enhanced page cache. The acceleration covers two aspects, matching the pressing needs of existing disk file systems: sync writes and cache-missed operations. Meanwhile, the fast DRAM page cache is preserved for cache-hit operations. For sync writes, a high-performance log-based sync absorbing area redirects the data destination from the slow disk to the fast NVM, while the byte-addressable feature of NVM is used to prevent write amplification. For cache-missed operations, NVPC uses the idle space on NVM to extend the DRAM page cache, so that more and larger workloads fit in the cache. NVPC is implemented entirely as a page cache, and thus provides efficient speed-ups to disk file systems with full transparency to users and full compatibility with lower file systems. In Filebench macro-benchmarks, NVPC is up to 3.55x, 2.84x, and 2.64x faster than NOVA, Ext-4, and SPFS, respectively. In RocksDB workloads with a working set larger than DRAM, NVPC is 1.12x, 2.59x, and 2.11x faster than NOVA, Ext-4, and SPFS. Meanwhile, NVPC outperforms NOVA, Ext-4, and SPFS in 62.5% of the tested cases in our read/write/sync mixed evaluation, demonstrating that NVPC is more balanced and adaptive to complex real-world workloads. Experimental results also show that NVPC is the only method that accelerates Ext-4 in particular cases, by up to 15.19x, with no slow-down in any other use case.
NVPC: A Transparent NVM Page Cache. arXiv:2408.02911, published 2024-08-06.
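The sync-absorbing idea can be modeled in a few lines (a sketch under stated assumptions: real NVPC is an in-kernel page cache, and the class and method names here are invented for illustration): a sync write becomes durable the moment it lands in the fast NVM log, and the slow disk is updated lazily by a background flusher.

```python
# Toy model of NVPC-style sync absorption (illustrative; real NVPC
# lives in the kernel page cache, not in user-level Python).

class SyncAbsorbingCache:
    def __init__(self):
        self.nvm_log = []   # fast, byte-addressable, persistent log
        self.disk = {}      # slow backing store: {path: {offset: data}}

    def sync_write(self, path, offset, data):
        # Durability point: append to the NVM log; no disk I/O needed,
        # and byte-granular appends avoid block write amplification.
        self.nvm_log.append((path, offset, data))

    def flush(self):
        # Background: replay log entries onto the disk file system.
        for path, offset, data in self.nvm_log:
            self.disk.setdefault(path, {})[offset] = data
        self.nvm_log.clear()

cache = SyncAbsorbingCache()
cache.sync_write("/db/wal", 0, b"commit-1")  # returns at NVM speed
print(len(cache.nvm_log))                    # 1 entry absorbed
cache.flush()
print(cache.disk["/db/wal"][0])              # b'commit-1'
```

The caller's fsync latency is bounded by the log append, not by the disk, which is where the reported speed-ups on sync-heavy workloads come from.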
In the dynamic landscape of technology, the convergence of Artificial Intelligence (AI) and Operating Systems (OS) has emerged as a pivotal arena for innovation. Our exploration focuses on the symbiotic relationship between AI and OS, emphasizing how AI-driven tools enhance OS performance, security, and efficiency, while OS advancements facilitate more sophisticated AI applications. We delve into various AI techniques employed to optimize OS functionalities, including memory management, process scheduling, and intrusion detection. Simultaneously, we analyze the role of OS in providing essential services and infrastructure that enable effective AI application execution, from resource allocation to data processing. The article also addresses challenges and future directions in this domain, emphasizing the imperative of secure and efficient AI integration within OS frameworks. By examining case studies and recent developments, our review provides a comprehensive overview of the current state of AI-OS integration, underscoring its significance in shaping the next generation of computing technologies. Finally, we explore the promising prospects of Intelligent OSes, considering not only how innovative OS architectures will pave the way for groundbreaking opportunities but also how AI will significantly contribute to advancing these next-generation OSs.
Operating System And Artificial Intelligence: A Systematic Review. Yifan Zhang, Xinkui Zhao, Jianwei Yin, Lufei Zhang, Zuoning Chen. arXiv:2407.14567, published 2024-07-19.
Energy measurement of computer devices, which are widely used in the Internet of Things (IoT), is an important yet challenging task. Most of these IoT devices lack ready-to-use hardware or software for power measurement. A cost-effective solution is to use low-end consumer-grade power meters. However, these low-end power meters cannot provide accurate instantaneous power measurements. In this paper, we propose an easy-to-use approach for deriving an instantaneous software-based energy estimation model with only low-end power meters, based on data-driven analysis through machine learning. Our solution is demonstrated with a Jetson Nano board and a Ruideng UM25C USB power meter. We explore various machine learning methods combined with our smart data collection method and physical measurement. Benchmarks were used to evaluate the derived software power model for the Jetson Nano board and the Raspberry Pi. The results show that 92% accuracy can be achieved relative to long-duration physical measurement. We also develop a kernel module that collects the required utilization and frequency traces at runtime; together with the derived power model, it enables power prediction for programs running in real environments.
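The core pipeline above — collect utilization/frequency traces, pair them with readings from a cheap meter, and fit a model — can be sketched with a closed-form least-squares fit. This is a minimal hypothetical illustration on synthetic data with a single-feature linear model; the paper itself explores several ML methods and a smarter data-collection scheme. The coefficients and the assumed "true" power curve below are invented for the sketch.

```python
import random

random.seed(0)

# Synthetic traces: (cpu_utilization in [0, 1], cpu_frequency in GHz).
# In the paper's setting these would come from a kernel module's runtime traces.
samples = [(random.random(), random.choice([0.6, 1.0, 1.5])) for _ in range(200)]

# Assumed ground truth, used only to fabricate noisy meter readings (watts):
# idle power plus a dynamic term proportional to util * freq.
readings = [1.2 + 2.5 * u * f + random.gauss(0.0, 0.05) for u, f in samples]

# Single-feature linear regression P ~ a + b * (util * freq),
# fitted in closed form with ordinary least squares.
xs = [u * f for u, f in samples]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(readings) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, readings)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

def estimate_power(util, freq_ghz):
    """Instantaneous power estimate (watts) from one utilization/frequency sample."""
    return a + b * util * freq_ghz

print(f"P(util=0.5, 1.0 GHz) ~= {estimate_power(0.5, 1.0):.2f} W")
```

With enough averaged samples the fit recovers the underlying coefficients even though each individual meter reading is noisy — which is exactly why a low-end meter suffices for training such a model.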
{"title":"Data-driven Software-based Power Estimation for Embedded Devices","authors":"Haoyu Wang, Xinyi Li, Ti Zhou, Man Lin","doi":"arxiv-2407.02764","DOIUrl":"https://doi.org/arxiv-2407.02764","url":null,"abstract":"Energy measurement of computer devices, which are widely used in the Internet\u0000of Things (IoT), is an important yet challenging task. Most of these IoT\u0000devices lack ready-to-use hardware or software for power measurement. A\u0000cost-effective solution is to use low-end consumer-grade power meters. However,\u0000these low-end power meters cannot provide accurate instantaneous power\u0000measurements. In this paper, we propose an easy-to-use approach to derive an\u0000instantaneous software-based energy estimation model with only low-end power\u0000meters based on data-driven analysis through machine learning. Our solution is\u0000demonstrated with a Jetson Nano board and Ruideng UM25C USB power meter.\u0000Various machine learning methods combined with our smart data collection method\u0000and physical measurement are explored. Benchmarks were used to evaluate the\u0000derived software-power model for the Jetson Nano board and Raspberry Pi. The\u0000results show that 92% accuracy can be achieved compared to the long-duration\u0000measurement. 
A kernel module that can collect running traces of utilization and\u0000frequencies needed is developed, together with the power model derived, for\u0000power prediction for programs running in real environment.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"68 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141552683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}