Frederic Schimmelpfennig, André Brinkmann, Hossein Asadi, Reza Salkhordeh
Memory access efficiency is significantly enhanced by caching recent address translations in the CPUs' Translation Lookaside Buffers (TLBs). However, since the operating system is not aware of which core is using a particular mapping, whenever addresses are unmapped it flushes TLB entries across all cores where the application runs to ensure security and consistency. These TLB flushes, known as TLB shootdowns, are costly and create a performance and scalability bottleneck. A key contributor to TLB shootdowns is memory-mapped I/O, particularly during mmap-munmap cycles and page cache evictions. Often, the same physical pages are reassigned to the same process post-eviction, presenting an opportunity for the operating system to reduce the frequency of TLB shootdowns. We demonstrate that by slightly extending the mmap function, TLB shootdowns for these "recycled pages" can be avoided. We therefore introduce and implement the "fast page recycling" (FPR) feature within the mmap system call. FPR-mmaps maintain security by only triggering TLB shootdowns when a page exits its recycling cycle and is allocated to a different process. To ensure consistency when FPR-mmap pointers are used, we made minor adjustments to virtual memory management to avoid the ABA problem. Unlike previous methods to mitigate shootdown effects, our approach does not require any hardware modifications and operates transparently within the existing Linux virtual memory framework. Our evaluations across a variety of CPU, memory, and storage setups, including persistent memory and Optane SSDs, demonstrate that FPR delivers notable performance gains, with improvements of up to 28% in real-world applications and 92% in micro-benchmarks. Additionally, we show that TLB shootdowns are a significant source of bottlenecks previously misattributed to other components of the Linux kernel.
{"title":"Skip TLB flushes for reused pages within mmap's","authors":"Frederic Schimmelpfennig, André Brinkmann, Hossein Asadi, Reza Salkhordeh","doi":"arxiv-2409.10946","DOIUrl":"https://doi.org/arxiv-2409.10946","url":null,"abstract":"Memory access efficiency is significantly enhanced by caching recent address\u0000translations in the CPUs' Translation Lookaside Buffers (TLBs). However, since\u0000the operating system is not aware of which core is using a particular mapping,\u0000it flushes TLB entries across all cores where the application runs whenever\u0000addresses are unmapped, ensuring security and consistency. These TLB flushes,\u0000known as TLB shootdowns, are costly and create a performance and scalability\u0000bottleneck. A key contributor to TLB shootdowns is memory-mapped I/O,\u0000particularly during mmap-munmap cycles and page cache evictions. Often, the\u0000same physical pages are reassigned to the same process post-eviction,\u0000presenting an opportunity for the operating system to reduce the frequency of\u0000TLB shootdowns. We demonstrate, that by slightly extending the mmap function,\u0000TLB shootdowns for these \"recycled pages\" can be avoided. Therefore we introduce and implement the \"fast page recycling\" (FPR) feature\u0000within the mmap system call. FPR-mmaps maintain security by only triggering TLB\u0000shootdowns when a page exits its recycling cycle and is allocated to a\u0000different process. To ensure consistency when FPR-mmap pointers are used, we\u0000made minor adjustments to virtual memory management to avoid the ABA problem.\u0000Unlike previous methods to mitigate shootdown effects, our approach does not\u0000require any hardware modifications and operates transparently within the\u0000existing Linux virtual memory framework. Our evaluations across a variety of CPU, memory, and storage setups,\u0000including persistent memory and Optane SSDs, demonstrate that FPR delivers\u0000notable performance gains, with improvements of up to 28% in real-world\u0000applications and 92% in micro-benchmarks. Additionally, we show that TLB\u0000shootdowns are a significant source of bottlenecks, previously misattributed to\u0000other components of the Linux kernel.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"14 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This research analyzed the performance and consistency of four synchronization mechanisms (reentrant locks, semaphores, synchronized methods, and synchronized blocks) across three operating systems: macOS, Windows, and Linux. Synchronization ensures that concurrent processes or threads access shared resources safely, and efficient synchronization is vital for maintaining system performance and reliability. The study aimed to identify the synchronization mechanism that best balances efficiency, measured by execution time, and consistency, assessed by variance and standard deviation, across platforms. The initial hypothesis proposed that mutex-based mechanisms, specifically synchronized methods and blocks, would be the most efficient due to their simplicity. However, empirical results showed that reentrant locks had the lowest average execution time (14.67 ms), making them the most efficient mechanism, but with the highest variability (standard deviation of 1.15). In contrast, synchronized methods, blocks, and semaphores exhibited higher average execution times (16.33 ms for methods and 16.67 ms for blocks) but greater consistency (variance of 0.33). The findings indicated that while reentrant locks were faster, they were more platform-dependent, whereas mutex-based mechanisms provided more predictable performance across all operating systems. The use of virtual machines for Windows and Linux was a limitation that may have affected the results. Future research should include native testing and explore additional synchronization mechanisms and higher concurrency levels. These insights help developers and system designers optimize synchronization strategies for either performance or stability, depending on the application's requirements.
{"title":"Analysis of Synchronization Mechanisms in Operating Systems","authors":"Oluwatoyin Kode, Temitope Oyemade","doi":"arxiv-2409.11271","DOIUrl":"https://doi.org/arxiv-2409.11271","url":null,"abstract":"This research analyzed the performance and consistency of four\u0000synchronization mechanisms-reentrant locks, semaphores, synchronized methods,\u0000and synchronized blocks-across three operating systems: macOS, Windows, and\u0000Linux. Synchronization ensures that concurrent processes or threads access\u0000shared resources safely, and efficient synchronization is vital for maintaining\u0000system performance and reliability. The study aimed to identify the\u0000synchronization mechanism that balances efficiency, measured by execution time,\u0000and consistency, assessed by variance and standard deviation, across platforms.\u0000The initial hypothesis proposed that mutex-based mechanisms, specifically\u0000synchronized methods and blocks, would be the most efficient due to their\u0000simplicity. However, empirical results showed that reentrant locks had the\u0000lowest average execution time (14.67ms), making them the most efficient\u0000mechanism, but with the highest variability (standard deviation of 1.15). In\u0000contrast, synchronized methods, blocks, and semaphores exhibited higher average\u0000execution times (16.33ms for methods and 16.67ms for blocks) but with greater\u0000consistency (variance of 0.33). The findings indicated that while reentrant\u0000locks were faster, they were more platform-dependent, whereas mutex-based\u0000mechanisms provided more predictable performance across all operating systems.\u0000The use of virtual machines for Windows and Linux was a limitation, potentially\u0000affecting the results. Future research should include native testing and\u0000explore additional synchronization mechanisms and higher concurrency levels.\u0000These insights help developers and system designers optimize synchronization\u0000strategies for either performance or stability, depending on the application's\u0000requirements.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"191 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We leverage eBPF to implement custom policies in the Linux memory subsystem. Inspired by CBMM, we create a mechanism that provides the kernel with hints about the benefit of promoting a page to a specific size. We introduce a new hook point in the Linux page-fault handling path for eBPF programs, providing them the context necessary to determine the page size to be used. We then develop a framework that allows users to define profiles for their applications and load them into the kernel. A profile consists of memory regions of interest and their expected benefit from being backed by 4KB, 64KB, and 2MB pages. In our evaluation, we profiled our workloads with DAMON to identify hot memory regions.
{"title":"eBPF-mm: Userspace-guided memory management in Linux with eBPF","authors":"Konstantinos Mores, Stratos Psomadakis, Georgios Goumas","doi":"arxiv-2409.11220","DOIUrl":"https://doi.org/arxiv-2409.11220","url":null,"abstract":"We leverage eBPF in order to implement custom policies in the Linux memory\u0000subsystem. Inspired by CBMM, we create a mechanism that provides the kernel\u0000with hints regarding the benefit of promoting a page to a specific size. We\u0000introduce a new hook point in Linux page fault handling path for eBPF programs,\u0000providing them the necessary context to determine the page size to be used. We\u0000then develop a framework that allows users to define profiles for their\u0000applications and load them into the kernel. A profile consists of memory\u0000regions of interest and their expected benefit from being backed by 4KB, 64KB\u0000and 2MB pages. In our evaluation, we profiled our workloads to identify hot\u0000memory regions using DAMON.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"13 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250095","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yinggang Guo, Zicheng Wang, Weiheng Bai, Qingkai Zeng, Kangjie Lu
The endless stream of vulnerabilities urgently calls for principled mitigation to confine the effect of exploitation. However, the monolithic architecture of commodity OS kernels, like the Linux kernel, allows an attacker to compromise the entire system by exploiting a vulnerability in any kernel component. Kernel compartmentalization is a promising approach that follows the least-privilege principle. However, existing mechanisms struggle with the trade-off among security, scalability, and performance, given the challenges stemming from mutual untrustworthiness among numerous and complex components. In this paper, we present BULKHEAD, a secure, scalable, and efficient kernel compartmentalization technique that offers bi-directional isolation for an unlimited number of compartments. It leverages Intel's new hardware feature, PKS, to isolate data and code into mutually untrusted compartments and benefits from its fast compartment switching. With mutual distrust in mind, BULKHEAD introduces a lightweight in-kernel monitor that enforces multiple important security invariants, including data integrity, execute-only memory, and compartment interface integrity. In addition, it provides a locality-aware two-level scheme that scales to unlimited compartments. We implement a prototype system on Linux v6.1 to compartmentalize loadable kernel modules (LKMs). Extensive evaluation confirms the effectiveness of our approach. System-wide, BULKHEAD incurs an average performance overhead of 2.44% for real-world applications with 160 compartmentalized LKMs. Focusing on a single compartment, ApacheBench tests on the ipv6 module show an overhead of less than 2%. Moreover, performance is almost unaffected by the number of compartments, which makes the approach highly scalable.
{"title":"BULKHEAD: Secure, Scalable, and Efficient Kernel Compartmentalization with PKS","authors":"Yinggang Guo, Zicheng Wang, Weiheng Bai, Qingkai Zeng, Kangjie Lu","doi":"arxiv-2409.09606","DOIUrl":"https://doi.org/arxiv-2409.09606","url":null,"abstract":"The endless stream of vulnerabilities urgently calls for principled\u0000mitigation to confine the effect of exploitation. However, the monolithic\u0000architecture of commodity OS kernels, like the Linux kernel, allows an attacker\u0000to compromise the entire system by exploiting a vulnerability in any kernel\u0000component. Kernel compartmentalization is a promising approach that follows the\u0000least-privilege principle. However, existing mechanisms struggle with the\u0000trade-off on security, scalability, and performance, given the challenges\u0000stemming from mutual untrustworthiness among numerous and complex components. In this paper, we present BULKHEAD, a secure, scalable, and efficient kernel\u0000compartmentalization technique that offers bi-directional isolation for\u0000unlimited compartments. It leverages Intel's new hardware feature PKS to\u0000isolate data and code into mutually untrusted compartments and benefits from\u0000its fast compartment switching. With untrust in mind, BULKHEAD introduces a\u0000lightweight in-kernel monitor that enforces multiple important security\u0000invariants, including data integrity, execute-only memory, and compartment\u0000interface integrity. In addition, it provides a locality-aware two-level scheme\u0000that scales to unlimited compartments. We implement a prototype system on Linux\u0000v6.1 to compartmentalize loadable kernel modules (LKMs). Extensive evaluation\u0000confirms the effectiveness of our approach. As the system-wide impacts,\u0000BULKHEAD incurs an average performance overhead of 2.44% for real-world\u0000applications with 160 compartmentalized LKMs. While focusing on a specific\u0000compartment, ApacheBench tests on ipv6 show an overhead of less than 2%.\u0000Moreover, the performance is almost unaffected by the number of compartments,\u0000which makes it highly scalable.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"46 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anastasiia Ruzhanskaia, Pengcheng Xu, David Cock, Timothy Roscoe
Conventional wisdom holds that an efficient interface between an OS running on a CPU and a high-bandwidth I/O device should be based on Direct Memory Access (DMA), descriptor rings, and interrupts: DMA offloads transfers from the CPU, descriptor rings provide buffering and queuing, and interrupts facilitate asynchronous interaction between cores and devices with a lightweight notification mechanism. In this paper we question this wisdom in the light of modern hardware and workloads, particularly in cloud servers. We argue that the assumptions that led to this model are obsolete, and that in many use cases programmed I/O, where the CPU explicitly transfers data and control information to and from a device via loads and stores, actually results in a more efficient system. We quantitatively demonstrate these advantages using three use cases: fine-grained RPC-style invocation of functions on an accelerator, offloading of operators in a streaming dataflow engine, and a network interface targeting serverless functions. Moreover, we show that while these advantages are significant over a modern PCIe peripheral bus, a truly cache-coherent interconnect offers significant additional efficiency gains.
{"title":"Rethinking Programmed I/O for Fast Devices, Cheap Cores, and Coherent Interconnects","authors":"Anastasiia Ruzhanskaia, Pengcheng Xu, David Cock, Timothy Roscoe","doi":"arxiv-2409.08141","DOIUrl":"https://doi.org/arxiv-2409.08141","url":null,"abstract":"Conventional wisdom holds that an efficient interface between an OS running\u0000on a CPU and a high-bandwidth I/O device should be based on Direct Memory\u0000Access (DMA), descriptor rings, and interrupts: DMA offloads transfers from the\u0000CPU, descriptor rings provide buffering and queuing, and interrupts facilitate\u0000asynchronous interaction between cores and device with a lightweight\u0000notification mechanism. In this paper we question this wisdom in the light of\u0000modern hardware and workloads, particularly in cloud servers. We argue that the\u0000assumptions that led to this model are obsolete, and in many use-cases use of\u0000programmed I/O, where the CPU explicitly transfers data and control information\u0000to and from a device via loads and stores, actually results in a more efficient\u0000system. We quantitatively demonstrate these advantages using three use-cases:\u0000fine-grained RPC-style invocation of functions on an accelerator, offloading of\u0000operators in a streaming dataflow engine, and a network interface targeting for\u0000serverless functions. Moreover, we show that while these advantages are\u0000significant over a modern PCIe peripheral bus, a truly cache-coherent\u0000interconnect offers significant additional efficiency gains.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142219776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Soo Yee Lim, Tanya Prasad, Xueyuan Han, Thomas Pasquier
The eBPF framework enables execution of user-provided code in the Linux kernel. In the last few years, a large ecosystem of cloud services has leveraged eBPF to enhance container security, system observability, and network management. Meanwhile, incessant discoveries of memory safety vulnerabilities have left the systems community with no choice but to disallow unprivileged eBPF programs, which unfortunately limits eBPF use to only privileged users. To improve run-time safety of the framework, we introduce SafeBPF, a general design that isolates eBPF programs from the rest of the kernel to prevent memory safety vulnerabilities from being exploited. We present a pure software implementation using a Software-based Fault Isolation (SFI) approach and a hardware-assisted implementation that leverages ARM's Memory Tagging Extension (MTE). We show that SafeBPF incurs up to 4% overhead on macrobenchmarks while achieving desired security properties.
{"title":"SafeBPF: Hardware-assisted Defense-in-depth for eBPF Kernel Extensions","authors":"Soo Yee Lim, Tanya Prasad, Xueyuan Han, Thomas Pasquier","doi":"arxiv-2409.07508","DOIUrl":"https://doi.org/arxiv-2409.07508","url":null,"abstract":"The eBPF framework enables execution of user-provided code in the Linux\u0000kernel. In the last few years, a large ecosystem of cloud services has\u0000leveraged eBPF to enhance container security, system observability, and network\u0000management. Meanwhile, incessant discoveries of memory safety vulnerabilities\u0000have left the systems community with no choice but to disallow unprivileged\u0000eBPF programs, which unfortunately limits eBPF use to only privileged users. To\u0000improve run-time safety of the framework, we introduce SafeBPF, a general\u0000design that isolates eBPF programs from the rest of the kernel to prevent\u0000memory safety vulnerabilities from being exploited. We present a pure software\u0000implementation using a Software-based Fault Isolation (SFI) approach and a\u0000hardware-assisted implementation that leverages ARM's Memory Tagging Extension\u0000(MTE). We show that SafeBPF incurs up to 4% overhead on macrobenchmarks while\u0000achieving desired security properties.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142219778","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chuqi Zhang, Jun Zeng, Yiming Zhang, Adil Ahmad, Fengwei Zhang, Hai Jin, Zhenkai Liang
Protecting system observability records (logs) from compromised OSs has gained significant traction in recent times, with several noteworthy approaches proposed. Unfortunately, none of the proposed approaches achieve high performance with tiny log protection delays. They also rely on risky environments for protection (e.g., many use general-purpose hypervisors or TrustZone, which have large TCBs and attack surfaces). HitchHiker is an attempt to rectify this problem. The system is designed to ensure (a) in-memory protection of batched logs within a short and configurable real-time deadline via efficient hardware permission switching, and (b) an end-to-end high-assurance environment built upon hardware protection primitives with debloating strategies for secure log protection, persistence, and management. Security evaluations and validations show that HitchHiker reduces log protection delay by 93.3--99.3% compared to the state of the art, while reducing TCB by 9.4--26.9X. Performance evaluations show HitchHiker incurs a geometric-mean overhead of less than 6% on diverse real-world programs, improving on the state-of-the-art approach by 61.9--77.5%.
{"title":"The HitchHiker's Guide to High-Assurance System Observability Protection with Efficient Permission Switches","authors":"Chuqi Zhang, Jun Zeng, Yiming Zhang, Adil Ahmad, Fengwei Zhang, Hai Jin, Zhenkai Liang","doi":"arxiv-2409.04484","DOIUrl":"https://doi.org/arxiv-2409.04484","url":null,"abstract":"Protecting system observability records (logs) from compromised OSs has\u0000gained significant traction in recent times, with several note-worthy\u0000approaches proposed. Unfortunately, none of the proposed approaches achieve\u0000high performance with tiny log protection delays. They also leverage risky\u0000environments for protection (eg many use general-purpose hypervisors or\u0000TrustZone, which have large TCB and attack surfaces). HitchHiker is an attempt\u0000to rectify this problem. The system is designed to ensure (a) in-memory\u0000protection of batched logs within a short and configurable real-time deadline\u0000by efficient hardware permission switching, and (b) an end-to-end\u0000high-assurance environment built upon hardware protection primitives with\u0000debloating strategies for secure log protection, persistence, and management.\u0000Security evaluations and validations show that HitchHiker reduces log\u0000protection delay by 93.3--99.3% compared to the state-of-the-art, while\u0000reducing TCB by 9.4--26.9X. Performance evaluations show HitchHiker incurs a\u0000geometric mean of less than 6% overhead on diverse real-world programs,\u0000improving on the state-of-the-art approach by 61.9--77.5%.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"53 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142227307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Although best-fit is known to be slow, it excels at optimizing memory space utilization. Interestingly, by keeping the free memory region at the top of memory, allocation and deallocation become approximately 34.86% faster while external fragmentation is kept at a minimum.
{"title":"Head-First Memory Allocation on Best-Fit with Space-Fitting","authors":"Adam Noto Hakarsa","doi":"arxiv-2409.03488","DOIUrl":"https://doi.org/arxiv-2409.03488","url":null,"abstract":"Although best-fit is known to be slow, it excels at optimizing memory space\u0000utilization. Interestingly, by keeping the free memory region at the top of the\u0000memory, the process of memory allocation and deallocation becomes approximately\u000034.86% faster while also maintaining external fragmentation at minimum.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"72 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142219779","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient virtualization of CPU and memory is standardized and mature. Capabilities such as Intel VT-x [3] have been added by manufacturers for efficient hypervisor support. In contrast, virtualization of a block device and its presentation to the virtual machines on the host can be done in multiple ways. Indeed, hyperscalers develop in-house solutions to improve the performance and cost-efficiency of their datacenter storage. Unfortunately, these storage solutions are based on specialized hardware and software which are not publicly available. The traditional solution is to expose a virtual block device to the VM through a paravirtualized driver like virtio [2]. virtio provides significantly better performance than emulation of a real block device driver because the host OS and guest OS cooperate. The IO requests are then fulfilled by the host OS, either with a local block device such as an SSD drive or with some form of disaggregated storage over the network, like NVMe-oF or iSCSI. There are three main problems with the traditional solution. 1) Cost: IO operations consume host CPU cycles due to host OS involvement, and these cycles do useless work from the application's point of view. 2) Inflexibility: any change to the virtualized storage stack requires host OS and/or guest OS cooperation and cannot be done silently in production. 3) Performance: IO operations cause recurring VM EXITs to transition from non-root mode to root mode on the host CPU, which has an excessive impact on IO performance. We propose FlexBSO, a hardware-assisted solution that solves all of the mentioned issues. Our prototype is based on the publicly available Bluefield-2 SmartNIC with NVIDIA SNAP support, hence it can be deployed without any obstacles.
{"title":"FlexBSO: Flexible Block Storage Offload for Datacenters","authors":"Vojtech Aschenbrenner, John Shawger, Sadman Sakib","doi":"arxiv-2409.02381","DOIUrl":"https://doi.org/arxiv-2409.02381","url":null,"abstract":"Efficient virtualization of CPU and memory is standardized and mature.\u0000Capabilities such as Intel VT-x [3] have been added by manufacturers for\u0000efficient hypervisor support. In contrast, virtualization of a block device and\u0000its presentation to the virtual machines on the host can be done in multiple\u0000ways. Indeed, hyperscalers develop in-house solutions to improve performance\u0000and cost-efficiency of their storage solutions for datacenters. Unfortunately,\u0000these storage solutions are based on specialized hardware and software which\u0000are not publicly available. The traditional solution is to expose virtual block\u0000device to the VM through a paravirtualized driver like virtio [2]. virtio\u0000provides significantly better performance than real block device driver\u0000emulation because of host OS and guest OS cooperation. The IO requests are then\u0000fulfilled by the host OS either with a local block device such as an SSD drive\u0000or with some form of disaggregated storage over the network like NVMe-oF or\u0000iSCSI. There are three main problems to the traditional solution. 1) Cost. IO\u0000operations consume host CPU cycles due to host OS involvement. These CPU cycles\u0000are doing useless work from the application point of view. 2) Inflexibility.\u0000Any change of the virtualized storage stack requires host OS and/or guest OS\u0000cooperation and cannot be done silently in production. 3) Performance. IO\u0000operations are causing recurring VM EXITs to do the transition from non-root\u0000mode to root mode on the host CPU. This results into excessive IO performance\u0000impact. We propose FlexBSO, a hardware-assisted solution, which solves all the\u0000mentioned issues. Our prototype is based on the publicly available Bluefield-2\u0000SmartNIC with NVIDIA SNAP support, hence can be deployed without any obstacles.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142219782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Guanzhou Hu, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau
We introduce explicit speculation, a variant of the I/O speculation technique in which I/O system calls can be parallelized under the guidance of explicit application code knowledge. We propose a formal abstraction -- the foreaction graph -- which describes the exact pattern of I/O system calls in an application function as well as any computation needed to produce their argument values. I/O system calls can be issued ahead of time if the graph says it is safe and beneficial to do so. With explicit speculation, serial applications can exploit storage I/O parallelism without involving expensive prediction or checkpointing mechanisms. Based on explicit speculation, we implement Foreactor, a library framework that allows application developers to concretize foreaction graphs and enable concurrent I/O with little or no modification to application source code. Experimental results show that Foreactor improves the performance of both synthetic benchmarks and real applications by significant margins (29%-50%).
{"title":"Foreactor: Exploiting Storage I/O Parallelism with Explicit Speculation","authors":"Guanzhou Hu, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau","doi":"arxiv-2409.01580","DOIUrl":"https://doi.org/arxiv-2409.01580","url":null,"abstract":"We introduce explicit speculation, a variant of I/O speculation technique\u0000where I/O system calls can be parallelized under the guidance of explicit\u0000application code knowledge. We propose a formal abstraction -- the foreaction\u0000graph -- which describes the exact pattern of I/O system calls in an\u0000application function as well as any necessary computation associated to produce\u0000their argument values. I/O system calls can be issued ahead of time if the\u0000graph says it is safe and beneficial to do so. With explicit speculation,\u0000serial applications can exploit storage I/O parallelism without involving\u0000expensive prediction or checkpointing mechanisms. Based on explicit speculation, we implement Foreactor, a library framework\u0000that allows application developers to concretize foreaction graphs and enable\u0000concurrent I/O with little or no modification to application source code.\u0000Experimental results show that Foreactor is able to improve the performance of\u0000both synthetic benchmarks and real applications by significant amounts\u0000(29%-50%).","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"60 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142219798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}