
arXiv - CS - Operating Systems: Latest Publications

Skip TLB flushes for reused pages within mmap's
Pub Date : 2024-09-17 DOI: arxiv-2409.10946
Frederic Schimmelpfennig, André Brinkmann, Hossein Asadi, Reza Salkhordeh
Memory access efficiency is significantly enhanced by caching recent address translations in the CPUs' Translation Lookaside Buffers (TLBs). However, since the operating system is not aware of which core is using a particular mapping, it flushes TLB entries across all cores where the application runs whenever addresses are unmapped, ensuring security and consistency. These TLB flushes, known as TLB shootdowns, are costly and create a performance and scalability bottleneck. A key contributor to TLB shootdowns is memory-mapped I/O, particularly during mmap-munmap cycles and page cache evictions. Often, the same physical pages are reassigned to the same process post-eviction, presenting an opportunity for the operating system to reduce the frequency of TLB shootdowns. We demonstrate that, by slightly extending the mmap function, TLB shootdowns for these "recycled pages" can be avoided. We therefore introduce and implement the "fast page recycling" (FPR) feature within the mmap system call. FPR-mmaps maintain security by only triggering TLB shootdowns when a page exits its recycling cycle and is allocated to a different process. To ensure consistency when FPR-mmap pointers are used, we made minor adjustments to virtual memory management to avoid the ABA problem. Unlike previous methods to mitigate shootdown effects, our approach does not require any hardware modifications and operates transparently within the existing Linux virtual memory framework. Our evaluations across a variety of CPU, memory, and storage setups, including persistent memory and Optane SSDs, demonstrate that FPR delivers notable performance gains, with improvements of up to 28% in real-world applications and 92% in micro-benchmarks. Additionally, we show that TLB shootdowns are a significant source of bottlenecks, previously misattributed to other components of the Linux kernel.
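The abstract only outlines the interface change, so the following is a minimal sketch of how an application might opt into such behavior; the flag name MAP_FPR and its value are assumptions for illustration, not the paper's actual API, and on a stock kernel the extra bit has no effect.

```c
/* Hedged sketch: opting a mapping into hypothetical fast page recycling. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MAP_FPR
#define MAP_FPR 0x800000   /* hypothetical flag value, for illustration only */
#endif

int main(void)
{
    size_t len = 4096 * 16;

    /* First mapping: pages are faulted in and cached in the TLBs. */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FPR, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }
    memset(buf, 0xab, len);

    /* munmap would normally trigger a TLB shootdown on every core the
     * process ran on; in the paper's design the shootdown is deferred
     * while the pages stay inside this process's recycling cycle. */
    munmap(buf, len);

    /* Re-mapping shortly afterwards can then reuse the same physical
     * pages without a shootdown. */
    buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS | MAP_FPR, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }
    printf("remapped at %p\n", (void *)buf);
    munmap(buf, len);
    return 0;
}
```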
Citations: 0
Analysis of Synchronization Mechanisms in Operating Systems
Pub Date : 2024-09-17 DOI: arxiv-2409.11271
Oluwatoyin Kode, Temitope Oyemade
This research analyzed the performance and consistency of four synchronization mechanisms (reentrant locks, semaphores, synchronized methods, and synchronized blocks) across three operating systems: macOS, Windows, and Linux. Synchronization ensures that concurrent processes or threads access shared resources safely, and efficient synchronization is vital for maintaining system performance and reliability. The study aimed to identify the synchronization mechanism that balances efficiency, measured by execution time, and consistency, assessed by variance and standard deviation, across platforms. The initial hypothesis proposed that mutex-based mechanisms, specifically synchronized methods and blocks, would be the most efficient due to their simplicity. However, empirical results showed that reentrant locks had the lowest average execution time (14.67ms), making them the most efficient mechanism, but with the highest variability (standard deviation of 1.15). In contrast, synchronized methods, blocks, and semaphores exhibited higher average execution times (16.33ms for methods and 16.67ms for blocks) but with greater consistency (variance of 0.33). The findings indicated that while reentrant locks were faster, they were more platform-dependent, whereas mutex-based mechanisms provided more predictable performance across all operating systems. The use of virtual machines for Windows and Linux was a limitation, potentially affecting the results. Future research should include native testing and explore additional synchronization mechanisms and higher concurrency levels. These insights help developers and system designers optimize synchronization strategies for either performance or stability, depending on the application's requirements.
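The study's benchmarks are Java-based (reentrant locks and the synchronized keyword); as a rough analogy only, the sketch below times a pthread mutex against a POSIX semaphore in C, mirroring the spirit of the execution-time comparison. The iteration count and trivial workload are arbitrary assumptions. Build with -pthread.

```c
/* Analogous C illustration (not the paper's Java harness): time a critical
 * section guarded by a mutex versus a counting semaphore used as a lock. */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <time.h>

#define ITERS 1000000

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t sem;
static volatile long counter;

static double elapsed_ms(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("mutex:     %.2f ms\n", elapsed_ms(t0, t1));

    sem_init(&sem, 0, 1);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ITERS; i++) {
        sem_wait(&sem);
        counter++;
        sem_post(&sem);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("semaphore: %.2f ms\n", elapsed_ms(t0, t1));
    return 0;
}
```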
Citations: 0
eBPF-mm: Userspace-guided memory management in Linux with eBPF
Pub Date : 2024-09-17 DOI: arxiv-2409.11220
Konstantinos Mores, Stratos Psomadakis, Georgios Goumas
We leverage eBPF in order to implement custom policies in the Linux memory subsystem. Inspired by CBMM, we create a mechanism that provides the kernel with hints regarding the benefit of promoting a page to a specific size. We introduce a new hook point in the Linux page fault handling path for eBPF programs, providing them the necessary context to determine the page size to be used. We then develop a framework that allows users to define profiles for their applications and load them into the kernel. A profile consists of memory regions of interest and their expected benefit from being backed by 4KB, 64KB and 2MB pages. In our evaluation, we profiled our workloads to identify hot memory regions using DAMON.
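A hedged sketch of what the BPF-program side of such a policy could look like. The attach point (fentry on handle_mm_fault) and the hint mechanism are assumptions standing in for the paper's dedicated page-fault hook, whose context and return value actually select the page size; building this requires clang, libbpf headers, and a bpftool-generated vmlinux.h.

```c
/* Hedged BPF-side sketch of a page-size policy driven by a userspace profile. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* Hypothetical profile entry loaded from userspace: one hot region that
 * is expected to benefit from 2MB backing. */
struct region { __u64 start; __u64 end; };

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, struct region);
} hot_region SEC(".maps");

SEC("fentry/handle_mm_fault")
int BPF_PROG(page_size_hint, struct vm_area_struct *vma, unsigned long address)
{
    __u32 key = 0;
    struct region *r = bpf_map_lookup_elem(&hot_region, &key);

    /* In the paper's design the hook would return (or record) the hint;
     * here we merely log which size the profile would request. */
    if (r && address >= r->start && address < r->end)
        bpf_printk("eBPF-mm hint: back %lx with a 2MB page", address);
    else
        bpf_printk("eBPF-mm hint: back %lx with a 4KB page", address);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```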
Citations: 0
BULKHEAD: Secure, Scalable, and Efficient Kernel Compartmentalization with PKS
Pub Date : 2024-09-15 DOI: arxiv-2409.09606
Yinggang Guo, Zicheng Wang, Weiheng Bai, Qingkai Zeng, Kangjie Lu
The endless stream of vulnerabilities urgently calls for principled mitigation to confine the effect of exploitation. However, the monolithic architecture of commodity OS kernels, like the Linux kernel, allows an attacker to compromise the entire system by exploiting a vulnerability in any kernel component. Kernel compartmentalization is a promising approach that follows the least-privilege principle. However, existing mechanisms struggle with the trade-off on security, scalability, and performance, given the challenges stemming from mutual untrustworthiness among numerous and complex components. In this paper, we present BULKHEAD, a secure, scalable, and efficient kernel compartmentalization technique that offers bi-directional isolation for unlimited compartments. It leverages Intel's new hardware feature PKS to isolate data and code into mutually untrusted compartments and benefits from its fast compartment switching. With untrust in mind, BULKHEAD introduces a lightweight in-kernel monitor that enforces multiple important security invariants, including data integrity, execute-only memory, and compartment interface integrity. In addition, it provides a locality-aware two-level scheme that scales to unlimited compartments. We implement a prototype system on Linux v6.1 to compartmentalize loadable kernel modules (LKMs). Extensive evaluation confirms the effectiveness of our approach. In terms of system-wide impact, BULKHEAD incurs an average performance overhead of 2.44% for real-world applications with 160 compartmentalized LKMs. When focusing on a specific compartment, ApacheBench tests on ipv6 show an overhead of less than 2%. Moreover, the performance is almost unaffected by the number of compartments, which makes it highly scalable.
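PKS is a supervisor-mode feature programmed from the kernel, so it cannot be shown directly in portable user code. As a rough analogy, the sketch below uses the userspace sibling (Memory Protection Keys, pkeys) to illustrate the kind of cheap permission switching the design relies on; it assumes an MPK-capable x86 CPU and glibc 2.27+.

```c
/* Userspace pkey analogy for the fast permission switching used by PKS. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 4096;
    char *compartment = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (compartment == MAP_FAILED) { perror("mmap"); return 1; }

    int pkey = pkey_alloc(0, 0);
    if (pkey < 0) { perror("pkey_alloc"); return 1; }

    /* Tag the compartment's memory with the protection key. */
    if (pkey_mprotect(compartment, len, PROT_READ | PROT_WRITE, pkey)) {
        perror("pkey_mprotect");
        return 1;
    }

    /* Cheap, per-thread permission switch: no page-table change and no TLB
     * flush, just a register update (WRPKRU here; the PKRS MSR under PKS). */
    pkey_set(pkey, PKEY_DISABLE_ACCESS);   /* leave the compartment */
    pkey_set(pkey, 0);                     /* enter it again */
    compartment[0] = 42;
    printf("compartment accessible again: %d\n", compartment[0]);
    return 0;
}
```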
Citations: 0
Rethinking Programmed I/O for Fast Devices, Cheap Cores, and Coherent Interconnects
Pub Date : 2024-09-12 DOI: arxiv-2409.08141
Anastasiia Ruzhanskaia, Pengcheng Xu, David Cock, Timothy Roscoe
Conventional wisdom holds that an efficient interface between an OS running on a CPU and a high-bandwidth I/O device should be based on Direct Memory Access (DMA), descriptor rings, and interrupts: DMA offloads transfers from the CPU, descriptor rings provide buffering and queuing, and interrupts facilitate asynchronous interaction between cores and device with a lightweight notification mechanism. In this paper we question this wisdom in the light of modern hardware and workloads, particularly in cloud servers. We argue that the assumptions that led to this model are obsolete, and in many use-cases use of programmed I/O, where the CPU explicitly transfers data and control information to and from a device via loads and stores, actually results in a more efficient system. We quantitatively demonstrate these advantages using three use-cases: fine-grained RPC-style invocation of functions on an accelerator, offloading of operators in a streaming dataflow engine, and a network interface targeting serverless functions. Moreover, we show that while these advantages are significant over a modern PCIe peripheral bus, a truly cache-coherent interconnect offers significant additional efficiency gains.
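A minimal sketch of the programmed-I/O pattern the authors advocate: the CPU moves the payload with ordinary stores and then "rings a doorbell", instead of posting a DMA descriptor and waiting for an interrupt. The device window here is faked with an anonymous mapping; a real driver would map the device BAR or a coherent region and use the appropriate memory attributes and barriers.

```c
/* PIO-style sketch: CPU stores into a (stand-in) device window. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

struct msg { uint32_t len; uint8_t payload[60]; };

int main(void)
{
    /* Stand-in for an MMIO / coherent window exposed by the device. */
    volatile struct msg *win = mmap(NULL, sizeof(struct msg),
                                    PROT_READ | PROT_WRITE,
                                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (win == MAP_FAILED) { perror("mmap"); return 1; }

    const char *data = "hello, device";

    /* Programmed I/O: copy the payload with CPU stores ... */
    memcpy((void *)win->payload, data, strlen(data) + 1);
    __atomic_thread_fence(__ATOMIC_RELEASE);   /* order payload before doorbell */
    /* ... then "ring the doorbell" by writing the length field last. */
    win->len = (uint32_t)(strlen(data) + 1);

    printf("posted %u bytes via PIO-style stores\n", win->len);
    return 0;
}
```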
Citations: 0
SafeBPF: Hardware-assisted Defense-in-depth for eBPF Kernel Extensions
Pub Date : 2024-09-11 DOI: arxiv-2409.07508
Soo Yee Lim, Tanya Prasad, Xueyuan Han, Thomas Pasquier
The eBPF framework enables execution of user-provided code in the Linux kernel. In the last few years, a large ecosystem of cloud services has leveraged eBPF to enhance container security, system observability, and network management. Meanwhile, incessant discoveries of memory safety vulnerabilities have left the systems community with no choice but to disallow unprivileged eBPF programs, which unfortunately limits eBPF use to only privileged users. To improve run-time safety of the framework, we introduce SafeBPF, a general design that isolates eBPF programs from the rest of the kernel to prevent memory safety vulnerabilities from being exploited. We present a pure software implementation using a Software-based Fault Isolation (SFI) approach and a hardware-assisted implementation that leverages ARM's Memory Tagging Extension (MTE). We show that SafeBPF incurs up to 4% overhead on macrobenchmarks while achieving desired security properties.
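A small illustration of the address-masking flavor of Software-based Fault Isolation that a pure-software design like this can build on; the sandbox size, placement, and masking scheme are assumptions for the sketch, not SafeBPF's actual layout.

```c
/* SFI sketch: clamp untrusted addresses into a dedicated sandbox region. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define SANDBOX_BITS 20                       /* 1 MiB sandbox (assumed size) */
#define SANDBOX_SIZE (1UL << SANDBOX_BITS)

static uint8_t *sandbox;

/* Force any address into [sandbox, sandbox + SANDBOX_SIZE). */
static inline uint8_t *sfi_clamp(uint64_t untrusted)
{
    return sandbox + (untrusted & (SANDBOX_SIZE - 1));
}

int main(void)
{
    sandbox = aligned_alloc(SANDBOX_SIZE, SANDBOX_SIZE);
    if (!sandbox) return 1;

    /* Even a wild "pointer" computed by untrusted code stays inside the
     * sandbox after masking, so stray writes cannot reach other state. */
    uint64_t wild = 0xdeadbeefcafeULL;
    uint8_t *p = sfi_clamp(wild);
    *p = 7;

    printf("clamped %#llx -> offset %zu inside sandbox\n",
           (unsigned long long)wild, (size_t)(p - sandbox));
    free(sandbox);
    return 0;
}
```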
Citations: 0
The HitchHiker's Guide to High-Assurance System Observability Protection with Efficient Permission Switches
Pub Date : 2024-09-06 DOI: arxiv-2409.04484
Chuqi Zhang, Jun Zeng, Yiming Zhang, Adil Ahmad, Fengwei Zhang, Hai Jin, Zhenkai Liang
Protecting system observability records (logs) from compromised OSs has gained significant traction in recent times, with several noteworthy approaches proposed. Unfortunately, none of the proposed approaches achieve high performance with tiny log protection delays. They also leverage risky environments for protection (e.g., many use general-purpose hypervisors or TrustZone, which have large TCBs and attack surfaces). HitchHiker is an attempt to rectify this problem. The system is designed to ensure (a) in-memory protection of batched logs within a short and configurable real-time deadline by efficient hardware permission switching, and (b) an end-to-end high-assurance environment built upon hardware protection primitives with debloating strategies for secure log protection, persistence, and management. Security evaluations and validations show that HitchHiker reduces log protection delay by 93.3--99.3% compared to the state-of-the-art, while reducing TCB by 9.4--26.9X. Performance evaluations show HitchHiker incurs a geometric mean of less than 6% overhead on diverse real-world programs, improving on the state-of-the-art approach by 61.9--77.5%.
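As a rough userspace analogy of the permission-switch idea, the sketch below batches log records into a buffer and then seals it read-only with mprotect; HitchHiker's actual mechanism uses much cheaper hardware switches below the OS, so this only illustrates the sealing step.

```c
/* Analogy only: seal a batch of log records against later tampering. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 4096;
    char *logbuf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (logbuf == MAP_FAILED) { perror("mmap"); return 1; }

    /* Batch a record (content is illustrative), then seal the page. */
    strcpy(logbuf, "audit: uid=0 exec /usr/bin/ssh\n");
    if (mprotect(logbuf, len, PROT_READ)) { perror("mprotect"); return 1; }

    printf("sealed log batch: %s", logbuf);
    /* Any write to logbuf after this point would fault (SIGSEGV). */
    return 0;
}
```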
Citations: 0
Head-First Memory Allocation on Best-Fit with Space-Fitting
Pub Date : 2024-09-05 DOI: arxiv-2409.03488
Adam Noto Hakarsa
Although best-fit is known to be slow, it excels at optimizing memory space utilization. Interestingly, by keeping the free memory region at the top of the memory, the process of memory allocation and deallocation becomes approximately 34.86% faster while also maintaining external fragmentation at a minimum.
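A minimal best-fit free-list allocator to make the baseline concrete; the paper's head-first placement of the free region at the top of memory is a policy layered on top of this and is not reproduced here. Pool size, alignment, and split threshold are arbitrary assumptions.

```c
/* Minimal best-fit allocator over a static pool (coalescing omitted). */
#include <stddef.h>
#include <stdio.h>

#define POOL_SIZE 4096

static _Alignas(max_align_t) unsigned char pool[POOL_SIZE];

struct block {
    size_t size;          /* usable bytes in this block */
    int    free;
    struct block *next;
};

static struct block *head;

static void heap_init(void)
{
    head = (struct block *)pool;
    head->size = POOL_SIZE - sizeof(struct block);
    head->free = 1;
    head->next = NULL;
}

static void *bf_alloc(size_t n)
{
    n = (n + 15) & ~(size_t)15;               /* keep blocks aligned */
    struct block *best = NULL;

    /* Best fit: the smallest free block that is still large enough. */
    for (struct block *b = head; b; b = b->next)
        if (b->free && b->size >= n && (!best || b->size < best->size))
            best = b;
    if (!best)
        return NULL;

    /* Split if the remainder can hold another header plus some payload. */
    if (best->size >= n + sizeof(struct block) + 16) {
        struct block *rest = (struct block *)((unsigned char *)(best + 1) + n);
        rest->size = best->size - n - sizeof(struct block);
        rest->free = 1;
        rest->next = best->next;
        best->size = n;
        best->next = rest;
    }
    best->free = 0;
    return best + 1;
}

static void bf_free(void *p)
{
    if (p)
        ((struct block *)p - 1)->free = 1;    /* coalescing omitted for brevity */
}

int main(void)
{
    heap_init();
    void *a = bf_alloc(100), *b = bf_alloc(200);
    printf("a=%p b=%p\n", a, b);
    bf_free(a);
    bf_free(b);
    return 0;
}
```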
Citations: 0
FlexBSO: Flexible Block Storage Offload for Datacenters
Pub Date : 2024-09-04 DOI: arxiv-2409.02381
Vojtech Aschenbrenner, John Shawger, Sadman Sakib
Efficient virtualization of CPU and memory is standardized and mature. Capabilities such as Intel VT-x [3] have been added by manufacturers for efficient hypervisor support. In contrast, virtualization of a block device and its presentation to the virtual machines on the host can be done in multiple ways. Indeed, hyperscalers develop in-house solutions to improve performance and cost-efficiency of their storage solutions for datacenters. Unfortunately, these storage solutions are based on specialized hardware and software which are not publicly available. The traditional solution is to expose a virtual block device to the VM through a paravirtualized driver like virtio [2]. virtio provides significantly better performance than real block device driver emulation because of host OS and guest OS cooperation. The IO requests are then fulfilled by the host OS either with a local block device such as an SSD drive or with some form of disaggregated storage over the network like NVMe-oF or iSCSI. There are three main problems with the traditional solution. 1) Cost. IO operations consume host CPU cycles due to host OS involvement. These CPU cycles are doing useless work from the application point of view. 2) Inflexibility. Any change of the virtualized storage stack requires host OS and/or guest OS cooperation and cannot be done silently in production. 3) Performance. IO operations cause recurring VM EXITs to do the transition from non-root mode to root mode on the host CPU. This results in excessive IO performance impact. We propose FlexBSO, a hardware-assisted solution, which solves all the mentioned issues. Our prototype is based on the publicly available Bluefield-2 SmartNIC with NVIDIA SNAP support, hence it can be deployed without any obstacles.
Citations: 0
Foreactor: Exploiting Storage I/O Parallelism with Explicit Speculation
Pub Date : 2024-09-03 DOI: arxiv-2409.01580
Guanzhou Hu, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau
We introduce explicit speculation, a variant of the I/O speculation technique where I/O system calls can be parallelized under the guidance of explicit application code knowledge. We propose a formal abstraction -- the foreaction graph -- which describes the exact pattern of I/O system calls in an application function as well as any necessary computation associated with producing their argument values. I/O system calls can be issued ahead of time if the graph says it is safe and beneficial to do so. With explicit speculation, serial applications can exploit storage I/O parallelism without involving expensive prediction or checkpointing mechanisms. Based on explicit speculation, we implement Foreactor, a library framework that allows application developers to concretize foreaction graphs and enable concurrent I/O with little or no modification to application source code. Experimental results show that Foreactor is able to improve the performance of both synthetic benchmarks and real applications by significant amounts (29%-50%).
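The underlying "issue ahead of time" pattern can be illustrated with io_uring: an I/O request that a (hypothetical) foreaction graph predicts is submitted before the surrounding computation runs and reaped only when the result is needed. This is not Foreactor's API; the file path is an arbitrary example and the program links against liburing (-luring).

```c
/* Sketch of pre-issued I/O overlapping with unrelated computation. */
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0) return 1;

    int fd = open("/etc/hostname", O_RDONLY);   /* example path, assumed to exist */
    if (fd < 0) { perror("open"); return 1; }

    static char buf[256];

    /* Speculatively issue the read now ... */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf) - 1, 0);
    io_uring_submit(&ring);

    /* ... do unrelated computation while the I/O proceeds in parallel ... */
    long acc = 0;
    for (long i = 0; i < 1000000; i++) acc += i;

    /* ... and reap the completion only when the data is actually needed. */
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    if (cqe->res > 0) buf[cqe->res] = '\0';
    printf("acc=%ld, read %d bytes: %s", acc, cqe->res, buf);
    io_uring_cqe_seen(&ring, cqe);

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
```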
Citations: 0