numaPTE: Managing Page-Tables and TLBs on NUMA Systems

arXiv - CS - Operating Systems | Pub Date: 2024-01-28 | DOI: arxiv-2401.15558

Bin Gao, Qingxuan Kang, Hao-Wei Tee, Kyle Timothy Ng Chu, Alireza Sanaee, Djordje Jevdjic


Citations: 0

Abstract

Memory-management operations that modify page-tables, typically performed during memory allocation and deallocation, are infamous for their poor performance in highly threaded applications. The main cause is the process-wide TLB shootdowns that the OS must issue because hardware does not support TLB coherence. We study these operations in NUMA settings, where we observe up to 40x overhead for basic operations such as munmap or mprotect. The overhead increases further if page-table replication is used, where complete, coherent copies of the page-tables are maintained across all NUMA nodes. While eager system-wide replication is extremely effective at localizing page-table reads during address translation, we find that it adds a penalty to every page-table change, because all replicas must be kept coherent. In this paper, we propose a novel page-table management mechanism, called numaPTE, that enables transparent, on-demand, and partial page-table replication across NUMA nodes. It performs address translation locally while avoiding the overheads and scalability issues of system-wide full page-table replication. We then show that numaPTE's precise knowledge of page-table sharers can be leveraged to significantly reduce the number of TLB shootdowns issued upon any memory-management operation. As a result, numaPTE not only avoids replication-related slowdowns, but also provides significant speedup over the baseline on memory allocation/deallocation and access-control operations. We implement numaPTE in Linux on x86_64, evaluate it on 4- and 8-socket systems, and show that numaPTE achieves the full benefits of eager page-table replication on a wide range of applications, while also achieving 12% and 36% runtime improvements on Webserver and Memcached, respectively, due to a significant reduction in TLB shootdowns.
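The idea of using sharer knowledge to narrow TLB shootdowns can be illustrated with a toy model. The sketch below is not numaPTE's implementation, and the abstract does not describe its data structures. The names `PageTablePage`, `ToyAddressSpace`, `touch`, and `shootdown_targets` are hypothetical, and tracking at NUMA-node granularity (rather than per core) is a simplifying assumption. It only shows why recording which nodes hold a replica of a page-table page lets an munmap-style operation notify fewer nodes than a process-wide broadcast would.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PageTablePage:
    """One leaf page-table page in the toy model (hypothetical structure)."""
    # NUMA nodes that hold a replica of this page-table page.
    sharers: set[int] = field(default_factory=set)


class ToyAddressSpace:
    """Toy model: a node becomes a sharer of a page-table page on first access."""

    # A leaf page table on x86_64 with 4 KiB pages holds 512 entries.
    PAGES_PER_TABLE = 512

    def __init__(self, num_nodes: int) -> None:
        self.num_nodes = num_nodes
        self.tables: dict[int, PageTablePage] = {}

    def touch(self, node: int, vpn: int) -> None:
        """Record that `node` translated virtual page number `vpn`."""
        index = vpn // self.PAGES_PER_TABLE
        table = self.tables.setdefault(index, PageTablePage())
        table.sharers.add(node)

    def shootdown_targets(self, vpn_start: int, vpn_end: int) -> set[int]:
        """Nodes that may cache translations for pages in [vpn_start, vpn_end)."""
        targets: set[int] = set()
        first = vpn_start // self.PAGES_PER_TABLE
        last = (vpn_end - 1) // self.PAGES_PER_TABLE
        for index in range(first, last + 1):
            table = self.tables.get(index)
            if table is not None:
                targets |= table.sharers
        return targets

    def broadcast_targets(self) -> set[int]:
        """Baseline in this model: notify every node."""
        return set(range(self.num_nodes))


# Usage: three nodes touch memory on an 8-node system.
space = ToyAddressSpace(num_nodes=8)
space.touch(node=0, vpn=10)
space.touch(node=3, vpn=20)
space.touch(node=5, vpn=5000)

# Unmapping the first 512 pages only concerns the nodes that shared that table.
print(sorted(space.shootdown_targets(0, 512)))  # [0, 3]
print(len(space.broadcast_targets()))           # 8
```

In this model the targeted set never exceeds the broadcast set, and it shrinks to nothing for ranges no node has touched. A real kernel would additionally have to track sharers safely under concurrency and map nodes to the individual cores whose TLBs need invalidation.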


