Bin Gao, Qingxuan Kang, Hao-Wei Tee, Kyle Timothy Ng Chu, Alireza Sanaee, Djordje Jevdjic
arXiv - CS - Operating Systems, 2024-01-28, doi: arxiv-2401.15558 (https://doi.org/arxiv-2401.15558)
numaPTE: Managing Page-Tables and TLBs on NUMA Systems
Memory management operations that modify page-tables, typically performed
during memory allocation/deallocation, are infamous for their poor performance
in highly threaded applications, largely due to process-wide TLB shootdowns
that the OS must issue because hardware provides no support for TLB coherence.
We study these operations in NUMA settings, where we observe up to 40x overhead
for basic operations such as munmap or mprotect. The overhead further increases
if page-table replication is used, where complete coherent copies of the
page-tables are maintained across all NUMA nodes. While eager system-wide
replication is extremely effective at localizing page-table reads during
address translation, we find that it creates additional penalties upon any
page-table changes due to the need to keep all replicas coherent.

In this paper, we propose a novel page-table management mechanism, called
numaPTE, to enable transparent, on-demand, and partial page-table replication
across NUMA nodes in order to perform address translation locally, while
avoiding the overheads and scalability issues of system-wide full page-table
replication. We then show that numaPTE's precise knowledge of page-table
sharers can be leveraged to significantly reduce the number of TLB shootdowns
issued upon any memory-management operation. As a result, numaPTE not only
avoids replication-related slowdowns, but also provides significant speedup
over the baseline on memory allocation/deallocation and access control
operations. We implement numaPTE in Linux on x86_64, evaluate it on 4- and
8-socket systems, and show that numaPTE achieves the full benefits of eager
page-table replication on a wide range of applications, while also achieving
12% and 36% runtime improvements on Webserver and Memcached, respectively, due to
a significant reduction in TLB shootdowns.
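
The idea that knowledge of page-table sharers can narrow the set of TLB shootdown targets can be illustrated with a toy model. The sketch below is not numaPTE's implementation: the class `ToyAddressSpace`, its methods, and the core and node layout are all hypothetical, and it assumes that a core can hold a cached translation only if its NUMA node has replicated the corresponding page-table page on demand.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyAddressSpace:
    """Toy model (hypothetical) of per-node page-table replicas for one process."""

    # NUMA node id -> ids of cores on that node running the process's threads
    cores_per_node: dict[int, set[int]]
    # page-table page id -> NUMA nodes that currently hold a replica of it
    sharers: dict[int, set[int]] = field(default_factory=dict)

    def touch(self, node: int, pt_page: int) -> None:
        """Model an on-demand replication: a page walk on `node` pulls in a replica."""
        self.sharers.setdefault(pt_page, set()).add(node)

    def shootdown_targets_process_wide(self, initiator: int) -> set[int]:
        """Baseline: interrupt every core running the process except the initiator."""
        all_cores: set[int] = set()
        for cores in self.cores_per_node.values():
            all_cores |= cores
        return all_cores - {initiator}

    def shootdown_targets_sharers(self, initiator: int, pt_page: int) -> set[int]:
        """Sharer-aware: interrupt only cores on nodes that replicated `pt_page`."""
        targets: set[int] = set()
        for node in self.sharers.get(pt_page, set()):
            targets |= self.cores_per_node[node]
        return targets - {initiator}


if __name__ == "__main__":
    # Assumed layout: 4 NUMA nodes with 2 cores each.
    space = ToyAddressSpace({0: {0, 1}, 1: {2, 3}, 2: {4, 5}, 3: {6, 7}})
    # Only nodes 0 and 1 ever walked page-table page 42.
    space.touch(0, 42)
    space.touch(1, 42)
    print(len(space.shootdown_targets_process_wide(initiator=0)))
    print(len(space.shootdown_targets_sharers(initiator=0, pt_page=42)))
```

In this assumed configuration, a change to page-table page 42 initiated on core 0 interrupts 7 cores under the process-wide policy but only 3 under the sharer-aware one. The model ignores everything that makes the real problem hard, such as keeping the sharer set precise and the replicas coherent.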