The semantics of HPC storage systems are defined by the consistency models to which they adhere. Storage consistency models have been less studied than their counterparts in memory systems, with the exception of the POSIX standard and its strict consistency model. POSIX consistency imposes a performance penalty that grows as parallel file systems scale up and as the access time of storage devices, such as node-local solid-state drives, shrinks. While some efforts have been made to adopt relaxed storage consistency models, these models are often defined informally and ambiguously, as by-products of a particular implementation. In this work, we establish a connection between memory consistency models and storage consistency models and revisit the key design choices of storage consistency models from a high-level perspective. Further, we propose a formal and unified framework for defining storage consistency models and a layered implementation that can be used to easily evaluate their relative performance for different I/O workloads. Finally, we conduct a comprehensive performance comparison of two relaxed consistency models on a range of common parallel I/O workloads, such as checkpoint/restart in scientific applications and random reads in deep learning applications. We demonstrate that for certain I/O scenarios, a weaker consistency model can significantly improve I/O performance. For instance, on the small random reads typically found in deep learning applications, session consistency achieved a 5x improvement in I/O bandwidth over commit consistency, even at small scales.
"Formal Definitions and Performance Comparison of Consistency Models for Parallel File Systems." Chen Wang, Kathryn Mohror, Marc Snir. arXiv - CS - Operating Systems, arXiv:2402.14105, published 2024-02-21. https://doi.org/arxiv-2402.14105
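The commit-vs-session distinction in the abstract above can be illustrated with a toy model. This is a purely illustrative sketch under our own assumptions; the class and method names are not the paper's framework:

```python
class ToyFile:
    """Toy shared file contrasting two relaxed storage consistency models.

    - Commit consistency: a client's writes become visible to other clients
      only after an explicit commit operation (e.g. an fsync-like call).
    - Session consistency: a client's writes become visible only to sessions
      opened after the writer closes the file.
    """
    def __init__(self):
        self.published = {}   # state seen by newly opened sessions
        self.sessions = {}    # per-client private view

    def open(self, client):
        self.sessions[client] = dict(self.published)  # snapshot at open time

    def write(self, client, key, value):
        self.sessions[client][key] = value            # buffered, not yet visible

    def commit(self, client):
        self.published.update(self.sessions[client])  # commit-consistency publish

    def close(self, client):
        self.published.update(self.sessions.pop(client))  # session publish

    def read(self, client, key):
        return self.sessions[client].get(key)
```

Under session consistency, a reader whose session was opened before the writer's close never observes the writes; a session opened after the close does.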
Stefano Carnà, Serena Ferracci, Francesco Quaglia, Alessandro Pellegrini
We present a kernel-level infrastructure that allows system-wide detection of malicious applications attempting to exploit cache-based side-channel attacks to break the process confinement enforced by standard operating systems. This infrastructure relies on hardware performance counters to collect information at runtime from all applications running on the machine. High-level detection metrics are derived from these measurements to maximize the likelihood of promptly detecting a malicious application. Our experimental assessment shows that we can catch a large family of side-channel attacks with significantly reduced overhead. We also discuss countermeasures that can be enacted once a process is suspected of carrying out a side-channel attack, improving the overall tradeoff between the system's security level and the performance delivered to non-suspected processes.
"Fight Hardware with Hardware: System-wide Detection and Mitigation of Side-Channel Attacks using Performance Counters." arXiv - CS - Operating Systems, arXiv:2402.13281, published 2024-02-18. https://doi.org/arxiv-2402.13281
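The paper's detection pipeline is kernel-level; as a hedged sketch, one can imagine deriving a high-level metric from raw hardware-counter samples like this (the metric and threshold below are our illustrative assumptions, not the authors' exact formulas):

```python
def miss_ratio(samples):
    """Aggregate per-period (cache_accesses, cache_misses) samples into one metric."""
    accesses = sum(a for a, _ in samples)
    misses = sum(m for _, m in samples)
    return misses / accesses if accesses else 0.0

def is_suspect(samples, threshold=0.6):
    """Flag a process whose cache-miss ratio is far above benign levels.

    Eviction-based attacks (e.g. Prime+Probe) repeatedly flush and refill
    cache sets, which drives the observed miss ratio toward 1.0, well
    beyond what typical benign workloads exhibit."""
    return miss_ratio(samples) > threshold
```

A real detector would combine several counters and calibrate the threshold against benign workloads rather than hard-coding it.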
Juan Carlos Saez, Fernando Castro, Manuel Prieto-Matias
Asymmetric multicore processors (AMPs) couple high-performance big cores and low-power small cores with the same instruction-set architecture but different features, such as clock frequency or microarchitecture. Previous work has shown that asymmetric designs may deliver higher energy efficiency than symmetric multicores for diverse workloads. Despite their benefits, AMPs pose significant challenges to runtime systems of parallel programming models. While previous work has mainly explored how to efficiently execute task-based parallel applications on AMPs via enhancements in the runtime system, improving the performance of unmodified data-parallel applications on these architectures remains a big challenge. In this work we analyze the particular case of loop-based OpenMP applications, which are widely used today in scientific and engineering domains, and constitute the dominant application type in many parallel benchmark suites used for performance evaluation on multicore systems. We observed that conventional OpenMP loop-scheduling approaches are unable to efficiently cope with the load imbalance that naturally stems from the different performance delivered by big and small cores. To address this shortcoming, we propose Asymmetric Iteration Distribution (AID), a set of novel loop-scheduling methods for AMPs that distribute iterations unevenly across worker threads to efficiently deal with performance asymmetry. We implemented AID in libgomp (the GNU OpenMP runtime system) and evaluated it on two different asymmetric multicore platforms. Our analysis reveals that the AID methods constitute effective replacements of the static and dynamic methods on AMPs, and are capable of improving performance over these conventional strategies by up to 56% and 16.8%, respectively.
"Enabling performance portability of data-parallel OpenMP applications on asymmetric multicore processors." arXiv - CS - Operating Systems, arXiv:2402.07664, published 2024-02-12. https://doi.org/arxiv-2402.07664
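A minimal sketch of the uneven distribution idea behind AID. The weighting scheme here is an assumed proportional model for illustration, not one of the paper's actual scheduling methods:

```python
def aid_chunks(n_iters, big_threads, small_threads, big_speedup):
    """Partition n_iters so that each thread on a big core receives
    big_speedup times the iterations of a thread on a small core,
    roughly equalizing per-thread finish times under asymmetry."""
    weights = [big_speedup] * big_threads + [1.0] * small_threads
    total = sum(weights)
    chunks = [int(n_iters * w / total) for w in weights]
    chunks[0] += n_iters - sum(chunks)  # hand rounding leftovers to a big core
    return chunks
```

For 1000 iterations on 2 big-core threads (3x faster) and 4 small-core threads, this yields 300 iterations per big-core thread and 100 per small-core thread, whereas OpenMP's static schedule would give every thread about 167 and leave the small cores as stragglers.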
Large Language Models (LLMs) based on the Mixture-of-Experts (MoE) architecture show promising performance on various tasks. However, running them in resource-constrained settings, where GPU memory is not abundant, is challenging due to their huge model sizes. Existing systems that offload model weights to CPU memory suffer from the significant overhead of frequently moving data between the CPU and GPU. In this paper, we propose Fiddler, a resource-efficient inference engine with CPU-GPU orchestration for MoE models. The key idea of Fiddler is to use the computation ability of the CPU to minimize data movement between the CPU and GPU. Our evaluation shows that Fiddler can run the uncompressed Mixtral-8x7B model, which exceeds 90GB in parameters, generating over 3 tokens per second on a single GPU with 24GB of memory, an order of magnitude improvement over existing methods. The code of Fiddler is publicly available at https://github.com/efeslab/fiddler
"Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models." Keisuke Kamahori, Yile Gu, Kan Zhu, Baris Kasikci. arXiv - CS - Operating Systems, arXiv:2402.07033, published 2024-02-10. https://doi.org/arxiv-2402.07033
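Fiddler's key idea, that CPU compute on resident weights can beat shipping those weights over PCIe, can be sketched as a tiny cost model. Every parameter and the function name below are illustrative assumptions, not the paper's actual policy:

```python
def place_expert(weight_bytes, flops, pcie_gbps, cpu_gflops, gpu_gflops):
    """Choose where to run one offloaded MoE expert for this batch.

    Running it on the GPU requires first copying its weights over PCIe;
    the CPU can start immediately on weights already in host memory, so
    for small batches the slower CPU often finishes sooner."""
    transfer_s = weight_bytes * 8 / (pcie_gbps * 1e9)
    gpu_s = transfer_s + flops / (gpu_gflops * 1e9)
    cpu_s = flops / (cpu_gflops * 1e9)
    return "cpu" if cpu_s < gpu_s else "gpu"
```

With single-token decoding the compute per expert is tiny relative to the weight size, so the transfer term dominates and the CPU wins; once the batch is large enough, paying the transfer and using the GPU wins.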
"Rootless containers" is a technique for running the entire container runtime, and the containers themselves, without root privileges. It protects the host environment from attackers exploiting container runtime vulnerabilities. However, when rootless containers communicate with external endpoints, network performance is low compared to rootful containers because of the overhead of rootless networking components. In this paper, we propose bypass4netns, which accelerates TCP/IP communications in rootless containers by bypassing slow networking components. bypass4netns uses sockets allocated on the host. It switches sockets in containers to the host's sockets by intercepting syscalls and injecting file descriptors using Seccomp. Our Seccomp-based method can handle statically linked applications that previous works could not. We also propose high-performance rootless multi-node communication. We confirmed that rootless containers with bypass4netns achieve more than 30x higher throughput than rootless containers without it. In addition, we evaluated performance with applications and observed large improvements for some of them.
"bypass4netns: Accelerating TCP/IP Communications in Rootless Containers." Naoki Matsumoto, Akihiro Suda. arXiv - CS - Operating Systems, arXiv:2402.00365, published 2024-02-01. https://doi.org/arxiv-2402.00365
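The per-connection decision behind the socket switching can be sketched conceptually. This is a toy model under our own assumptions (the factory arguments and subnet are illustrative); the real mechanism intercepts connect(2) with a Seccomp notifier and injects host file descriptors into the container process:

```python
def route_connect(dest_ip, make_host_socket, make_netns_socket,
                  container_subnet="10.0.2."):
    """Decide which socket serves a container's connect() call.

    Traffic to external endpoints gets a socket created in the host
    namespace, bypassing slow user-mode networking components such as
    slirp4netns; intra-subnet traffic keeps the container netns path so
    container-to-container semantics are preserved."""
    if dest_ip.startswith(container_subnet):
        return make_netns_socket()
    return make_host_socket()
```

In the real system the replacement happens at the fd level, which is why even statically linked applications (no LD_PRELOAD hooks possible) are covered.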
The widespread deployment of control-flow integrity has propelled non-control data attacks into the mainstream. In the domain of OS kernel exploits, by corrupting critical non-control data, local attackers can directly gain root access or privilege escalation without hijacking the control flow. As a result, OS kernels have been restricting the availability of such non-control data. This forces attackers to continue to search for more exploitable non-control data in OS kernels. However, discovering unknown non-control data can be daunting because it is often tied heavily to semantics and lacks universal patterns. We make two contributions in this paper: (1) discovering critical non-control objects in the file subsystem and (2) analyzing their exploitability. This work represents the first study, with minimal domain knowledge, to semi-automatically discover and evaluate exploitable non-control data within the file subsystem of the Linux kernel. Our solution utilizes a custom analysis and testing framework that statically and dynamically identifies promising candidate objects. Furthermore, we categorize these discovered objects into types that are suitable for various exploit strategies, including a novel strategy necessary to overcome the defense that isolates many of these objects. These objects have the advantage of being exploitable without requiring KASLR, thus making the exploits simpler and more reliable. We use 18 real-world CVEs to evaluate the exploitability of the file system objects using various exploit strategies. We develop 10 end-to-end exploits using a subset of CVEs against the kernel with all state-of-the-art mitigations enabled.
"Beyond Control: Exploring Novel File System Objects for Data-Only Attacks on Linux Systems." Jinmeng Zhou, Jiayi Hu, Ziyue Pan, Jiaxun Zhu, Guoren Li, Wenbo Shen, Yulei Sui, Zhiyun Qian. arXiv - CS - Operating Systems, arXiv:2401.17618, published 2024-01-31. https://doi.org/arxiv-2401.17618
Bin Gao, Qingxuan Kang, Hao-Wei Tee, Kyle Timothy Ng Chu, Alireza Sanaee, Djordje Jevdjic
Memory management operations that modify page-tables, typically performed during memory allocation/deallocation, are infamous for their poor performance in highly threaded applications, largely due to process-wide TLB shootdowns that the OS must issue due to the lack of hardware support for TLB coherence. We study these operations in NUMA settings, where we observe up to 40x overhead for basic operations such as munmap or mprotect. The overhead further increases if page-table replication is used, where complete coherent copies of the page-tables are maintained across all NUMA nodes. While eager system-wide replication is extremely effective at localizing page-table reads during address translation, we find that it creates additional penalties upon any page-table changes due to the need to keep all replicas coherent. In this paper, we propose a novel page-table management mechanism, called numaPTE, to enable transparent, on-demand, and partial page-table replication across NUMA nodes in order to perform address translation locally, while avoiding the overheads and scalability issues of system-wide full page-table replication. We then show that numaPTE's precise knowledge of page-table sharers can be leveraged to significantly reduce the number of TLB shootdowns issued upon any memory-management operation. As a result, numaPTE not only avoids replication-related slowdowns, but also provides significant speedup over the baseline on memory allocation/deallocation and access control operations. We implement numaPTE in Linux on x86_64, evaluate it on 4- and 8-socket systems, and show that numaPTE achieves the full benefits of eager page-table replication on a wide range of applications, while also achieving a 12% and 36% runtime improvement on Webserver and Memcached respectively due to a significant reduction in TLB shootdowns.
"numaPTE: Managing Page-Tables and TLBs on NUMA Systems." arXiv - CS - Operating Systems, arXiv:2401.15558, published 2024-01-28. https://doi.org/arxiv-2401.15558
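The sharer-tracking idea, sending shootdowns only to CPUs that may actually cache the affected translation, can be sketched as follows. This is a toy simplification; the class and method names are our assumptions, not numaPTE's interfaces:

```python
class ReplicatedPageTable:
    """Toy model of on-demand, partial page-table replication with
    precise TLB-shootdown targeting."""
    def __init__(self):
        self.sharers = {}  # vaddr -> set of NUMA nodes holding a replica

    def translate(self, vaddr, node):
        # The first access from a node creates its local replica entry on
        # demand, so later translations of vaddr read node-local memory.
        self.sharers.setdefault(vaddr, set()).add(node)

    def unmap(self, vaddr, cpus_per_node):
        # Instead of IPI-ing every CPU running the process (the default
        # for process-wide shootdowns), interrupt only CPUs on nodes
        # known to share this translation.
        nodes = self.sharers.pop(vaddr, set())
        return sorted(c for n in nodes for c in cpus_per_node[n])
```

On a 4-node machine where only two nodes ever touched a page, an unmap here interrupts half the CPUs a process-wide shootdown would.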
GPU remoting is a promising technique for supporting AI applications. Networking plays a key role in enabling remoting. However, for efficient remoting, the network requirements in terms of latency and bandwidth are unknown. In this paper, we take a GPU-centric approach to derive the minimum latency and bandwidth requirements for GPU remoting, while ensuring no (or little) performance degradation for AI applications. Our study, which includes a theoretical model, demonstrates that with careful remoting design, unmodified AI applications can run on the remoting setup using commodity networking hardware without any overhead, or even with better performance, while placing low demands on the network.
"Characterizing Network Requirements for GPU API Remoting in AI Applications." Tianxia Wang, Zhuofu Chen, Xingda Wei, Jinyu Gu, Rong Chen, Haibo Chen. arXiv - CS - Operating Systems, arXiv:2401.13354, published 2024-01-24. https://doi.org/arxiv-2401.13354
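A back-of-the-envelope version of the kind of condition such a study derives (this is not the paper's actual model; every parameter and the pipelining assumption below are ours): with asynchronous, pipelined API forwarding, remoting adds no overhead when the per-call network cost hides behind kernel execution on the remote GPU.

```python
def remoting_overhead_free(kernel_ms, rtt_ms, payload_kb, bw_gbps,
                           pipeline_depth=2):
    """True if forwarding one GPU API call over the network can be fully
    overlapped with the kernel's execution time on the remote GPU."""
    xfer_ms = payload_kb * 8e3 / (bw_gbps * 1e9) * 1e3  # serialize payload
    effective_latency_ms = rtt_ms / pipeline_depth      # partly hidden by pipelining
    return effective_latency_ms + xfer_ms <= kernel_ms
```

Under this crude model, millisecond-scale kernels tolerate datacenter RTTs and modest bandwidth easily, which is consistent with the abstract's claim that commodity networking hardware can suffice.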
With the advent of byte-addressable memory devices, such as CXL memory, persistent memory, and storage-class memory, tiered memory systems have become a reality. Page migration is the de facto method within operating systems for managing tiered memory. It aims to bring hot data whenever possible into fast memory to optimize the performance of data accesses, while using slow memory to accommodate data spilled from fast memory. While existing research has demonstrated the effectiveness of various optimizations on page migration, it falls short of addressing a fundamental question: Is exclusive memory tiering, in which a page is present in either fast memory or slow memory but not both simultaneously, the optimal strategy for tiered memory management? We demonstrate that page migration-based exclusive memory tiering suffers significant performance degradation when fast memory is under pressure. In this paper, we propose non-exclusive memory tiering, a page management strategy that retains a copy of pages recently promoted from slow memory to fast memory to mitigate memory thrashing. To enable non-exclusive memory tiering, we develop MATRYOSHKA, a new mechanism that features transactional page migration and page shadowing. MATRYOSHKA moves page migration off the program's critical path and makes migration asynchronous. Evaluations with microbenchmarks and real-world applications show that MATRYOSHKA achieves a 6x performance improvement over the state-of-the-art transparent page placement (TPP) approach under memory pressure. We also compare MATRYOSHKA with a recently proposed sampling-based migration approach and demonstrate MATRYOSHKA's strengths and potential weaknesses in various scenarios. Through the evaluations, we discover a serious issue facing all tested approaches, unfortunately including MATRYOSHKA, and call for further research on tiered memory-aware memory allocation.
"MATRYOSHKA: Non-Exclusive Memory Tiering via Transactional Page Migration." Lingfeng Xiang, Zhen Lin, Weishu Deng, Hui Lu, Jia Rao, Yifan Yuan, Ren Wang. arXiv - CS - Operating Systems, arXiv:2401.13154, published 2024-01-24. https://doi.org/arxiv-2401.13154
Alex Conway, Ainesh Bakshi, Arghya Bhattacharya, Rory Bennett, Yizheng Jiao, Eric Knorr, Yang Zhan, Michael A. Bender, William Jannen, Rob Johnson, Bradley C. Kuszmaul, Donald E. Porter, Jun Yuan, Martin Farach-Colton
File systems must allocate space for files without knowing what will be added or removed in the future. Over the life of a file system, this may cause suboptimal file placement decisions that eventually lead to slower performance, or aging. Conventional wisdom suggests that file system aging is a solved problem in the common case; heuristics to avoid aging, such as colocating related files and data blocks, are effective until a storage device fills up, at which point space pressure exacerbates fragmentation-based aging. However, this article describes both realistic and synthetic workloads that can cause these heuristics to fail, inducing large performance declines due to aging, even when the storage device is nearly empty.

We argue that these slowdowns are caused by poor layout. We demonstrate a correlation between the read performance of a directory scan and the locality within a file system's access patterns, using a dynamic layout score. We complement these results with microbenchmarks that show that space pressure can cause a substantial amount of inter-file and intra-file fragmentation. However, our results suggest that the effect of free-space fragmentation on read performance is best described as accelerating the file system aging process. The effect on write performance is non-existent in some cases, and, in most cases, an order of magnitude smaller than the read degradation from fragmentation caused by normal usage.

In short, many file systems are exquisitely prone to read aging after a variety of write patterns. We show, however, that aging is not inevitable. BetrFS, a file system based on write-optimized dictionaries, exhibits almost no aging in our experiments. We present a framework for understanding and predicting aging, and identify the key features of BetrFS that avoid aging.
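The dynamic layout score mentioned in the abstract can be sketched as the fraction of blocks in an access sequence that are physically contiguous with the block accessed immediately before them: a score of 1.0 indicates a perfectly sequential scan, and values near 0 indicate heavy fragmentation. This formulation is a plausible reading of the metric, not necessarily the article's exact definition.

```python
# Illustrative dynamic layout score: the fraction of accesses (after the
# first) that land on the physical block immediately following the previous
# access. Assumption: block_sequence lists physical block numbers in the
# order a directory scan reads them.

def dynamic_layout_score(block_sequence):
    if len(block_sequence) < 2:
        return 1.0   # a single access is trivially "sequential"
    contiguous = sum(
        1 for prev, cur in zip(block_sequence, block_sequence[1:])
        if cur == prev + 1
    )
    return contiguous / (len(block_sequence) - 1)
```

A freshly laid-out file system tends to produce sequences like `[10, 11, 12, 13]` (score 1.0), while an aged one scatters related blocks, driving the score toward 0 and read performance down with it.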
{"title":"File System Aging","authors":"Alex Conway, Ainesh Bakshi, Arghya Bhattacharya, Rory Bennett, Yizheng Jiao, Eric Knorr, Yang Zhan, Michael A. Bender, William Jannen, Rob Johnson, Bradley C. Kuszmaul, Donald E. Porter, Jun Yuan, Martin Farach-Colton","doi":"arxiv-2401.08858","DOIUrl":"https://doi.org/arxiv-2401.08858","journal":"arXiv - CS - Operating Systems","publicationDate":"2024-01-16"}