The semantics of HPC storage systems are defined by the consistency models to which they adhere. Storage consistency models have been less studied than their counterparts in memory systems, with the exception of the POSIX standard and its strict consistency model. POSIX consistency imposes a performance penalty that grows as parallel file systems scale up and as the access time of storage devices, such as node-local solid-state drives, shrinks. While some efforts have been made to adopt relaxed storage consistency models, these models are often defined informally and ambiguously, as by-products of a particular implementation. In this work, we establish a connection between memory consistency models and storage consistency models and revisit the key design choices of storage consistency models from a high-level perspective. Further, we propose a formal and unified framework for defining storage consistency models and a layered implementation that can be used to easily evaluate their relative performance for different I/O workloads. Finally, we conduct a comprehensive performance comparison of two relaxed consistency models on a range of common parallel I/O workloads, such as checkpoint/restart in scientific applications and random reads in deep learning applications. We demonstrate that for certain I/O scenarios, a weaker consistency model can significantly improve I/O performance. For instance, on the small random reads typically found in deep learning applications, session consistency achieved a 5x improvement in I/O bandwidth over commit consistency, even at small scales.
"Formal Definitions and Performance Comparison of Consistency Models for Parallel File Systems." Chen Wang, Kathryn Mohror, Marc Snir. arXiv - CS - Operating Systems, arXiv:2402.14105, published 2024-02-21. https://doi.org/arxiv-2402.14105
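The commit-vs-session distinction in the abstract above can be illustrated with a toy model. This is a purely illustrative sketch under our own assumptions; the class and method names are not the paper's framework:

```python
class ToyFile:
    """Toy shared file contrasting two relaxed storage consistency models.

    - Commit consistency: a client's writes become visible to other clients
      only after an explicit commit operation (e.g. an fsync-like call).
    - Session consistency: a client's writes become visible only to sessions
      opened after the writer closes the file.
    """
    def __init__(self):
        self.published = {}   # state seen by newly opened sessions
        self.sessions = {}    # per-client private view

    def open(self, client):
        self.sessions[client] = dict(self.published)  # snapshot at open time

    def write(self, client, key, value):
        self.sessions[client][key] = value            # buffered, not yet visible

    def commit(self, client):
        self.published.update(self.sessions[client])  # commit-consistency publish

    def close(self, client):
        self.published.update(self.sessions.pop(client))  # session publish

    def read(self, client, key):
        return self.sessions[client].get(key)
```

Under session consistency, a reader whose session was opened before the writer's close never observes the writes; a session opened after the close does.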
Stefano Carnà, Serena Ferracci, Francesco Quaglia, Alessandro Pellegrini
We present a kernel-level infrastructure that allows system-wide detection of malicious applications attempting to exploit cache-based side-channel attacks to break the process confinement enforced by standard operating systems. This infrastructure relies on hardware performance counters to collect information at runtime from all applications running on the machine. High-level detection metrics are derived from these measurements to maximize the likelihood of promptly detecting a malicious application. Our experimental assessment shows that we can catch a large family of side-channel attacks with significantly reduced overhead. We also discuss countermeasures that can be enacted once a process is suspected of carrying out a side-channel attack, improving the overall tradeoff between the system's security level and the performance delivered to non-suspected processes.
"Fight Hardware with Hardware: System-wide Detection and Mitigation of Side-Channel Attacks using Performance Counters." arXiv - CS - Operating Systems, arXiv:2402.13281, published 2024-02-18. https://doi.org/arxiv-2402.13281
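The paper's detection pipeline is kernel-level; as a hedged sketch, one can imagine deriving a high-level metric from raw hardware-counter samples like this (the metric and threshold below are our illustrative assumptions, not the authors' exact formulas):

```python
def miss_ratio(samples):
    """Aggregate per-period (cache_accesses, cache_misses) samples into one metric."""
    accesses = sum(a for a, _ in samples)
    misses = sum(m for _, m in samples)
    return misses / accesses if accesses else 0.0

def is_suspect(samples, threshold=0.6):
    """Flag a process whose cache-miss ratio is far above benign levels.

    Eviction-based attacks (e.g. Prime+Probe) repeatedly flush and refill
    cache sets, which drives the observed miss ratio toward 1.0, well
    beyond what typical benign workloads exhibit."""
    return miss_ratio(samples) > threshold
```

A real detector would combine several counters and calibrate the threshold against benign workloads rather than hard-coding it.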
Juan Carlos Saez, Fernando Castro, Manuel Prieto-Matias
Asymmetric multicore processors (AMPs) couple high-performance big cores and low-power small cores with the same instruction-set architecture but different features, such as clock frequency or microarchitecture. Previous work has shown that asymmetric designs may deliver higher energy efficiency than symmetric multicores for diverse workloads. Despite their benefits, AMPs pose significant challenges to runtime systems of parallel programming models. While previous work has mainly explored how to efficiently execute task-based parallel applications on AMPs via enhancements in the runtime system, improving the performance of unmodified data-parallel applications on these architectures remains a big challenge. In this work we analyze the particular case of loop-based OpenMP applications, which are widely used today in scientific and engineering domains, and constitute the dominant application type in many parallel benchmark suites used for performance evaluation on multicore systems. We observed that conventional OpenMP loop-scheduling approaches are unable to efficiently cope with the load imbalance that naturally stems from the different performance delivered by big and small cores. To address this shortcoming, we propose Asymmetric Iteration Distribution (AID), a set of novel loop-scheduling methods for AMPs that distribute iterations unevenly across worker threads to efficiently deal with performance asymmetry. We implemented AID in libgomp (the GNU OpenMP runtime system) and evaluated it on two different asymmetric multicore platforms. Our analysis reveals that the AID methods constitute effective replacements of the static and dynamic methods on AMPs, and are capable of improving performance over these conventional strategies by up to 56% and 16.8%, respectively.
"Enabling performance portability of data-parallel OpenMP applications on asymmetric multicore processors." arXiv - CS - Operating Systems, arXiv:2402.07664, published 2024-02-12. https://doi.org/arxiv-2402.07664
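A minimal sketch of the uneven distribution idea behind AID. The weighting scheme here is an assumed proportional model for illustration, not one of the paper's actual scheduling methods:

```python
def aid_chunks(n_iters, big_threads, small_threads, big_speedup):
    """Partition n_iters so that each thread on a big core receives
    big_speedup times the iterations of a thread on a small core,
    roughly equalizing per-thread finish times under asymmetry."""
    weights = [big_speedup] * big_threads + [1.0] * small_threads
    total = sum(weights)
    chunks = [int(n_iters * w / total) for w in weights]
    chunks[0] += n_iters - sum(chunks)  # hand rounding leftovers to a big core
    return chunks
```

For 1000 iterations on 2 big-core threads (3x faster) and 4 small-core threads, this yields 300 iterations per big-core thread and 100 per small-core thread, whereas OpenMP's static schedule would give every thread about 167 and leave the small cores as stragglers.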
Large Language Models (LLMs) based on the Mixture-of-Experts (MoE) architecture show promising performance on various tasks. However, running them in resource-constrained settings, where GPU memory is not abundant, is challenging due to their huge model sizes. Existing systems that offload model weights to CPU memory suffer from the significant overhead of frequently moving data between the CPU and GPU. In this paper, we propose Fiddler, a resource-efficient inference engine with CPU-GPU orchestration for MoE models. The key idea of Fiddler is to use the computation ability of the CPU to minimize data movement between the CPU and GPU. Our evaluation shows that Fiddler can run the uncompressed Mixtral-8x7B model, which exceeds 90GB in parameters, generating over 3 tokens per second on a single GPU with 24GB of memory, an order of magnitude improvement over existing methods. The code of Fiddler is publicly available at https://github.com/efeslab/fiddler
"Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models." Keisuke Kamahori, Yile Gu, Kan Zhu, Baris Kasikci. arXiv - CS - Operating Systems, arXiv:2402.07033, published 2024-02-10. https://doi.org/arxiv-2402.07033
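Fiddler's key idea, that CPU compute on resident weights can beat shipping those weights over PCIe, can be sketched as a tiny cost model. Every parameter and the function name below are illustrative assumptions, not the paper's actual policy:

```python
def place_expert(weight_bytes, flops, pcie_gbps, cpu_gflops, gpu_gflops):
    """Choose where to run one offloaded MoE expert for this batch.

    Running it on the GPU requires first copying its weights over PCIe;
    the CPU can start immediately on weights already in host memory, so
    for small batches the slower CPU often finishes sooner."""
    transfer_s = weight_bytes * 8 / (pcie_gbps * 1e9)
    gpu_s = transfer_s + flops / (gpu_gflops * 1e9)
    cpu_s = flops / (cpu_gflops * 1e9)
    return "cpu" if cpu_s < gpu_s else "gpu"
```

With single-token decoding the compute per expert is tiny relative to the weight size, so the transfer term dominates and the CPU wins; once the batch is large enough, paying the transfer and using the GPU wins.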
"Rootless containers" is a technique for running the entire container runtime, and the containers themselves, without root privileges. It protects the host environment from attackers exploiting container runtime vulnerabilities. However, when rootless containers communicate with external endpoints, network performance is low compared to rootful containers because of the overhead of rootless networking components. In this paper, we propose bypass4netns, which accelerates TCP/IP communications in rootless containers by bypassing slow networking components. bypass4netns uses sockets allocated on the host. It switches sockets in containers to the host's sockets by intercepting syscalls and injecting file descriptors using Seccomp. Our Seccomp-based method can handle statically linked applications that previous works could not. We also propose high-performance rootless multi-node communication. We confirmed that rootless containers with bypass4netns achieve more than 30x higher throughput than rootless containers without it. In addition, we evaluated performance with applications and observed large improvements for some of them.
"bypass4netns: Accelerating TCP/IP Communications in Rootless Containers." Naoki Matsumoto, Akihiro Suda. arXiv - CS - Operating Systems, arXiv:2402.00365, published 2024-02-01. https://doi.org/arxiv-2402.00365
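The per-connection decision behind the socket switching can be sketched conceptually. This is a toy model under our own assumptions (the factory arguments and subnet are illustrative); the real mechanism intercepts connect(2) with a Seccomp notifier and injects host file descriptors into the container process:

```python
def route_connect(dest_ip, make_host_socket, make_netns_socket,
                  container_subnet="10.0.2."):
    """Decide which socket serves a container's connect() call.

    Traffic to external endpoints gets a socket created in the host
    namespace, bypassing slow user-mode networking components such as
    slirp4netns; intra-subnet traffic keeps the container netns path so
    container-to-container semantics are preserved."""
    if dest_ip.startswith(container_subnet):
        return make_netns_socket()
    return make_host_socket()
```

In the real system the replacement happens at the fd level, which is why even statically linked applications (no LD_PRELOAD hooks possible) are covered.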
The widespread deployment of control-flow integrity has propelled non-control data attacks into the mainstream. In the domain of OS kernel exploits, by corrupting critical non-control data, local attackers can directly gain root access or privilege escalation without hijacking the control flow. As a result, OS kernels have been restricting the availability of such non-control data. This forces attackers to continue to search for more exploitable non-control data in OS kernels. However, discovering unknown non-control data can be daunting because it is often tied heavily to semantics and lacks universal patterns. We make two contributions in this paper: (1) discovering critical non-control objects in the file subsystem and (2) analyzing their exploitability. This work represents the first study, with minimal domain knowledge, to semi-automatically discover and evaluate exploitable non-control data within the file subsystem of the Linux kernel. Our solution utilizes a custom analysis and testing framework that statically and dynamically identifies promising candidate objects. Furthermore, we categorize these discovered objects into types that are suitable for various exploit strategies, including a novel strategy necessary to overcome the defense that isolates many of these objects. These objects have the advantage of being exploitable without requiring KASLR, thus making the exploits simpler and more reliable. We use 18 real-world CVEs to evaluate the exploitability of the file system objects using various exploit strategies. We develop 10 end-to-end exploits using a subset of CVEs against the kernel with all state-of-the-art mitigations enabled.
"Beyond Control: Exploring Novel File System Objects for Data-Only Attacks on Linux Systems." Jinmeng Zhou, Jiayi Hu, Ziyue Pan, Jiaxun Zhu, Guoren Li, Wenbo Shen, Yulei Sui, Zhiyun Qian. arXiv - CS - Operating Systems, arXiv:2401.17618, published 2024-01-31. https://doi.org/arxiv-2401.17618
Bin Gao, Qingxuan Kang, Hao-Wei Tee, Kyle Timothy Ng Chu, Alireza Sanaee, Djordje Jevdjic
Memory management operations that modify page-tables, typically performed during memory allocation/deallocation, are infamous for their poor performance in highly threaded applications, largely due to process-wide TLB shootdowns that the OS must issue due to the lack of hardware support for TLB coherence. We study these operations in NUMA settings, where we observe up to 40x overhead for basic operations such as munmap or mprotect. The overhead further increases if page-table replication is used, where complete coherent copies of the page-tables are maintained across all NUMA nodes. While eager system-wide replication is extremely effective at localizing page-table reads during address translation, we find that it creates additional penalties upon any page-table changes due to the need to keep all replicas coherent. In this paper, we propose a novel page-table management mechanism, called numaPTE, to enable transparent, on-demand, and partial page-table replication across NUMA nodes in order to perform address translation locally, while avoiding the overheads and scalability issues of system-wide full page-table replication. We then show that numaPTE's precise knowledge of page-table sharers can be leveraged to significantly reduce the number of TLB shootdowns issued upon any memory-management operation. As a result, numaPTE not only avoids replication-related slowdowns, but also provides significant speedup over the baseline on memory allocation/deallocation and access control operations. We implement numaPTE in Linux on x86_64, evaluate it on 4- and 8-socket systems, and show that numaPTE achieves the full benefits of eager page-table replication on a wide range of applications, while also achieving a 12% and 36% runtime improvement on Webserver and Memcached respectively due to a significant reduction in TLB shootdowns.
"numaPTE: Managing Page-Tables and TLBs on NUMA Systems." arXiv - CS - Operating Systems, arXiv:2401.15558, published 2024-01-28. https://doi.org/arxiv-2401.15558
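The sharer-tracking idea, sending shootdowns only to CPUs that may actually cache the affected translation, can be sketched as follows. This is a toy simplification; the class and method names are our assumptions, not numaPTE's interfaces:

```python
class ReplicatedPageTable:
    """Toy model of on-demand, partial page-table replication with
    precise TLB-shootdown targeting."""
    def __init__(self):
        self.sharers = {}  # vaddr -> set of NUMA nodes holding a replica

    def translate(self, vaddr, node):
        # The first access from a node creates its local replica entry on
        # demand, so later translations of vaddr read node-local memory.
        self.sharers.setdefault(vaddr, set()).add(node)

    def unmap(self, vaddr, cpus_per_node):
        # Instead of IPI-ing every CPU running the process (the default
        # for process-wide shootdowns), interrupt only CPUs on nodes
        # known to share this translation.
        nodes = self.sharers.pop(vaddr, set())
        return sorted(c for n in nodes for c in cpus_per_node[n])
```

On a 4-node machine where only two nodes ever touched a page, an unmap here interrupts half the CPUs a process-wide shootdown would.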
GPU remoting is a promising technique for supporting AI applications. Networking plays a key role in enabling remoting. However, for efficient remoting, the network requirements in terms of latency and bandwidth are unknown. In this paper, we take a GPU-centric approach to derive the minimum latency and bandwidth requirements for GPU remoting, while ensuring no (or little) performance degradation for AI applications. Our study, which includes a theoretical model, demonstrates that with careful remoting design, unmodified AI applications can run on the remoting setup using commodity networking hardware without any overhead, or even with better performance, while placing low demands on the network.
"Characterizing Network Requirements for GPU API Remoting in AI Applications." Tianxia Wang, Zhuofu Chen, Xingda Wei, Jinyu Gu, Rong Chen, Haibo Chen. arXiv - CS - Operating Systems, arXiv:2401.13354, published 2024-01-24. https://doi.org/arxiv-2401.13354
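A back-of-the-envelope version of the kind of condition such a study derives (this is not the paper's actual model; every parameter and the pipelining assumption below are ours): with asynchronous, pipelined API forwarding, remoting adds no overhead when the per-call network cost hides behind kernel execution on the remote GPU.

```python
def remoting_overhead_free(kernel_ms, rtt_ms, payload_kb, bw_gbps,
                           pipeline_depth=2):
    """True if forwarding one GPU API call over the network can be fully
    overlapped with the kernel's execution time on the remote GPU."""
    xfer_ms = payload_kb * 8e3 / (bw_gbps * 1e9) * 1e3  # serialize payload
    effective_latency_ms = rtt_ms / pipeline_depth      # partly hidden by pipelining
    return effective_latency_ms + xfer_ms <= kernel_ms
```

Under this crude model, millisecond-scale kernels tolerate datacenter RTTs and modest bandwidth easily, which is consistent with the abstract's claim that commodity networking hardware can suffice.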
With the advent of byte-addressable memory devices, such as CXL memory, persistent memory, and storage-class memory, tiered memory systems have become a reality. Page migration is the de facto method within operating systems for managing tiered memory. It aims to bring hot data whenever possible into fast memory to optimize the performance of data accesses, while using slow memory to accommodate data spilled from fast memory. While existing research has demonstrated the effectiveness of various optimizations on page migration, it falls short of addressing a fundamental question: Is exclusive memory tiering, in which a page is present in either fast memory or slow memory but not both simultaneously, the optimal strategy for tiered memory management? We demonstrate that page migration-based exclusive memory tiering suffers significant performance degradation when fast memory is under pressure. In this paper, we propose non-exclusive memory tiering, a page management strategy that retains a copy of pages recently promoted from slow memory to fast memory to mitigate memory thrashing. To enable non-exclusive memory tiering, we develop MATRYOSHKA, a new mechanism that features transactional page migration and page shadowing. MATRYOSHKA moves page migration off the program's critical path and makes migration asynchronous. Evaluations with microbenchmarks and real-world applications show that MATRYOSHKA achieves a 6x performance improvement over the state-of-the-art transparent page placement (TPP) approach under memory pressure. We also compare MATRYOSHKA with a recently proposed sampling-based migration approach and demonstrate MATRYOSHKA's strengths and potential weaknesses in various scenarios. Through the evaluations, we discover a serious issue facing all tested approaches, unfortunately including MATRYOSHKA, and call for further research on tiered memory-aware memory allocation.
"MATRYOSHKA: Non-Exclusive Memory Tiering via Transactional Page Migration." Lingfeng Xiang, Zhen Lin, Weishu Deng, Hui Lu, Jia Rao, Yifan Yuan, Ren Wang. arXiv - CS - Operating Systems, arXiv:2401.13154, published 2024-01-24. https://doi.org/arxiv-2401.13154
Alex Conway, Ainesh Bakshi, Arghya Bhattacharya, Rory Bennett, Yizheng Jiao, Eric Knorr, Yang Zhan, Michael A. Bender, William Jannen, Rob Johnson, Bradley C. Kuszmaul, Donald E. Porter, Jun Yuan, Martin Farach-Colton
File systems must allocate space for files without knowing what will be added or removed in the future. Over the life of a file system, this may cause suboptimal file placement decisions that eventually lead to slower performance, or aging. Conventional wisdom suggests that file system aging is a solved problem in the common case; heuristics to avoid aging, such as colocating related files and data blocks, are effective until a storage device fills up, at which point space pressure exacerbates fragmentation-based aging. However, this article describes both realistic and synthetic workloads that can cause these heuristics to fail, inducing large performance declines due to aging, even when the storage device is nearly empty.

We argue that these slowdowns are caused by poor layout. We demonstrate a correlation between the read performance of a directory scan and the locality within a file system's access patterns, using a dynamic layout score. We complement these results with microbenchmarks that show that space pressure can cause a substantial amount of inter-file and intra-file fragmentation. However, our results suggest that the effect of free-space fragmentation on read performance is best described as accelerating the file system aging process. The effect on write performance is non-existent in some cases, and, in most cases, an order of magnitude smaller than the read degradation from fragmentation caused by normal usage.

In short, many file systems are exquisitely prone to read aging after a variety of write patterns. We show, however, that aging is not inevitable. BetrFS, a file system based on write-optimized dictionaries, exhibits almost no aging in our experiments. We present a framework for understanding and predicting aging, and identify the key features of BetrFS that avoid aging.
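The dynamic layout score mentioned in the abstract can be sketched as the fraction of blocks in an access sequence that are physically contiguous with the block accessed immediately before them: a score of 1.0 indicates a perfectly sequential scan, and values near 0 indicate heavy fragmentation. This formulation is a plausible reading of the metric, not necessarily the article's exact definition.

```python
# Illustrative dynamic layout score: the fraction of accesses (after the
# first) that land on the physical block immediately following the previous
# access. Assumption: block_sequence lists physical block numbers in the
# order a directory scan reads them.

def dynamic_layout_score(block_sequence):
    if len(block_sequence) < 2:
        return 1.0   # a single access is trivially "sequential"
    contiguous = sum(
        1 for prev, cur in zip(block_sequence, block_sequence[1:])
        if cur == prev + 1
    )
    return contiguous / (len(block_sequence) - 1)
```

A freshly laid-out file system tends to produce sequences like `[10, 11, 12, 13]` (score 1.0), while an aged one scatters related blocks, driving the score toward 0 and read performance down with it.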
{"title":"File System Aging","authors":"Alex Conway, Ainesh Bakshi, Arghya Bhattacharya, Rory Bennett, Yizheng Jiao, Eric Knorr, Yang Zhan, Michael A. Bender, William Jannen, Rob Johnson, Bradley C. Kuszmaul, Donald E. Porter, Jun Yuan, Martin Farach-Colton","doi":"arxiv-2401.08858","DOIUrl":"https://doi.org/arxiv-2401.08858","journal":"arXiv - CS - Operating Systems","publicationDate":"2024-01-16"}