Checkpointing (C) and restoring (R) are key components for GPU tasks. POS is an OS-level GPU C/R system: it can transparently checkpoint or restore processes that use the GPU, without requiring any cooperation from the application, a key feature required by modern systems such as the cloud. Moreover, POS is the first OS-level C/R system that can execute C/R concurrently with application execution: a capability that is trivial when processes run only on the CPU, but becomes challenging when they use the GPU. The problem is how to ensure consistency during concurrent execution when transparency leaves the system without application semantics. CPU processes can rely on OS and hardware paging to fix inconsistencies without application semantics; unfortunately, GPUs bypass the OS and paging for high performance. POS fills this semantic gap by speculatively extracting the buffer-access information of GPU kernels at runtime. Thanks to the simple and well-structured nature of GPU kernels, our speculative extraction (with runtime validation) achieves 100% accuracy on applications ranging from training to inference, with domains spanning vision, large language models, and reinforcement learning. Based on the extracted semantics, POS systematically overlaps C/R with application execution and achieves orders-of-magnitude higher performance than the state-of-the-art OS-level GPU C/R across various tasks, including training fault tolerance, live GPU process migration, and cold-start acceleration in GPU-based serverless computing.
{"title":"PARALLELGPUOS: A Concurrent OS-level GPU Checkpoint and Restore System using Validated Speculation","authors":"Zhuobin Huang, Xingda Wei, Yingyi Hao, Rong Chen, Mingcong Han, Jinyu Gu, Haibo Chen","doi":"arxiv-2405.12079","DOIUrl":"https://doi.org/arxiv-2405.12079","url":null,"abstract":"Checkpointing (C) and restoring (R) are key components for GPU tasks. POS is\u0000an OS-level GPU C/R system: It can transparently checkpoint or restore\u0000processes that use the GPU, without requiring any cooperation from the\u0000application, a key feature required by modern systems like the cloud. Moreover,\u0000POS is the first OS-level C/R system that can concurrently execute C/R with the\u0000application execution: a critical feature that can be trivially achieved when\u0000the processes only running on the CPU, but becomes challenging when the\u0000processes use GPU. The problem is how to ensure consistency during concurrent\u0000execution with the lack of application semantics due to transparency. CPU\u0000processes can leverage OS and hardware paging to fix inconsistency without\u0000application semantics. Unfortunately, GPU bypasses OS and paging for high\u0000performance. POS fills the semantic gap by speculatively extracting buffer\u0000access information of GPU kernels during runtime. Thanks to the simple and\u0000well-structured nature of GPU kernels, our speculative extraction (with runtime\u0000validation) achieves 100% accuracy on applications from training to inference\u0000whose domains span from vision, large language models, and reinforcement\u0000learning. Based on the extracted semantics, we systematically overlap C/R with\u0000application execution, and achieves orders of magnitude higher performance\u0000under various tasks compared with the state-of-the-art OS-level GPU C/R,\u0000including training fault tolerance, live GPU process migration, and cold starts\u0000acceleration in GPU-based serverless computing.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"64 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141147277","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bhagyashri Tushir, Vikram K Ramanna, Yuhong Liu, Behnam Dezfouli
Identifying IoT devices is crucial for network monitoring, security enforcement, and inventory tracking. However, most existing identification methods rely on deep packet inspection, which raises privacy concerns and adds computational complexity. More importantly, existing works overlook the impact of wireless channel dynamics on the accuracy of layer-2 features, thereby limiting their effectiveness in real-world scenarios. In this work, we define and use the latency of specific probe-response packet exchanges, referred to as "device latency," as the main feature for device identification. Additionally, we reveal the critical impact of wireless channel dynamics on the accuracy of device identification based on device latency. Specifically, this work introduces "accumulation score" as a novel approach to capturing fine-grained channel dynamics and their impact on device latency when training machine learning models. We implement the proposed methods and measure the accuracy and overhead of device identification in real-world scenarios. The results confirm that by incorporating the accumulation score for balanced data collection and training machine learning algorithms, we achieve an F1 score of over 97% for device identification, even amidst wireless channel dynamics, a significant improvement over the 75% F1 score achieved by disregarding the impact of channel dynamics on data collection and device latency.
{"title":"Leveraging Machine Learning for Accurate IoT Device Identification in Dynamic Wireless Contexts","authors":"Bhagyashri Tushir, Vikram K Ramanna, Yuhong Liu, Behnam Dezfouli","doi":"arxiv-2405.17442","DOIUrl":"https://doi.org/arxiv-2405.17442","url":null,"abstract":"Identifying IoT devices is crucial for network monitoring, security\u0000enforcement, and inventory tracking. However, most existing identification\u0000methods rely on deep packet inspection, which raises privacy concerns and adds\u0000computational complexity. More importantly, existing works overlook the impact\u0000of wireless channel dynamics on the accuracy of layer-2 features, thereby\u0000limiting their effectiveness in real-world scenarios. In this work, we define\u0000and use the latency of specific probe-response packet exchanges, referred to as\u0000\"device latency,\" as the main feature for device identification. Additionally,\u0000we reveal the critical impact of wireless channel dynamics on the accuracy of\u0000device identification based on device latency. Specifically, this work\u0000introduces \"accumulation score\" as a novel approach to capturing fine-grained\u0000channel dynamics and their impact on device latency when training machine\u0000learning models. We implement the proposed methods and measure the accuracy and\u0000overhead of device identification in real-world scenarios. The results confirm\u0000that by incorporating the accumulation score for balanced data collection and\u0000training machine learning algorithms, we achieve an F1 score of over 97% for\u0000device identification, even amidst wireless channel dynamics, a significant\u0000improvement over the 75% F1 score achieved by disregarding the impact of\u0000channel dynamics on data collection and device latency.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"70 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141170877","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient task scheduling in heterogeneous computing environments is imperative for optimizing resource utilization and minimizing task completion times. In this study, we conducted a comprehensive benchmarking analysis to evaluate the performance of four scheduling algorithms: First-Come, First-Served (FCFS), FCFS with No Queuing (FCFS-NQ), Minimum Expected Completion Time (MECT), and Minimum Expected Execution Time (MEET), across varying workload scenarios. We defined three workload scenarios: low, medium, and high, each representing a different level of computational demand. Through rigorous experimentation and analysis, we assessed the effectiveness of each algorithm in terms of total completion percentage, energy consumption, wasted energy, and energy per completion. Our findings highlight the strengths and limitations of each algorithm, with MECT and MEET emerging as robust contenders that dynamically prioritize tasks based on comprehensive estimates of completion and execution times. Furthermore, MECT and MEET exhibit superior energy efficiency compared to FCFS and FCFS-NQ, underscoring their suitability for resource-constrained environments. This study provides valuable insights into the efficacy of task scheduling algorithms in heterogeneous computing environments, enabling informed decision-making to enhance resource allocation, minimize task completion times, and improve energy efficiency.
{"title":"Optimizing Task Scheduling in Heterogeneous Computing Environments: A Comparative Analysis of CPU, GPU, and ASIC Platforms Using E2C Simulator","authors":"Ali Mohammadjafari, Poorya Khajouie","doi":"arxiv-2405.08187","DOIUrl":"https://doi.org/arxiv-2405.08187","url":null,"abstract":"Efficient task scheduling in heterogeneous computing environments is\u0000imperative for optimizing resource utilization and minimizing task completion\u0000times. In this study, we conducted a comprehensive benchmarking analysis to\u0000evaluate the performance of four scheduling algorithms First Come, First-Served\u0000(FCFS), FCFS with No Queuing (FCFS-NQ), Minimum Expected Completion Time\u0000(MECT), and Minimum Expected Execution Time (MEET) across varying workload\u0000scenarios. We defined three workload scenarios: low, medium, and high, each\u0000representing different levels of computational demands. Through rigorous\u0000experimentation and analysis, we assessed the effectiveness of each algorithm\u0000in terms of total completion percentage, energy consumption, wasted energy, and\u0000energy per completion. Our findings highlight the strengths and limitations of\u0000each algorithm, with MECT and MEET emerging as robust contenders, dynamically\u0000prioritizing tasks based on comprehensive estimates of completion and execution\u0000times. Furthermore, MECT and MEET exhibit superior energy efficiency compared\u0000to FCFS and FCFS-NQ, underscoring their suitability for resource-constrained\u0000environments. This study provides valuable insights into the efficacy of task\u0000scheduling algorithms in heterogeneous computing environments, enabling\u0000informed decision-making to enhance resource allocation, minimize task\u0000completion times, and improve energy efficiency","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"71 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141063715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reid Priedhorsky (Los Alamos National Laboratory), Michael Jennings (Los Alamos National Laboratory), Megan Phinney
Do Linux distribution package managers need the privileged operations they request to actually happen? Apparently not, at least for building container images for HPC applications. We use this observation to implement a root emulation mode using a Linux seccomp filter that intercepts some privileged system calls, does nothing, and returns success to the calling program. This approach provides no consistency whatsoever but appears sufficient to build all Dockerfiles we examined, simplifying fully-unprivileged workflows needed for HPC application containers.
{"title":"Zero-consistency root emulation for unprivileged container image build","authors":"Reid PriedhorskyLos Alamos National Laboratory, Michael JenningsLos Alamos National Laboratory, Megan Phinney","doi":"arxiv-2405.06085","DOIUrl":"https://doi.org/arxiv-2405.06085","url":null,"abstract":"Do Linux distribution package managers need the privileged operations they\u0000request to actually happen? Apparently not, at least for building container\u0000images for HPC applications. We use this observation to implement a root\u0000emulation mode using a Linux seccomp filter that intercepts some privileged\u0000system calls, does nothing, and returns success to the calling program. This\u0000approach provides no consistency whatsoever but appears sufficient to build all\u0000Dockerfiles we examined, simplifying fully-unprivileged workflows needed for\u0000HPC application containers.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140929077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Maxime Letemple, Gaulthier Gain, Sami Ben Mariem, Laurent Mathy, Benoit Donnet
The last twenty years have seen the development and growing popularity of network measurement infrastructures. Internet measurement platforms have become common and have demonstrated their relevance for Internet understanding and security observation. However, despite their popularity, those platforms lack flexibility and reactivity, as they are usually used for longitudinal measurements. As a consequence, they may miss security- or Internet-related events. During the same period, operating systems have evolved toward virtual machines (VMs) as self-contained units for running applications, with the recent rise of unikernels, ultra-lightweight VMs tailored for specific applications that eliminate the need for a host OS. In this paper, we advocate that measurement infrastructures could take advantage of unikernels to become more flexible and efficient. We propose uTNT, a proof-of-concept unikernel-based implementation of TNT, a traceroute extension able to reveal MPLS tunnels. This paper documents the full toolchain for porting TNT into a unikernel and evaluates uTNT's performance with respect to more traditional approaches. The paper also discusses a use case for which uTNT is particularly well suited. The uTNT source code is publicly available on GitLab.
{"title":"uTNT: Unikernels for Efficient and Flexible Internet Probing","authors":"Maxime Letemple, Gaulthier Gain, Sami Ben Mariem, Laurent Mathy, Benoit Donnet","doi":"arxiv-2405.04036","DOIUrl":"https://doi.org/arxiv-2405.04036","url":null,"abstract":"The last twenty years have seen the development and popularity of network\u0000measurement infrastructures. Internet measurement platforms have become common\u0000and have demonstrated their relevance in Internet understanding and security\u0000observation. However, despite their popularity, those platforms lack of\u0000flexibility and reactivity, as they are usually used for longitudinal\u0000measurements. As a consequence, they may miss detecting events that are\u0000security or Internet-related. During the same period, operating systems have\u0000evolved to virtual machines (VMs) as self-contained units for running\u0000applications, with the recent rise of unikernels, ultra-lightweight VMs\u0000tailored for specific applications, eliminating the need for a host OS. In this\u0000paper, we advocate that measurement infrastructures could take advantage of\u0000unikernels to become more flexible and efficient. We propose uTNT, a\u0000proof-of-concept unikernel-based implementation of TNT, a traceroute extension\u0000able to reveal MPLS tunnels. This paper documents the full toolchain for\u0000porting TNT into a unikernel and evaluates uTNT performance with respect to\u0000more traditional approaches. The paper also discusses a use case in which uTNT\u0000could find a suitable usage. uTNT source code is publicly available on Gitlab.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140928991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient use of GPU memory is essential for high-throughput LLM inference. Prior systems reserved memory for the KV-cache ahead of time, resulting in wasted capacity due to internal fragmentation. Inspired by OS-based virtual memory systems, vLLM proposed PagedAttention to enable dynamic memory allocation for the KV-cache. This approach eliminates fragmentation, enabling high-throughput LLM serving with larger batch sizes. However, to be able to allocate physical memory dynamically, PagedAttention changes the layout of the KV-cache from contiguous virtual memory to non-contiguous virtual memory. This change requires attention kernels to be rewritten to support paging, and the serving framework to implement a memory manager. Thus, the PagedAttention model leads to software complexity, portability issues, redundancy, and inefficiency. In this paper, we propose vAttention for dynamic KV-cache memory management. In contrast to PagedAttention, vAttention retains the KV-cache in contiguous virtual memory and leverages existing low-level system support for demand paging to enable on-demand physical memory allocation. Thus, vAttention unburdens the attention kernel developer from having to explicitly support paging and avoids re-implementation of memory management in the serving framework. We show that vAttention enables seamless dynamic memory management for unchanged implementations of various attention kernels. vAttention also generates tokens up to 1.97x faster than vLLM, while processing input prompts up to 3.92x and 1.45x faster than the PagedAttention variants of FlashAttention and FlashInfer.
{"title":"vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention","authors":"Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar","doi":"arxiv-2405.04437","DOIUrl":"https://doi.org/arxiv-2405.04437","url":null,"abstract":"Efficient use of GPU memory is essential for high throughput LLM inference.\u0000Prior systems reserved memory for the KV-cache ahead-of-time, resulting in\u0000wasted capacity due to internal fragmentation. Inspired by OS-based virtual\u0000memory systems, vLLM proposed PagedAttention to enable dynamic memory\u0000allocation for KV-cache. This approach eliminates fragmentation, enabling\u0000high-throughput LLM serving with larger batch sizes. However, to be able to\u0000allocate physical memory dynamically, PagedAttention changes the layout of\u0000KV-cache from contiguous virtual memory to non-contiguous virtual memory. This\u0000change requires attention kernels to be rewritten to support paging, and\u0000serving framework to implement a memory manager. Thus, the PagedAttention model\u0000leads to software complexity, portability issues, redundancy and inefficiency. In this paper, we propose vAttention for dynamic KV-cache memory management.\u0000In contrast to PagedAttention, vAttention retains KV-cache in contiguous\u0000virtual memory and leverages low-level system support for demand paging, that\u0000already exists, to enable on-demand physical memory allocation. Thus,\u0000vAttention unburdens the attention kernel developer from having to explicitly\u0000support paging and avoids re-implementation of memory management in the serving\u0000framework. We show that vAttention enables seamless dynamic memory management\u0000for unchanged implementations of various attention kernels. vAttention also\u0000generates tokens up to 1.97x faster than vLLM, while processing input prompts\u0000up to 3.92x and 1.45x faster than the PagedAttention variants of FlashAttention\u0000and FlashInfer.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140928988","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The commonly used caching policies, such as LRU or LFU, exhibit optimal performance only for specific traffic patterns. Even advanced Machine Learning-based methods, which detect patterns in historical request data, struggle when future requests deviate from past trends. Recently, a new class of policies has emerged that makes no assumptions about the request arrival process. These algorithms solve an online optimization problem, enabling continuous adaptation to the context. They offer theoretical guarantees on the regret metric, which is the gap between the gain of the online policy and the gain of the optimal static cache allocation in hindsight. Nevertheless, the high computational complexity of these solutions hinders their practical adoption. In this study, we introduce a groundbreaking gradient-based online caching policy, the first to achieve logarithmic computational complexity relative to catalog size along with regret guarantees. This means our algorithm can efficiently handle large-scale data while minimizing the performance gap between real-time decisions and optimal hindsight choices. As requests arrive, our policy dynamically adjusts the probabilities of including items in the cache, which drive cache update decisions. Our algorithm's streamlined complexity is a key advantage, enabling its application to real-world traces featuring millions of requests and items. This is a significant achievement, as traces of this scale have been out of reach for existing policies with regret guarantees. To the best of our knowledge, our experimental results show for the first time that the regret guarantees of gradient-based caching policies bring significant benefits in scenarios of practical interest.
{"title":"An Online Gradient-Based Caching Policy with Logarithmic Complexity and Regret Guarantees","authors":"Damiano Carra, Giovanni Neglia","doi":"arxiv-2405.01263","DOIUrl":"https://doi.org/arxiv-2405.01263","url":null,"abstract":"The commonly used caching policies, such as LRU or LFU, exhibit optimal\u0000performance only for specific traffic patterns. Even advanced Machine\u0000Learning-based methods, which detect patterns in historical request data,\u0000struggle when future requests deviate from past trends. Recently, a new class\u0000of policies has emerged that makes no assumptions about the request arrival\u0000process. These algorithms solve an online optimization problem, enabling\u0000continuous adaptation to the context. They offer theoretical guarantees on the\u0000regret metric, which is the gap between the gain of the online policy and the\u0000gain of the optimal static cache allocation in hindsight. Nevertheless, the\u0000high computational complexity of these solutions hinders their practical\u0000adoption. In this study, we introduce a groundbreaking gradient-based online\u0000caching policy, the first to achieve logarithmic computational complexity\u0000relative to catalog size along with regret guarantees. This means our algorithm\u0000can efficiently handle large-scale data while minimizing the performance gap\u0000between real-time decisions and optimal hindsight choices. As requests arrive,\u0000our policy dynamically adjusts the probabilities of including items in the\u0000cache, which drive cache update decisions. Our algorithm's streamlined\u0000complexity is a key advantage, enabling its application to real-world traces\u0000featuring millions of requests and items. This is a significant achievement, as\u0000traces of this scale have been out of reach for existing policies with regret\u0000guarantees. To the best of our knowledge, our experimental results show for the\u0000first time that the regret guarantees of gradient-based caching policies bring\u0000significant benefits in scenarios of practical interest.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"837 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140840024","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Luis Gerhorst, Henriette Herzog, Peter Wägemann, Maximilian Ott, Rüdiger Kapitza, Timo Hönig
High-performance I/O demands low-overhead communication between user and kernel space. This demand can no longer be fulfilled by traditional system calls. Linux's extended Berkeley Packet Filter (BPF) avoids user/kernel transitions by just-in-time compiling user-provided bytecode and executing it in kernel mode at near-native speed. To still isolate BPF programs from the kernel, they are statically analyzed for memory- and type-safety, which imposes some restrictions but allows for good expressiveness and high performance. However, to mitigate the Spectre vulnerabilities disclosed in 2018, defenses that reject potentially dangerous programs had to be deployed. We find that this affects 24% to 54% of programs in a dataset of 844 real-world BPF programs from popular open-source projects. To work around this, users are forced to disable the defenses to continue using the programs, which puts the entire system at risk. To enable secure and expressive untrusted Linux kernel extensions, we propose Berrify, an enhancement to the kernel's Spectre defenses that reduces the number of rejected BPF application programs from 54% to zero. We measure Berrify's overhead for all mainstream performance-sensitive applications of BPF (i.e., event tracing, profiling, and packet processing) and find that it improves significantly upon the status quo, in which affected BPF programs are either unusable or enable transient-execution attacks on the kernel.
{"title":"Mitigating Spectre-PHT using Speculation Barriers in Linux BPF","authors":"Luis Gerhorst, Henriette Herzog, Peter Wägemann, Maximilian Ott, Rüdiger Kapitza, Timo Hönig","doi":"arxiv-2405.00078","DOIUrl":"https://doi.org/arxiv-2405.00078","url":null,"abstract":"High-performance IO demands low-overhead communication between user- and\u0000kernel space. This demand can no longer be fulfilled by traditional system\u0000calls. Linux's extended Berkeley Packet Filter (BPF) avoids user-/kernel\u0000transitions by just-in-time compiling user-provided bytecode and executing it\u0000in kernel mode with near-native speed. To still isolate BPF programs from the\u0000kernel, they are statically analyzed for memory- and type-safety, which imposes\u0000some restrictions but allows for good expressiveness and high performance.\u0000However, to mitigate the Spectre vulnerabilities disclosed in 2018, defenses\u0000which reject potentially-dangerous programs had to be deployed. We find that\u0000this affects 24% to 54% of programs in a dataset with 844 real-world BPF\u0000programs from popular open-source projects. To solve this, users are forced to\u0000disable the defenses to continue using the programs, which puts the entire\u0000system at risk. To enable secure and expressive untrusted Linux kernel extensions, we propose\u0000Berrify, an enhancement to the kernel's Spectre defenses that reduces the\u0000number of BPF application programs rejected from 54% to zero. We measure\u0000Berrify's overhead for all mainstream performance-sensitive applications of BPF\u0000(i.e., event tracing, profiling, and packet processing) and find that it\u0000improves significantly upon the status-quo where affected BPF programs are\u0000either unusable or enable transient execution attacks on the kernel.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140840091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lazar Cvetković, François Costa, Mihajlo Djokic, Michal Friedman, Ana Klimovic
While Function as a Service (FaaS) platforms can initialize function sandboxes on worker nodes in 10-100s of milliseconds, the latency to schedule functions in real FaaS clusters can be orders of magnitude higher. We find that the current approach of building FaaS cluster managers on top of legacy orchestration systems like Kubernetes leads to high scheduling delay at high sandbox churn, which is typical in FaaS clusters. While generic cluster managers use hierarchical abstractions and multiple internal components to manage and reconcile state with frequent persistent updates, this becomes a bottleneck for FaaS, where cluster state frequently changes as sandboxes are created on the critical path of requests. Based on our root cause analysis of performance issues in existing FaaS cluster managers, we propose Dirigent, a clean-slate system architecture for FaaS orchestration with three key principles. First, Dirigent optimizes internal cluster manager abstractions to simplify state management. Second, it eliminates persistent state updates on the critical path of function invocations, leveraging the fact that FaaS abstracts sandboxes from users to relax exact state reconstruction guarantees. Finally, Dirigent runs monolithic control and data planes to minimize internal communication overheads and maximize throughput. We compare Dirigent to state-of-the-art FaaS platforms and show that Dirigent reduces 99th percentile per-function scheduling latency for a production workload by 2.79x compared to AWS Lambda and can spin up 2500 sandboxes per second at low latency, which is 1250x more than with Knative.
{"title":"Dirigent: Lightweight Serverless Orchestration","authors":"Lazar Cvetković, François Costa, Mihajlo Djokic, Michal Friedman, Ana Klimovic","doi":"arxiv-2404.16393","DOIUrl":"https://doi.org/arxiv-2404.16393","url":null,"abstract":"While Function as a Service (FaaS) platforms can initialize function\u0000sandboxes on worker nodes in 10-100s of milliseconds, the latency to schedule\u0000functions in real FaaS clusters can be orders of magnitude higher. We find that\u0000the current approach of building FaaS cluster managers on top of legacy\u0000orchestration systems like Kubernetes leads to high scheduling delay at high\u0000sandbox churn, which is typical in FaaS clusters. While generic cluster\u0000managers use hierarchical abstractions and multiple internal components to\u0000manage and reconcile state with frequent persistent updates, this becomes a\u0000bottleneck for FaaS, where cluster state frequently changes as sandboxes are\u0000created on the critical path of requests. Based on our root cause analysis of\u0000performance issues in existing FaaS cluster managers, we propose Dirigent, a\u0000clean-slate system architecture for FaaS orchestration with three key\u0000principles. First, Dirigent optimizes internal cluster manager abstractions to\u0000simplify state management. Second, it eliminates persistent state updates on\u0000the critical path of function invocations, leveraging the fact that FaaS\u0000abstracts sandboxes from users to relax exact state reconstruction guarantees.\u0000Finally, Dirigent runs monolithic control and data planes to minimize internal\u0000communication overheads and maximize throughput. We compare Dirigent to\u0000state-of-the-art FaaS platforms and show that Dirigent reduces 99th percentile\u0000per-function scheduling latency for a production workload by 2.79x compared to\u0000AWS Lambda and can spin up 2500 sandboxes per second at low latency, which is\u00001250x more than with Knative.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"244 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140801329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Memory accounts for 33 - 50% of the total cost of ownership (TCO) in modern data centers. We propose a novel solution to tame memory TCO through the creation and judicious management of multiple software-defined compressed memory tiers. In contrast to state-of-the-art solutions that employ a 2-tier approach (a single compressed tier alongside DRAM), we define multiple compressed tiers implemented through a combination of different compression algorithms, memory allocators for compressed objects, and backing media to store compressed objects. These compressed memory tiers represent distinct points in the spectrum of access latency, data compressibility, and unit memory usage cost, allowing rich and flexible trade-offs between memory TCO savings and application performance impact. A key advantage of ntier is that it enables aggressive memory TCO savings by placing warm data in low-latency compressed tiers with a reasonable performance impact while simultaneously placing cold data in the tiers with the best memory TCO savings. We believe our work represents an important server-system configuration and optimization capability for achieving the best SLA-aware performance per dollar for applications hosted in production data center environments. We present a comprehensive and rigorous analytical cost model for the performance and TCO trade-off based on continuous monitoring of the application's data access profile. Guided by this model, our placement policy takes informed actions to dynamically manage the placement and migration of application data across multiple software-defined compressed tiers. On real-world benchmarks, our solution increases memory TCO savings by 22 to 40 percentage points while maintaining performance parity, or improves performance by 2 to 10 percentage points while maintaining memory TCO parity, compared to state-of-the-art 2-tier solutions.
{"title":"Taming Server Memory TCO with Multiple Software-Defined Compressed Tiers","authors":"Sandeep Kumar, Aravinda Prasad, Sreenivas Subramoney","doi":"arxiv-2404.13886","DOIUrl":"https://doi.org/arxiv-2404.13886","url":null,"abstract":"Memory accounts for 33 - 50% of the total cost of ownership (TCO) in modern\u0000data centers. We propose a novel solution to tame memory TCO through the novel\u0000creation and judicious management of multiple software-defined compressed\u0000memory tiers. As opposed to the state-of-the-art solutions that employ a 2-Tier solution, a\u0000single compressed tier along with DRAM, we define multiple compressed tiers\u0000implemented through a combination of different compression algorithms, memory\u0000allocators for compressed objects, and backing media to store compressed\u0000objects. These compressed memory tiers represent distinct points in the access\u0000latency, data compressibility, and unit memory usage cost spectrum, allowing\u0000rich and flexible trade-offs between memory TCO savings and application\u0000performance impact. A key advantage with ntier is that it enables aggressive\u0000memory TCO saving opportunities by placing warm data in low latency compressed\u0000tiers with a reasonable performance impact while simultaneously placing cold\u0000data in the best memory TCO saving tiers. We believe our work represents an\u0000important server system configuration and optimization capability to achieve\u0000the best SLA-aware performance per dollar for applications hosted in production\u0000data center environments. We present a comprehensive and rigorous analytical cost model for performance\u0000and TCO trade-off based on continuous monitoring of the application's data\u0000access profile. Guided by this model, our placement model takes informed\u0000actions to dynamically manage the placement and migration of application data\u0000across multiple software-defined compressed tiers. On real-world benchmarks,\u0000our solution increases memory TCO savings by 22% - 40% percentage points while\u0000maintaining performance parity or improves performance by 2% - 10% percentage\u0000points while maintaining memory TCO parity compared to state-of-the-art 2-Tier\u0000solutions.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"34 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140801327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}