Checkpointing (C) and restoring (R) are key components for GPU tasks. POS is an OS-level GPU C/R system: it can transparently checkpoint or restore processes that use the GPU, without requiring any cooperation from the application, a key feature required by modern systems such as the cloud. Moreover, POS is the first OS-level C/R system that can execute C/R concurrently with application execution: a capability that is trivial when processes run only on the CPU, but becomes challenging when they use the GPU. The problem is how to ensure consistency during concurrent execution when transparency leaves the system without application semantics. CPU processes can rely on OS and hardware paging to fix inconsistencies without application semantics; unfortunately, GPUs bypass the OS and paging for high performance. POS fills this semantic gap by speculatively extracting the buffer-access information of GPU kernels at runtime. Thanks to the simple and well-structured nature of GPU kernels, our speculative extraction (with runtime validation) achieves 100% accuracy on applications ranging from training to inference, with domains spanning vision, large language models, and reinforcement learning. Based on the extracted semantics, POS systematically overlaps C/R with application execution and achieves orders-of-magnitude higher performance than the state-of-the-art OS-level GPU C/R across various tasks, including training fault tolerance, live GPU process migration, and cold-start acceleration in GPU-based serverless computing.
{"title":"PARALLELGPUOS: A Concurrent OS-level GPU Checkpoint and Restore System using Validated Speculation","authors":"Zhuobin Huang, Xingda Wei, Yingyi Hao, Rong Chen, Mingcong Han, Jinyu Gu, Haibo Chen","doi":"arxiv-2405.12079","DOIUrl":"https://doi.org/arxiv-2405.12079","url":null,"abstract":"Checkpointing (C) and restoring (R) are key components for GPU tasks. POS is\u0000an OS-level GPU C/R system: It can transparently checkpoint or restore\u0000processes that use the GPU, without requiring any cooperation from the\u0000application, a key feature required by modern systems like the cloud. Moreover,\u0000POS is the first OS-level C/R system that can concurrently execute C/R with the\u0000application execution: a critical feature that can be trivially achieved when\u0000the processes only running on the CPU, but becomes challenging when the\u0000processes use GPU. The problem is how to ensure consistency during concurrent\u0000execution with the lack of application semantics due to transparency. CPU\u0000processes can leverage OS and hardware paging to fix inconsistency without\u0000application semantics. Unfortunately, GPU bypasses OS and paging for high\u0000performance. POS fills the semantic gap by speculatively extracting buffer\u0000access information of GPU kernels during runtime. Thanks to the simple and\u0000well-structured nature of GPU kernels, our speculative extraction (with runtime\u0000validation) achieves 100% accuracy on applications from training to inference\u0000whose domains span from vision, large language models, and reinforcement\u0000learning. Based on the extracted semantics, we systematically overlap C/R with\u0000application execution, and achieves orders of magnitude higher performance\u0000under various tasks compared with the state-of-the-art OS-level GPU C/R,\u0000including training fault tolerance, live GPU process migration, and cold starts\u0000acceleration in GPU-based serverless computing.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"64 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141147277","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bhagyashri Tushir, Vikram K Ramanna, Yuhong Liu, Behnam Dezfouli
Identifying IoT devices is crucial for network monitoring, security enforcement, and inventory tracking. However, most existing identification methods rely on deep packet inspection, which raises privacy concerns and adds computational complexity. More importantly, existing works overlook the impact of wireless channel dynamics on the accuracy of layer-2 features, thereby limiting their effectiveness in real-world scenarios. In this work, we define and use the latency of specific probe-response packet exchanges, referred to as "device latency," as the main feature for device identification. Additionally, we reveal the critical impact of wireless channel dynamics on the accuracy of device identification based on device latency. Specifically, this work introduces "accumulation score" as a novel approach to capturing fine-grained channel dynamics and their impact on device latency when training machine learning models. We implement the proposed methods and measure the accuracy and overhead of device identification in real-world scenarios. The results confirm that by incorporating the accumulation score for balanced data collection and training machine learning algorithms, we achieve an F1 score of over 97% for device identification, even amidst wireless channel dynamics, a significant improvement over the 75% F1 score achieved by disregarding the impact of channel dynamics on data collection and device latency.
{"title":"Leveraging Machine Learning for Accurate IoT Device Identification in Dynamic Wireless Contexts","authors":"Bhagyashri Tushir, Vikram K Ramanna, Yuhong Liu, Behnam Dezfouli","doi":"arxiv-2405.17442","DOIUrl":"https://doi.org/arxiv-2405.17442","url":null,"abstract":"Identifying IoT devices is crucial for network monitoring, security\u0000enforcement, and inventory tracking. However, most existing identification\u0000methods rely on deep packet inspection, which raises privacy concerns and adds\u0000computational complexity. More importantly, existing works overlook the impact\u0000of wireless channel dynamics on the accuracy of layer-2 features, thereby\u0000limiting their effectiveness in real-world scenarios. In this work, we define\u0000and use the latency of specific probe-response packet exchanges, referred to as\u0000\"device latency,\" as the main feature for device identification. Additionally,\u0000we reveal the critical impact of wireless channel dynamics on the accuracy of\u0000device identification based on device latency. Specifically, this work\u0000introduces \"accumulation score\" as a novel approach to capturing fine-grained\u0000channel dynamics and their impact on device latency when training machine\u0000learning models. We implement the proposed methods and measure the accuracy and\u0000overhead of device identification in real-world scenarios. The results confirm\u0000that by incorporating the accumulation score for balanced data collection and\u0000training machine learning algorithms, we achieve an F1 score of over 97% for\u0000device identification, even amidst wireless channel dynamics, a significant\u0000improvement over the 75% F1 score achieved by disregarding the impact of\u0000channel dynamics on data collection and device latency.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"70 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141170877","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient task scheduling in heterogeneous computing environments is imperative for optimizing resource utilization and minimizing task completion times. In this study, we conducted a comprehensive benchmarking analysis to evaluate the performance of four scheduling algorithms: First-Come, First-Served (FCFS), FCFS with No Queuing (FCFS-NQ), Minimum Expected Completion Time (MECT), and Minimum Expected Execution Time (MEET), across varying workload scenarios. We defined three workload scenarios: low, medium, and high, each representing a different level of computational demand. Through rigorous experimentation and analysis, we assessed the effectiveness of each algorithm in terms of total completion percentage, energy consumption, wasted energy, and energy per completion. Our findings highlight the strengths and limitations of each algorithm, with MECT and MEET emerging as robust contenders that dynamically prioritize tasks based on comprehensive estimates of completion and execution times. Furthermore, MECT and MEET exhibit superior energy efficiency compared to FCFS and FCFS-NQ, underscoring their suitability for resource-constrained environments. This study provides valuable insights into the efficacy of task scheduling algorithms in heterogeneous computing environments, enabling informed decision-making to enhance resource allocation, minimize task completion times, and improve energy efficiency.
{"title":"Optimizing Task Scheduling in Heterogeneous Computing Environments: A Comparative Analysis of CPU, GPU, and ASIC Platforms Using E2C Simulator","authors":"Ali Mohammadjafari, Poorya Khajouie","doi":"arxiv-2405.08187","DOIUrl":"https://doi.org/arxiv-2405.08187","url":null,"abstract":"Efficient task scheduling in heterogeneous computing environments is\u0000imperative for optimizing resource utilization and minimizing task completion\u0000times. In this study, we conducted a comprehensive benchmarking analysis to\u0000evaluate the performance of four scheduling algorithms First Come, First-Served\u0000(FCFS), FCFS with No Queuing (FCFS-NQ), Minimum Expected Completion Time\u0000(MECT), and Minimum Expected Execution Time (MEET) across varying workload\u0000scenarios. We defined three workload scenarios: low, medium, and high, each\u0000representing different levels of computational demands. Through rigorous\u0000experimentation and analysis, we assessed the effectiveness of each algorithm\u0000in terms of total completion percentage, energy consumption, wasted energy, and\u0000energy per completion. Our findings highlight the strengths and limitations of\u0000each algorithm, with MECT and MEET emerging as robust contenders, dynamically\u0000prioritizing tasks based on comprehensive estimates of completion and execution\u0000times. Furthermore, MECT and MEET exhibit superior energy efficiency compared\u0000to FCFS and FCFS-NQ, underscoring their suitability for resource-constrained\u0000environments. This study provides valuable insights into the efficacy of task\u0000scheduling algorithms in heterogeneous computing environments, enabling\u0000informed decision-making to enhance resource allocation, minimize task\u0000completion times, and improve energy efficiency","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"71 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141063715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reid Priedhorsky (Los Alamos National Laboratory), Michael Jennings (Los Alamos National Laboratory), Megan Phinney
Do Linux distribution package managers need the privileged operations they request to actually happen? Apparently not, at least for building container images for HPC applications. We use this observation to implement a root emulation mode using a Linux seccomp filter that intercepts some privileged system calls, does nothing, and returns success to the calling program. This approach provides no consistency whatsoever but appears sufficient to build all Dockerfiles we examined, simplifying fully-unprivileged workflows needed for HPC application containers.
{"title":"Zero-consistency root emulation for unprivileged container image build","authors":"Reid PriedhorskyLos Alamos National Laboratory, Michael JenningsLos Alamos National Laboratory, Megan Phinney","doi":"arxiv-2405.06085","DOIUrl":"https://doi.org/arxiv-2405.06085","url":null,"abstract":"Do Linux distribution package managers need the privileged operations they\u0000request to actually happen? Apparently not, at least for building container\u0000images for HPC applications. We use this observation to implement a root\u0000emulation mode using a Linux seccomp filter that intercepts some privileged\u0000system calls, does nothing, and returns success to the calling program. This\u0000approach provides no consistency whatsoever but appears sufficient to build all\u0000Dockerfiles we examined, simplifying fully-unprivileged workflows needed for\u0000HPC application containers.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140929077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Maxime Letemple, Gaulthier Gain, Sami Ben Mariem, Laurent Mathy, Benoit Donnet
The last twenty years have seen the development and growing popularity of network measurement infrastructures. Internet measurement platforms have become common and have demonstrated their relevance for Internet understanding and security observation. However, despite their popularity, those platforms lack flexibility and reactivity, as they are usually used for longitudinal measurements. As a consequence, they may miss security- or Internet-related events. During the same period, operating systems have evolved toward virtual machines (VMs) as self-contained units for running applications, with the recent rise of unikernels, ultra-lightweight VMs tailored for specific applications that eliminate the need for a host OS. In this paper, we advocate that measurement infrastructures could take advantage of unikernels to become more flexible and efficient. We propose uTNT, a proof-of-concept unikernel-based implementation of TNT, a traceroute extension able to reveal MPLS tunnels. This paper documents the full toolchain for porting TNT into a unikernel and evaluates uTNT's performance with respect to more traditional approaches. The paper also discusses a use case for which uTNT is particularly well suited. The uTNT source code is publicly available on GitLab.
{"title":"uTNT: Unikernels for Efficient and Flexible Internet Probing","authors":"Maxime Letemple, Gaulthier Gain, Sami Ben Mariem, Laurent Mathy, Benoit Donnet","doi":"arxiv-2405.04036","DOIUrl":"https://doi.org/arxiv-2405.04036","url":null,"abstract":"The last twenty years have seen the development and popularity of network\u0000measurement infrastructures. Internet measurement platforms have become common\u0000and have demonstrated their relevance in Internet understanding and security\u0000observation. However, despite their popularity, those platforms lack of\u0000flexibility and reactivity, as they are usually used for longitudinal\u0000measurements. As a consequence, they may miss detecting events that are\u0000security or Internet-related. During the same period, operating systems have\u0000evolved to virtual machines (VMs) as self-contained units for running\u0000applications, with the recent rise of unikernels, ultra-lightweight VMs\u0000tailored for specific applications, eliminating the need for a host OS. In this\u0000paper, we advocate that measurement infrastructures could take advantage of\u0000unikernels to become more flexible and efficient. We propose uTNT, a\u0000proof-of-concept unikernel-based implementation of TNT, a traceroute extension\u0000able to reveal MPLS tunnels. This paper documents the full toolchain for\u0000porting TNT into a unikernel and evaluates uTNT performance with respect to\u0000more traditional approaches. The paper also discusses a use case in which uTNT\u0000could find a suitable usage. uTNT source code is publicly available on Gitlab.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140928991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient use of GPU memory is essential for high-throughput LLM inference. Prior systems reserved memory for the KV-cache ahead of time, resulting in wasted capacity due to internal fragmentation. Inspired by OS-based virtual memory systems, vLLM proposed PagedAttention to enable dynamic memory allocation for the KV-cache. This approach eliminates fragmentation, enabling high-throughput LLM serving with larger batch sizes. However, to be able to allocate physical memory dynamically, PagedAttention changes the layout of the KV-cache from contiguous virtual memory to non-contiguous virtual memory. This change requires attention kernels to be rewritten to support paging, and the serving framework to implement a memory manager. Thus, the PagedAttention model leads to software complexity, portability issues, redundancy, and inefficiency. In this paper, we propose vAttention for dynamic KV-cache memory management. In contrast to PagedAttention, vAttention retains the KV-cache in contiguous virtual memory and leverages existing low-level system support for demand paging to enable on-demand physical memory allocation. Thus, vAttention unburdens the attention kernel developer from having to explicitly support paging and avoids re-implementation of memory management in the serving framework. We show that vAttention enables seamless dynamic memory management for unchanged implementations of various attention kernels. vAttention also generates tokens up to 1.97x faster than vLLM, while processing input prompts up to 3.92x and 1.45x faster than the PagedAttention variants of FlashAttention and FlashInfer.
{"title":"vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention","authors":"Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar","doi":"arxiv-2405.04437","DOIUrl":"https://doi.org/arxiv-2405.04437","url":null,"abstract":"Efficient use of GPU memory is essential for high throughput LLM inference.\u0000Prior systems reserved memory for the KV-cache ahead-of-time, resulting in\u0000wasted capacity due to internal fragmentation. Inspired by OS-based virtual\u0000memory systems, vLLM proposed PagedAttention to enable dynamic memory\u0000allocation for KV-cache. This approach eliminates fragmentation, enabling\u0000high-throughput LLM serving with larger batch sizes. However, to be able to\u0000allocate physical memory dynamically, PagedAttention changes the layout of\u0000KV-cache from contiguous virtual memory to non-contiguous virtual memory. This\u0000change requires attention kernels to be rewritten to support paging, and\u0000serving framework to implement a memory manager. Thus, the PagedAttention model\u0000leads to software complexity, portability issues, redundancy and inefficiency. In this paper, we propose vAttention for dynamic KV-cache memory management.\u0000In contrast to PagedAttention, vAttention retains KV-cache in contiguous\u0000virtual memory and leverages low-level system support for demand paging, that\u0000already exists, to enable on-demand physical memory allocation. Thus,\u0000vAttention unburdens the attention kernel developer from having to explicitly\u0000support paging and avoids re-implementation of memory management in the serving\u0000framework. We show that vAttention enables seamless dynamic memory management\u0000for unchanged implementations of various attention kernels. vAttention also\u0000generates tokens up to 1.97x faster than vLLM, while processing input prompts\u0000up to 3.92x and 1.45x faster than the PagedAttention variants of FlashAttention\u0000and FlashInfer.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140928988","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The commonly used caching policies, such as LRU or LFU, exhibit optimal performance only for specific traffic patterns. Even advanced Machine Learning-based methods, which detect patterns in historical request data, struggle when future requests deviate from past trends. Recently, a new class of policies has emerged that makes no assumptions about the request arrival process. These algorithms solve an online optimization problem, enabling continuous adaptation to the context. They offer theoretical guarantees on the regret metric, which is the gap between the gain of the online policy and the gain of the optimal static cache allocation in hindsight. Nevertheless, the high computational complexity of these solutions hinders their practical adoption. In this study, we introduce a groundbreaking gradient-based online caching policy, the first to achieve logarithmic computational complexity relative to catalog size along with regret guarantees. This means our algorithm can efficiently handle large-scale data while minimizing the performance gap between real-time decisions and optimal hindsight choices. As requests arrive, our policy dynamically adjusts the probabilities of including items in the cache, which drive cache update decisions. Our algorithm's streamlined complexity is a key advantage, enabling its application to real-world traces featuring millions of requests and items. This is a significant achievement, as traces of this scale have been out of reach for existing policies with regret guarantees. To the best of our knowledge, our experimental results show for the first time that the regret guarantees of gradient-based caching policies bring significant benefits in scenarios of practical interest.
{"title":"An Online Gradient-Based Caching Policy with Logarithmic Complexity and Regret Guarantees","authors":"Damiano Carra, Giovanni Neglia","doi":"arxiv-2405.01263","DOIUrl":"https://doi.org/arxiv-2405.01263","url":null,"abstract":"The commonly used caching policies, such as LRU or LFU, exhibit optimal\u0000performance only for specific traffic patterns. Even advanced Machine\u0000Learning-based methods, which detect patterns in historical request data,\u0000struggle when future requests deviate from past trends. Recently, a new class\u0000of policies has emerged that makes no assumptions about the request arrival\u0000process. These algorithms solve an online optimization problem, enabling\u0000continuous adaptation to the context. They offer theoretical guarantees on the\u0000regret metric, which is the gap between the gain of the online policy and the\u0000gain of the optimal static cache allocation in hindsight. Nevertheless, the\u0000high computational complexity of these solutions hinders their practical\u0000adoption. In this study, we introduce a groundbreaking gradient-based online\u0000caching policy, the first to achieve logarithmic computational complexity\u0000relative to catalog size along with regret guarantees. This means our algorithm\u0000can efficiently handle large-scale data while minimizing the performance gap\u0000between real-time decisions and optimal hindsight choices. As requests arrive,\u0000our policy dynamically adjusts the probabilities of including items in the\u0000cache, which drive cache update decisions. Our algorithm's streamlined\u0000complexity is a key advantage, enabling its application to real-world traces\u0000featuring millions of requests and items. This is a significant achievement, as\u0000traces of this scale have been out of reach for existing policies with regret\u0000guarantees. To the best of our knowledge, our experimental results show for the\u0000first time that the regret guarantees of gradient-based caching policies bring\u0000significant benefits in scenarios of practical interest.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"837 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140840024","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Luis Gerhorst, Henriette Herzog, Peter Wägemann, Maximilian Ott, Rüdiger Kapitza, Timo Hönig
High-performance I/O demands low-overhead communication between user and kernel space. This demand can no longer be fulfilled by traditional system calls. Linux's extended Berkeley Packet Filter (BPF) avoids user/kernel transitions by just-in-time compiling user-provided bytecode and executing it in kernel mode at near-native speed. To still isolate BPF programs from the kernel, they are statically analyzed for memory- and type-safety, which imposes some restrictions but allows for good expressiveness and high performance. However, to mitigate the Spectre vulnerabilities disclosed in 2018, defenses that reject potentially dangerous programs had to be deployed. We find that this affects 24% to 54% of programs in a dataset of 844 real-world BPF programs from popular open-source projects. To work around this, users are forced to disable the defenses to continue using the programs, which puts the entire system at risk. To enable secure and expressive untrusted Linux kernel extensions, we propose Berrify, an enhancement to the kernel's Spectre defenses that reduces the number of rejected BPF application programs from 54% to zero. We measure Berrify's overhead for all mainstream performance-sensitive applications of BPF (i.e., event tracing, profiling, and packet processing) and find that it improves significantly upon the status quo, in which affected BPF programs are either unusable or enable transient-execution attacks on the kernel.
{"title":"Mitigating Spectre-PHT using Speculation Barriers in Linux BPF","authors":"Luis Gerhorst, Henriette Herzog, Peter Wägemann, Maximilian Ott, Rüdiger Kapitza, Timo Hönig","doi":"arxiv-2405.00078","DOIUrl":"https://doi.org/arxiv-2405.00078","url":null,"abstract":"High-performance IO demands low-overhead communication between user- and\u0000kernel space. This demand can no longer be fulfilled by traditional system\u0000calls. Linux's extended Berkeley Packet Filter (BPF) avoids user-/kernel\u0000transitions by just-in-time compiling user-provided bytecode and executing it\u0000in kernel mode with near-native speed. To still isolate BPF programs from the\u0000kernel, they are statically analyzed for memory- and type-safety, which imposes\u0000some restrictions but allows for good expressiveness and high performance.\u0000However, to mitigate the Spectre vulnerabilities disclosed in 2018, defenses\u0000which reject potentially-dangerous programs had to be deployed. We find that\u0000this affects 24% to 54% of programs in a dataset with 844 real-world BPF\u0000programs from popular open-source projects. To solve this, users are forced to\u0000disable the defenses to continue using the programs, which puts the entire\u0000system at risk. To enable secure and expressive untrusted Linux kernel extensions, we propose\u0000Berrify, an enhancement to the kernel's Spectre defenses that reduces the\u0000number of BPF application programs rejected from 54% to zero. We measure\u0000Berrify's overhead for all mainstream performance-sensitive applications of BPF\u0000(i.e., event tracing, profiling, and packet processing) and find that it\u0000improves significantly upon the status-quo where affected BPF programs are\u0000either unusable or enable transient execution attacks on the kernel.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140840091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lazar Cvetković, François Costa, Mihajlo Djokic, Michal Friedman, Ana Klimovic
While Function as a Service (FaaS) platforms can initialize function sandboxes on worker nodes in 10-100s of milliseconds, the latency to schedule functions in real FaaS clusters can be orders of magnitude higher. We find that the current approach of building FaaS cluster managers on top of legacy orchestration systems like Kubernetes leads to high scheduling delay at high sandbox churn, which is typical in FaaS clusters. While generic cluster managers use hierarchical abstractions and multiple internal components to manage and reconcile state with frequent persistent updates, this becomes a bottleneck for FaaS, where cluster state frequently changes as sandboxes are created on the critical path of requests. Based on our root cause analysis of performance issues in existing FaaS cluster managers, we propose Dirigent, a clean-slate system architecture for FaaS orchestration with three key principles. First, Dirigent optimizes internal cluster manager abstractions to simplify state management. Second, it eliminates persistent state updates on the critical path of function invocations, leveraging the fact that FaaS abstracts sandboxes from users to relax exact state reconstruction guarantees. Finally, Dirigent runs monolithic control and data planes to minimize internal communication overheads and maximize throughput. We compare Dirigent to state-of-the-art FaaS platforms and show that Dirigent reduces 99th percentile per-function scheduling latency for a production workload by 2.79x compared to AWS Lambda and can spin up 2500 sandboxes per second at low latency, which is 1250x more than with Knative.
{"title":"Dirigent: Lightweight Serverless Orchestration","authors":"Lazar Cvetković, François Costa, Mihajlo Djokic, Michal Friedman, Ana Klimovic","doi":"arxiv-2404.16393","DOIUrl":"https://doi.org/arxiv-2404.16393","url":null,"abstract":"While Function as a Service (FaaS) platforms can initialize function\u0000sandboxes on worker nodes in 10-100s of milliseconds, the latency to schedule\u0000functions in real FaaS clusters can be orders of magnitude higher. We find that\u0000the current approach of building FaaS cluster managers on top of legacy\u0000orchestration systems like Kubernetes leads to high scheduling delay at high\u0000sandbox churn, which is typical in FaaS clusters. While generic cluster\u0000managers use hierarchical abstractions and multiple internal components to\u0000manage and reconcile state with frequent persistent updates, this becomes a\u0000bottleneck for FaaS, where cluster state frequently changes as sandboxes are\u0000created on the critical path of requests. Based on our root cause analysis of\u0000performance issues in existing FaaS cluster managers, we propose Dirigent, a\u0000clean-slate system architecture for FaaS orchestration with three key\u0000principles. First, Dirigent optimizes internal cluster manager abstractions to\u0000simplify state management. Second, it eliminates persistent state updates on\u0000the critical path of function invocations, leveraging the fact that FaaS\u0000abstracts sandboxes from users to relax exact state reconstruction guarantees.\u0000Finally, Dirigent runs monolithic control and data planes to minimize internal\u0000communication overheads and maximize throughput. We compare Dirigent to\u0000state-of-the-art FaaS platforms and show that Dirigent reduces 99th percentile\u0000per-function scheduling latency for a production workload by 2.79x compared to\u0000AWS Lambda and can spin up 2500 sandboxes per second at low latency, which is\u00001250x more than with Knative.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"244 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140801329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Memory accounts for 33 - 50% of the total cost of ownership (TCO) in modern data centers. We propose a novel solution to tame memory TCO through the creation and judicious management of multiple software-defined compressed memory tiers. In contrast to state-of-the-art solutions that employ a 2-tier approach (a single compressed tier alongside DRAM), we define multiple compressed tiers implemented through a combination of different compression algorithms, memory allocators for compressed objects, and backing media to store compressed objects. These compressed memory tiers represent distinct points in the spectrum of access latency, data compressibility, and unit memory usage cost, allowing rich and flexible trade-offs between memory TCO savings and application performance impact. A key advantage of ntier is that it enables aggressive memory TCO savings by placing warm data in low-latency compressed tiers with a reasonable performance impact while simultaneously placing cold data in the tiers with the best memory TCO savings. We believe our work represents an important server-system configuration and optimization capability for achieving the best SLA-aware performance per dollar for applications hosted in production data center environments. We present a comprehensive and rigorous analytical cost model for the performance and TCO trade-off based on continuous monitoring of the application's data access profile. Guided by this model, our placement policy takes informed actions to dynamically manage the placement and migration of application data across multiple software-defined compressed tiers. On real-world benchmarks, our solution increases memory TCO savings by 22 to 40 percentage points while maintaining performance parity, or improves performance by 2 to 10 percentage points while maintaining memory TCO parity, compared to state-of-the-art 2-tier solutions.
{"title":"Taming Server Memory TCO with Multiple Software-Defined Compressed Tiers","authors":"Sandeep Kumar, Aravinda Prasad, Sreenivas Subramoney","doi":"arxiv-2404.13886","DOIUrl":"https://doi.org/arxiv-2404.13886","url":null,"abstract":"Memory accounts for 33 - 50% of the total cost of ownership (TCO) in modern\u0000data centers. We propose a novel solution to tame memory TCO through the novel\u0000creation and judicious management of multiple software-defined compressed\u0000memory tiers. As opposed to the state-of-the-art solutions that employ a 2-Tier solution, a\u0000single compressed tier along with DRAM, we define multiple compressed tiers\u0000implemented through a combination of different compression algorithms, memory\u0000allocators for compressed objects, and backing media to store compressed\u0000objects. These compressed memory tiers represent distinct points in the access\u0000latency, data compressibility, and unit memory usage cost spectrum, allowing\u0000rich and flexible trade-offs between memory TCO savings and application\u0000performance impact. A key advantage with ntier is that it enables aggressive\u0000memory TCO saving opportunities by placing warm data in low latency compressed\u0000tiers with a reasonable performance impact while simultaneously placing cold\u0000data in the best memory TCO saving tiers. We believe our work represents an\u0000important server system configuration and optimization capability to achieve\u0000the best SLA-aware performance per dollar for applications hosted in production\u0000data center environments. We present a comprehensive and rigorous analytical cost model for performance\u0000and TCO trade-off based on continuous monitoring of the application's data\u0000access profile. Guided by this model, our placement model takes informed\u0000actions to dynamically manage the placement and migration of application data\u0000across multiple software-defined compressed tiers. On real-world benchmarks,\u0000our solution increases memory TCO savings by 22% - 40% percentage points while\u0000maintaining performance parity or improves performance by 2% - 10% percentage\u0000points while maintaining memory TCO parity compared to state-of-the-art 2-Tier\u0000solutions.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"34 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140801327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}