{"title":"vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention","authors":"Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar","doi":"arxiv-2405.04437","DOIUrl":null,"url":null,"abstract":"Efficient use of GPU memory is essential for high throughput LLM inference.\nPrior systems reserved memory for the KV-cache ahead-of-time, resulting in\nwasted capacity due to internal fragmentation. Inspired by OS-based virtual\nmemory systems, vLLM proposed PagedAttention to enable dynamic memory\nallocation for KV-cache. This approach eliminates fragmentation, enabling\nhigh-throughput LLM serving with larger batch sizes. However, to be able to\nallocate physical memory dynamically, PagedAttention changes the layout of\nKV-cache from contiguous virtual memory to non-contiguous virtual memory. This\nchange requires attention kernels to be rewritten to support paging, and\nserving framework to implement a memory manager. Thus, the PagedAttention model\nleads to software complexity, portability issues, redundancy and inefficiency. In this paper, we propose vAttention for dynamic KV-cache memory management.\nIn contrast to PagedAttention, vAttention retains KV-cache in contiguous\nvirtual memory and leverages low-level system support for demand paging, that\nalready exists, to enable on-demand physical memory allocation. Thus,\nvAttention unburdens the attention kernel developer from having to explicitly\nsupport paging and avoids re-implementation of memory management in the serving\nframework. We show that vAttention enables seamless dynamic memory management\nfor unchanged implementations of various attention kernels. vAttention also\ngenerates tokens up to 1.97x faster than vLLM, while processing input prompts\nup to 3.92x and 1.45x faster than the PagedAttention variants of FlashAttention\nand FlashInfer.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"15 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Operating Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2405.04437","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Efficient use of GPU memory is essential for high-throughput LLM inference. Prior systems reserved memory for the KV-cache ahead of time, wasting capacity through internal fragmentation. Inspired by OS-based virtual memory systems, vLLM proposed PagedAttention to enable dynamic memory allocation for the KV-cache. This approach eliminates fragmentation and enables high-throughput LLM serving with larger batch sizes. However, to allocate physical memory dynamically, PagedAttention changes the layout of the KV-cache from contiguous to non-contiguous virtual memory. This change requires attention kernels to be rewritten to support paging, and the serving framework to implement a memory manager. The PagedAttention model thus leads to software complexity, portability issues, redundancy, and inefficiency.
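To make concrete why the non-contiguous layout forces kernel changes, consider how a kernel locates a token's key vector. The fragment below is a hedged sketch, not vLLM's actual code: the names block_table and block_size, and the cache layouts, are illustrative assumptions. With paging, every access must pass through a block-table indirection that a kernel written for contiguous memory simply does not have.

```cuda
// Simplified sketch: addressing one head's key vector for a given token.

// Contiguous layout: plain pointer arithmetic suffices.
__device__ const float* key_ptr_contiguous(
    const float* k_cache,   // [max_seq_len, num_heads, head_dim]
    int token_idx, int head_idx, int num_heads, int head_dim) {
  return k_cache + ((size_t)token_idx * num_heads + head_idx) * head_dim;
}

// Paged (PagedAttention-style) layout: the kernel must first translate the
// logical token index through a per-sequence block table, so existing
// kernels have to be rewritten to carry and dereference this indirection.
__device__ const float* key_ptr_paged(
    const float* k_cache,     // [num_blocks, block_size, num_heads, head_dim]
    const int* block_table,   // logical block -> physical block for this sequence
    int token_idx, int head_idx,
    int num_heads, int head_dim, int block_size) {
  int physical_block = block_table[token_idx / block_size];
  int block_offset   = token_idx % block_size;
  return k_cache + (((size_t)physical_block * block_size + block_offset)
                    * num_heads + head_idx) * head_dim;
}
```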
In this paper, we propose vAttention for dynamic KV-cache memory management. In contrast to PagedAttention, vAttention retains the KV-cache in contiguous virtual memory and leverages existing low-level system support for demand paging to enable on-demand physical memory allocation. vAttention thus relieves attention kernel developers of having to explicitly support paging and avoids re-implementing memory management in the serving framework. We show that vAttention enables seamless dynamic memory management for unchanged implementations of various attention kernels. vAttention also generates tokens up to 1.97x faster than vLLM, while processing input prompts up to 3.92x and 1.45x faster than the PagedAttention variants of FlashAttention and FlashInfer, respectively.
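One existing form of such low-level support is the CUDA virtual memory management driver API, which decouples virtual address reservation from physical memory allocation. The sketch below is a minimal illustration of that idea under our own assumptions, not vAttention's implementation: it reserves a contiguous virtual range for the KV-cache up front and maps physical pages only as the sequence grows. Error handling is omitted, and GrowableKVCache and its methods are hypothetical names.

```cuda
#include <cuda.h>
#include <cstddef>

// Minimal sketch: contiguous virtual KV-cache with on-demand physical backing.
// Assumes a CUDA context is already current on the target device.
struct GrowableKVCache {
  CUdeviceptr base = 0;     // start of the contiguous virtual reservation
  size_t reserved = 0;      // total virtual bytes reserved up front
  size_t mapped = 0;        // physical bytes mapped (usable) so far
  size_t granularity = 0;   // physical allocation granularity (page size)
  CUmemAllocationProp prop{};

  void init(size_t max_bytes, int device = 0) {
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = device;
    cuMemGetAllocationGranularity(&granularity, &prop,
                                  CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    reserved = ((max_bytes + granularity - 1) / granularity) * granularity;
    // Reserve virtual address space only; no physical memory is consumed yet.
    cuMemAddressReserve(&base, reserved, 0 /*alignment*/, 0 /*fixed addr*/, 0);
  }

  // Ensure at least `needed_bytes` of the reservation are physically backed.
  void grow_to(size_t needed_bytes) {
    while (mapped < needed_bytes && mapped < reserved) {
      CUmemGenericAllocationHandle handle;
      cuMemCreate(&handle, granularity, &prop, 0);         // one physical page
      cuMemMap(base + mapped, granularity, 0, handle, 0);  // map into the range
      CUmemAccessDesc access{};
      access.location = prop.location;
      access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
      cuMemSetAccess(base + mapped, granularity, &access, 1);
      cuMemRelease(handle);  // the mapping keeps the physical page alive
      mapped += granularity;
    }
  }
};
```

Because the reserved range is contiguous, attention kernels can keep computing addresses with plain pointer arithmetic over `base`, while physical memory consumption still grows only with the actual sequence length.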