{"title":"vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention","authors":"Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar","doi":"arxiv-2405.04437","DOIUrl":null,"url":null,"abstract":"Efficient use of GPU memory is essential for high throughput LLM inference.\nPrior systems reserved memory for the KV-cache ahead-of-time, resulting in\nwasted capacity due to internal fragmentation. Inspired by OS-based virtual\nmemory systems, vLLM proposed PagedAttention to enable dynamic memory\nallocation for KV-cache. This approach eliminates fragmentation, enabling\nhigh-throughput LLM serving with larger batch sizes. However, to be able to\nallocate physical memory dynamically, PagedAttention changes the layout of\nKV-cache from contiguous virtual memory to non-contiguous virtual memory. This\nchange requires attention kernels to be rewritten to support paging, and\nserving framework to implement a memory manager. Thus, the PagedAttention model\nleads to software complexity, portability issues, redundancy and inefficiency. In this paper, we propose vAttention for dynamic KV-cache memory management.\nIn contrast to PagedAttention, vAttention retains KV-cache in contiguous\nvirtual memory and leverages low-level system support for demand paging, that\nalready exists, to enable on-demand physical memory allocation. Thus,\nvAttention unburdens the attention kernel developer from having to explicitly\nsupport paging and avoids re-implementation of memory management in the serving\nframework. We show that vAttention enables seamless dynamic memory management\nfor unchanged implementations of various attention kernels. vAttention also\ngenerates tokens up to 1.97x faster than vLLM, while processing input prompts\nup to 3.92x and 1.45x faster than the PagedAttention variants of FlashAttention\nand FlashInfer.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"15 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Operating Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2405.04437","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Efficient use of GPU memory is essential for high-throughput LLM inference. Prior systems reserved memory for the KV-cache ahead of time, wasting capacity through internal fragmentation. Inspired by OS-based virtual memory systems, vLLM proposed PagedAttention to enable dynamic memory allocation for the KV-cache. This approach eliminates fragmentation and enables high-throughput LLM serving with larger batch sizes. However, to allocate physical memory dynamically, PagedAttention changes the layout of the KV-cache from contiguous to non-contiguous virtual memory. This change requires attention kernels to be rewritten to support paging, and the serving framework to implement a memory manager. The PagedAttention model thus leads to software complexity, portability issues, redundancy, and inefficiency.
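To make concrete why the non-contiguous layout forces kernel changes, consider how a kernel locates a token's key vector. The fragment below is a hedged sketch, not vLLM's actual code: the names block_table and block_size, and the cache layouts, are illustrative assumptions. With paging, every access must pass through a block-table indirection that a kernel written for contiguous memory simply does not have.

```cuda
// Simplified sketch: addressing one head's key vector for a given token.

// Contiguous layout: plain pointer arithmetic suffices.
__device__ const float* key_ptr_contiguous(
    const float* k_cache,   // [max_seq_len, num_heads, head_dim]
    int token_idx, int head_idx, int num_heads, int head_dim) {
  return k_cache + ((size_t)token_idx * num_heads + head_idx) * head_dim;
}

// Paged (PagedAttention-style) layout: the kernel must first translate the
// logical token index through a per-sequence block table, so existing
// kernels have to be rewritten to carry and dereference this indirection.
__device__ const float* key_ptr_paged(
    const float* k_cache,     // [num_blocks, block_size, num_heads, head_dim]
    const int* block_table,   // logical block -> physical block for this sequence
    int token_idx, int head_idx,
    int num_heads, int head_dim, int block_size) {
  int physical_block = block_table[token_idx / block_size];
  int block_offset   = token_idx % block_size;
  return k_cache + (((size_t)physical_block * block_size + block_offset)
                    * num_heads + head_idx) * head_dim;
}
```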
In this paper, we propose vAttention for dynamic KV-cache memory management. In contrast to PagedAttention, vAttention retains the KV-cache in contiguous virtual memory and leverages existing low-level system support for demand paging to enable on-demand physical memory allocation. vAttention thus relieves attention kernel developers of having to explicitly support paging and avoids re-implementing memory management in the serving framework. We show that vAttention enables seamless dynamic memory management for unchanged implementations of various attention kernels. vAttention also generates tokens up to 1.97x faster than vLLM, while processing input prompts up to 3.92x and 1.45x faster than the PagedAttention variants of FlashAttention and FlashInfer, respectively.
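One existing form of such low-level support is the CUDA virtual memory management driver API, which decouples virtual address reservation from physical memory allocation. The sketch below is a minimal illustration of that idea under our own assumptions, not vAttention's implementation: it reserves a contiguous virtual range for the KV-cache up front and maps physical pages only as the sequence grows. Error handling is omitted, and GrowableKVCache and its methods are hypothetical names.

```cuda
#include <cuda.h>
#include <cstddef>

// Minimal sketch: contiguous virtual KV-cache with on-demand physical backing.
// Assumes a CUDA context is already current on the target device.
struct GrowableKVCache {
  CUdeviceptr base = 0;     // start of the contiguous virtual reservation
  size_t reserved = 0;      // total virtual bytes reserved up front
  size_t mapped = 0;        // physical bytes mapped (usable) so far
  size_t granularity = 0;   // physical allocation granularity (page size)
  CUmemAllocationProp prop{};

  void init(size_t max_bytes, int device = 0) {
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = device;
    cuMemGetAllocationGranularity(&granularity, &prop,
                                  CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    reserved = ((max_bytes + granularity - 1) / granularity) * granularity;
    // Reserve virtual address space only; no physical memory is consumed yet.
    cuMemAddressReserve(&base, reserved, 0 /*alignment*/, 0 /*fixed addr*/, 0);
  }

  // Ensure at least `needed_bytes` of the reservation are physically backed.
  void grow_to(size_t needed_bytes) {
    while (mapped < needed_bytes && mapped < reserved) {
      CUmemGenericAllocationHandle handle;
      cuMemCreate(&handle, granularity, &prop, 0);         // one physical page
      cuMemMap(base + mapped, granularity, 0, handle, 0);  // map into the range
      CUmemAccessDesc access{};
      access.location = prop.location;
      access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
      cuMemSetAccess(base + mapped, granularity, &access, 1);
      cuMemRelease(handle);  // the mapping keeps the physical page alive
      mapped += granularity;
    }
  }
};
```

Because the reserved range is contiguous, attention kernels can keep computing addresses with plain pointer arithmetic over `base`, while physical memory consumption still grows only with the actual sequence length.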