vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention

Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar
{"title":"vAttention:无需 PagedAttention 即可为 LLM 服务的动态内存管理","authors":"Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar","doi":"arxiv-2405.04437","DOIUrl":null,"url":null,"abstract":"Efficient use of GPU memory is essential for high throughput LLM inference.\nPrior systems reserved memory for the KV-cache ahead-of-time, resulting in\nwasted capacity due to internal fragmentation. Inspired by OS-based virtual\nmemory systems, vLLM proposed PagedAttention to enable dynamic memory\nallocation for KV-cache. This approach eliminates fragmentation, enabling\nhigh-throughput LLM serving with larger batch sizes. However, to be able to\nallocate physical memory dynamically, PagedAttention changes the layout of\nKV-cache from contiguous virtual memory to non-contiguous virtual memory. This\nchange requires attention kernels to be rewritten to support paging, and\nserving framework to implement a memory manager. Thus, the PagedAttention model\nleads to software complexity, portability issues, redundancy and inefficiency. In this paper, we propose vAttention for dynamic KV-cache memory management.\nIn contrast to PagedAttention, vAttention retains KV-cache in contiguous\nvirtual memory and leverages low-level system support for demand paging, that\nalready exists, to enable on-demand physical memory allocation. Thus,\nvAttention unburdens the attention kernel developer from having to explicitly\nsupport paging and avoids re-implementation of memory management in the serving\nframework. We show that vAttention enables seamless dynamic memory management\nfor unchanged implementations of various attention kernels. vAttention also\ngenerates tokens up to 1.97x faster than vLLM, while processing input prompts\nup to 3.92x and 1.45x faster than the PagedAttention variants of FlashAttention\nand FlashInfer.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"15 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention\",\"authors\":\"Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar\",\"doi\":\"arxiv-2405.04437\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Efficient use of GPU memory is essential for high throughput LLM inference.\\nPrior systems reserved memory for the KV-cache ahead-of-time, resulting in\\nwasted capacity due to internal fragmentation. Inspired by OS-based virtual\\nmemory systems, vLLM proposed PagedAttention to enable dynamic memory\\nallocation for KV-cache. This approach eliminates fragmentation, enabling\\nhigh-throughput LLM serving with larger batch sizes. However, to be able to\\nallocate physical memory dynamically, PagedAttention changes the layout of\\nKV-cache from contiguous virtual memory to non-contiguous virtual memory. This\\nchange requires attention kernels to be rewritten to support paging, and\\nserving framework to implement a memory manager. Thus, the PagedAttention model\\nleads to software complexity, portability issues, redundancy and inefficiency. In this paper, we propose vAttention for dynamic KV-cache memory management.\\nIn contrast to PagedAttention, vAttention retains KV-cache in contiguous\\nvirtual memory and leverages low-level system support for demand paging, that\\nalready exists, to enable on-demand physical memory allocation. 
Thus,\\nvAttention unburdens the attention kernel developer from having to explicitly\\nsupport paging and avoids re-implementation of memory management in the serving\\nframework. We show that vAttention enables seamless dynamic memory management\\nfor unchanged implementations of various attention kernels. vAttention also\\ngenerates tokens up to 1.97x faster than vLLM, while processing input prompts\\nup to 3.92x and 1.45x faster than the PagedAttention variants of FlashAttention\\nand FlashInfer.\",\"PeriodicalId\":501333,\"journal\":{\"name\":\"arXiv - CS - Operating Systems\",\"volume\":\"15 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-05-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Operating Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2405.04437\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Operating Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2405.04437","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Efficient use of GPU memory is essential for high-throughput LLM inference. Prior systems reserved memory for the KV-cache ahead of time, wasting capacity due to internal fragmentation. Inspired by OS-based virtual memory systems, vLLM proposed PagedAttention to enable dynamic memory allocation for the KV-cache. This approach eliminates fragmentation, enabling high-throughput LLM serving with larger batch sizes. However, to allocate physical memory dynamically, PagedAttention changes the layout of the KV-cache from contiguous to non-contiguous virtual memory. This change requires attention kernels to be rewritten to support paging, and the serving framework to implement a memory manager. The PagedAttention model thus leads to software complexity, portability issues, redundancy, and inefficiency. In this paper, we propose vAttention for dynamic KV-cache memory management. In contrast to PagedAttention, vAttention retains the KV-cache in contiguous virtual memory and leverages existing low-level system support for demand paging to enable on-demand physical memory allocation. vAttention thus unburdens attention kernel developers from having to explicitly support paging and avoids re-implementing memory management in the serving framework. We show that vAttention enables seamless dynamic memory management for unchanged implementations of various attention kernels. vAttention also generates tokens up to 1.97x faster than vLLM, while processing input prompts up to 3.92x and 1.45x faster than the PagedAttention variants of FlashAttention and FlashInfer, respectively.
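To make the underlying idea concrete, the sketch below (an illustration of the general approach, not the paper's implementation) reserves a contiguous virtual address range for one request's KV-cache using CUDA's virtual memory management driver API (cuMemAddressReserve, cuMemCreate, cuMemMap) and backs it with physical memory only as decoding needs it. The buffer cap, the page-at-a-time growth policy, and the helper names (CHECK, grow_kv_cache) are assumptions made for illustration.

```cpp
// Minimal sketch: contiguous virtual reservation + on-demand physical mapping
// for a KV-cache region, via the CUDA driver's virtual memory management API.
// Sizes and growth policy are illustrative only.
#include <cuda.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

#define CHECK(call)                                                     \
    do {                                                                \
        CUresult rc = (call);                                           \
        if (rc != CUDA_SUCCESS) {                                       \
            fprintf(stderr, "CUDA error %d at %s:%d\n", (int)rc,        \
                    __FILE__, __LINE__);                                \
            exit(1);                                                    \
        }                                                               \
    } while (0)

int main() {
    CHECK(cuInit(0));
    CUdevice dev;
    CUcontext ctx;
    CHECK(cuDeviceGet(&dev, 0));
    CHECK(cuCtxCreate(&ctx, 0, dev));

    // Physical allocations will come from this GPU.
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;

    size_t granularity = 0;
    CHECK(cuMemGetAllocationGranularity(&granularity, &prop,
                                        CU_MEM_ALLOC_GRANULARITY_MINIMUM));

    // Reserve virtual address space for the maximum context length up front;
    // no physical memory is committed yet.
    const size_t max_kv_bytes = 64 * granularity;  // illustrative cap
    CUdeviceptr kv_base = 0;
    CHECK(cuMemAddressReserve(&kv_base, max_kv_bytes, 0, 0, 0));

    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;

    // Back the next page of the (still contiguous) virtual range with
    // physical memory only when decoding actually needs it.
    std::vector<CUmemGenericAllocationHandle> pages;
    size_t mapped = 0;
    auto grow_kv_cache = [&]() {
        CUmemGenericAllocationHandle h;
        CHECK(cuMemCreate(&h, granularity, &prop, 0));
        CHECK(cuMemMap(kv_base + mapped, granularity, 0, h, 0));
        CHECK(cuMemSetAccess(kv_base + mapped, granularity, &access, 1));
        pages.push_back(h);
        mapped += granularity;
    };

    grow_kv_cache();  // e.g. when the current KV page of a request fills up
    grow_kv_cache();
    printf("mapped %zu of %zu reserved bytes at %p\n",
           mapped, max_kv_bytes, (void *)kv_base);

    // Teardown: unmap and release physical pages, then free the reservation.
    for (size_t i = 0; i < pages.size(); ++i) {
        CHECK(cuMemUnmap(kv_base + i * granularity, granularity));
        CHECK(cuMemRelease(pages[i]));
    }
    CHECK(cuMemAddressFree(kv_base, max_kv_bytes));
    CHECK(cuCtxDestroy(ctx));
    return 0;
}
```

Because the attention kernel still sees a single contiguous pointer per request, an unmodified kernel can index the cache directly; only the serving framework decides when to map or unmap pages behind that pointer.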