Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU

Zhenyu Ning, Jieru Zhao, Qihao Jin, Wenchao Ding, Minyi Guo
arXiv - CS - Performance. Published 2024-09-11. DOI: arxiv-2409.09086 (https://doi.org/arxiv-2409.09086)

Abstract

Multimodal Large Language Models (MLLMs) are distinguished by their comprehensive multimodal abilities and are widely used in many real-world applications, including GPT-4o, autonomous driving, and robotics. Despite their impressive performance, multimodal inputs always incur long contexts. Inference under long context requires caching massive Key and Value states (KV cache) of previous tokens, which introduces high latency and excessive memory consumption. For this reason, it is challenging to deploy streaming inference of MLLMs on edge devices, which largely constrains the power and usage of MLLMs in real-world applications. In this paper, we introduce Inf-MLLM, an efficient inference framework for MLLMs that enables streaming inference of MLLMs on a single GPU with infinite context. Inf-MLLM is based on our key observation of an attention pattern in both LLMs and MLLMs called "attention saddles". Thanks to this newly discovered attention pattern, Inf-MLLM maintains a size-constrained KV cache by dynamically caching recent tokens and relevant tokens. Furthermore, Inf-MLLM proposes attention bias, a novel approach that enables MLLMs to capture long-term dependencies. We show that Inf-MLLM enables multiple LLMs and MLLMs to achieve stable performance over 4M-token-long texts and multi-round conversations with 1-hour-long videos on a single GPU. In addition, Inf-MLLM exhibits better streaming reasoning quality than existing methods such as StreamingLLM and a 2x speedup over H2O.
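The idea of a size-constrained KV cache that keeps recent tokens plus relevant tokens can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not say how relevance is measured or how "attention saddles" are detected, so the per-token `scores`, the function name `select_kv_indices`, and the parameters `cache_size` and `recent_window` are all assumptions made for illustration. The sketch only shows the general shape of such an eviction policy: always keep the newest tokens, then fill the remaining budget with the highest-scoring older ones.

```python
from typing import List, Sequence


def select_kv_indices(
    scores: Sequence[float],
    cache_size: int,
    recent_window: int,
) -> List[int]:
    """Return the positions of cached tokens to keep, in original order.

    scores[i] is a hypothetical relevance score for cached token i, for
    example the attention it received from recent queries (an assumption;
    the abstract does not define the score).
    """
    if not 0 <= recent_window <= cache_size:
        raise ValueError("need 0 <= recent_window <= cache_size")
    n = len(scores)
    if n <= cache_size:
        return list(range(n))  # cache not full yet, nothing to evict

    # Always keep the most recent tokens.
    first_recent = n - recent_window
    recent = list(range(first_recent, n))

    # Fill the remaining budget with the highest-scoring older tokens.
    budget = cache_size - recent_window
    older = sorted(range(first_recent), key=lambda i: scores[i], reverse=True)
    relevant = older[:budget]

    return sorted(relevant) + recent


example_scores = [0.9, 0.1, 0.05, 0.6, 0.02, 0.03, 0.2, 0.1]
print(select_kv_indices(example_scores, cache_size=5, recent_window=3))
# prints [0, 3, 5, 6, 7]
```

In the example, the three newest tokens (positions 5 to 7) are kept unconditionally, and the two remaining slots go to positions 0 and 3, which carry the highest scores among the older tokens. Because the kept set never exceeds `cache_size`, memory stays bounded no matter how long the stream runs.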