Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs

Dingjie Song, Wenjun Wang, Shunian Chen, Xidong Wang, Michael Guan, Benyou Wang

arXiv:2409.10994 (arXiv - CS - Multimedia), 17 September 2024
Abstract
The rapid advancement of Multimodal Large Language Models (MLLMs) has led to remarkable performance across various domains. However, this progress is accompanied by a substantial surge in the resource consumption of these models. We address this pressing issue by introducing a new approach, Token Reduction using CLIP Metric (TRIM), aimed at improving the efficiency of MLLMs without sacrificing their performance. Inspired by human attention patterns in Visual Question Answering (VQA) tasks, TRIM presents a fresh perspective on the selection and reduction of image tokens. The TRIM method has been extensively tested across 12 datasets, and the results demonstrate a significant reduction in computational overhead while maintaining a consistent level of performance. This research marks a critical stride in efficient MLLM development, promoting greater accessibility and sustainability of high-performing models.
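
The abstract does not spell out TRIM's exact selection rule, but the core idea, ranking image tokens by a CLIP-style text-image similarity and keeping only the most query-relevant ones, can be illustrated with a short sketch. The function name, tensor shapes, and keep ratio below are illustrative assumptions, not the authors' implementation, and random stand-in embeddings replace real CLIP encoder outputs.

# Minimal sketch of CLIP-metric token reduction (assumed interface, not the
# paper's code): score image tokens by cosine similarity to the text query in
# a shared embedding space, then keep the top-k highest-scoring tokens.

import torch
import torch.nn.functional as F


def reduce_image_tokens(image_tokens: torch.Tensor,
                        text_embedding: torch.Tensor,
                        keep_ratio: float = 0.25) -> torch.Tensor:
    """Keep the image tokens most similar to the text query.

    image_tokens:   (num_tokens, dim) projected image patch embeddings
    text_embedding: (dim,) pooled text embedding from the same space
    keep_ratio:     fraction of tokens to retain
    """
    # Cosine similarity between every image token and the text query.
    sims = F.cosine_similarity(image_tokens, text_embedding.unsqueeze(0), dim=-1)

    # Retain the top-k most query-relevant tokens, preserving their order.
    k = max(1, int(keep_ratio * image_tokens.size(0)))
    keep_idx = sims.topk(k).indices.sort().values
    return image_tokens[keep_idx]


if __name__ == "__main__":
    # Stand-in embeddings; in practice these would come from CLIP's vision
    # and text encoders.
    tokens = torch.randn(576, 512)   # e.g. a 24x24 patch grid
    query = torch.randn(512)
    reduced = reduce_image_tokens(tokens, query, keep_ratio=0.25)
    print(reduced.shape)             # torch.Size([144, 512])

Reducing 576 image tokens to 144 in this toy setting shrinks the visual portion of the LLM input by 75%, which is the kind of computational saving the abstract refers to; the paper's actual selection criterion should be taken from the full text.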