Homogeneous tokenizer matters: Homogeneous visual tokenizer for remote sensing image understanding

IF 10.6; Zone 1, Earth Sciences; Q1 GEOGRAPHY, PHYSICAL; ISPRS Journal of Photogrammetry and Remote Sensing; Pub Date: 2024-09-21; DOI: 10.1016/j.isprsjprs.2024.09.009

Run Shao, Zhaoyang Zhang, Chao Tao, Yunsheng Zhang, Chengli Peng, Haifeng Li

Volume 218, Pages 294-310. Publication type: Journal Article. Source: https://www.sciencedirect.com/science/article/pii/S0924271624003472

Citations: 0

Abstract

On the basis of the transformer architecture and the pretext task of “next-token prediction”, multimodal large language models (MLLMs) are revolutionizing the paradigm in the field of remote sensing image understanding. However, the tokenizer, as one of the fundamental components of MLLMs, has long been overlooked or even misunderstood in visual tasks. A key factor contributing to the great comprehension power of large language models is that natural language tokenizers utilize meaningful words or subwords as the basic elements of language. In contrast, mainstream visual tokenizers, represented by patch-based methods such as Patch Embed, rely on meaningless rectangular patches as basic elements of vision. Analogous to words or subwords in language, we define semantically independent regions (SIRs) for vision and then propose two properties that an ideal visual tokenizer should possess: (1) homogeneity, where SIRs serve as the basic elements of vision, and (2) adaptability, which allows for a flexible number of tokens to accommodate images of any size and tasks of any granularity. On this basis, we design a simple HOmogeneous visual tOKenizer: HOOK. HOOK consists of two modules: an object perception module (OPM) and an object vectorization module (OVM). To achieve homogeneity, the OPM splits the image into 4 × 4 pixel seeds and then uses a self-attention mechanism to identify SIRs. The OVM employs cross-attention to merge seeds within the same SIR. To achieve adaptability, the OVM predefines a variable number of learnable vectors as cross-attention queries, allowing for the adjustment of the token quantity. We conducted experiments on the NWPU-RESISC45, WHU-RS19, and NaSC-TG2 classification datasets for sparse tasks and the GID5 and DGLCC segmentation datasets for dense tasks. The results show that the visual tokens obtained by HOOK correspond to individual objects, thereby verifying their homogeneity. Compared with randomly initialized or pretrained Patch Embed, which required more than one hundred tokens per image, HOOK required only 6 and 8 tokens for sparse and dense tasks, respectively, resulting in performance improvements of 2% to 10% and efficiency improvements of 1.5 to 2.8 times. The homogeneity and adaptability of the proposed approach provide new perspectives for the study of visual tokenizers. Guided by these principles, the developed HOOK has the potential to replace traditional Patch Embed. The code is available at https://github.com/GeoX-Lab/Hook.
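The two-stage pipeline described in the abstract (4 × 4 pixel seeds, self-attention among seeds, then cross-attention from a chosen number of query vectors) can be illustrated with a small NumPy sketch. This is not the authors' implementation, which is in the repository linked above. The function names, the embedding width `dim`, the random untrained weights, the single unprojected attention layer per stage, and the 224 × 224 image with 16 × 16 patches used for the Patch Embed comparison are all assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def split_into_seeds(image, seed=4):
    """Cut an (H, W, C) image into non-overlapping seed x seed pixel blocks.

    Returns a (num_seeds, seed * seed * C) matrix, one row per seed.
    """
    h, w, c = image.shape
    assert h % seed == 0 and w % seed == 0, "image size must be a multiple of seed"
    blocks = image.reshape(h // seed, seed, w // seed, seed, c)
    blocks = blocks.transpose(0, 2, 1, 3, 4)  # (rows, cols, seed, seed, C)
    return blocks.reshape(-1, seed * seed * c)


def attention(queries, keys, values):
    """Plain scaled dot-product attention (no learned projections)."""
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    return softmax(scores, axis=-1) @ values


def hook_like_tokenizer(image, num_tokens, dim=32, rng=None):
    """Hypothetical HOOK-style tokenizer with random (untrained) weights."""
    rng = np.random.default_rng(0) if rng is None else rng
    seeds = split_into_seeds(image)
    # Linear embedding of each seed (assumed; learned in a real model).
    w_embed = rng.normal(scale=0.02, size=(seeds.shape[1], dim))
    x = seeds @ w_embed
    # Perception stage: self-attention lets related seeds exchange information.
    x = x + attention(x, x, x)
    # Vectorization stage: num_tokens query vectors pool the seeds by
    # cross-attention, so the token count is a free parameter.
    queries = rng.normal(scale=0.02, size=(num_tokens, dim))
    return attention(queries, x, x)


image = np.random.default_rng(1).random((32, 32, 3))
print(split_into_seeds(image).shape)          # number of seeds x seed features
print(hook_like_tokenizer(image, 6).shape)    # 6 tokens, as in the sparse tasks
print(hook_like_tokenizer(image, 8).shape)    # 8 tokens, as in the dense tasks
print((224 // 16) ** 2)                       # patch tokens under the assumed 224/16 setup
```

The output token count depends only on the number of query vectors, not on the image size, which is the adaptability property the abstract describes. With trained weights, each query would be expected to attend to the seeds of one semantically independent region; the random weights here only demonstrate the shapes and data flow.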

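Source journal

ISPRS Journal of Photogrammetry and Remote Sensing (Engineering and Technology: Imaging Science and Photographic Technology)

CiteScore: 21.00
Self-citation rate: 6.30%
Publication volume: 273
Review time: 40 days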

About the journal: The ISPRS Journal of Photogrammetry and Remote Sensing (P&RS) is the official journal of the International Society for Photogrammetry and Remote Sensing (ISPRS). It serves as a platform for scientists and professionals worldwide who work in disciplines that use photogrammetry, remote sensing, spatial information systems, computer vision, and related fields. The journal aims to facilitate communication and dissemination of advances in these disciplines, while also acting as a comprehensive source of reference and archive. P&RS publishes high-quality, peer-reviewed research papers that are preferably original and have not been published before. These papers can cover scientific/research, technological development, or application/practical aspects. The journal also welcomes papers based on presentations from ISPRS meetings, as long as they are considered significant contributions to the aforementioned fields. In particular, P&RS encourages the submission of papers that are of broad scientific interest, showcase innovative applications (especially in emerging fields), have an interdisciplinary focus, discuss topics that have received limited attention in P&RS or related journals, or explore new directions in scientific or professional realms. Theoretical papers should preferably include practical applications, while papers focusing on systems and applications should include a theoretical background.