Lingfeng Yang;Xiang Li;Yueze Wang;Xinlong Wang;Jian Yang
Title: Fine-Grained Visual Text Prompting
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 3, pp. 1594-1609
Publication type: Journal Article
Publication date: 2024-11-21
DOI: 10.1109/TPAMI.2024.3504568
URL: https://ieeexplore.ieee.org/document/10763465/
Citations: 0
Abstract
Vision-Language Models (VLMs) such as CLIP excel at zero-shot image-level visual understanding but struggle with object-based tasks that require precise localization and recognition. Visual prompts, such as colored boxes or circles, have been proposed to enhance local perception. However, these prompts often cover irrelevant and noisy pixels, which leads to suboptimal performance. The design of better visual prompts, and how they should work together with text prompting, remains underexplored. This paper introduces Fine-Grained Visual Text Prompting (FGVTP), a new zero-shot framework for object-based tasks that uses precise semantic masks and reinforced image-text alignment. FGVTP comprises Fine-Grained Visual Prompting (FGVP) and Consistency-Enhanced Text Prompting (CETP). Specifically, we carefully study visual prompting designs by exploring a wider range of visual markings that vary in shape and form. FGVP takes semantic masks from a segmenter such as the Segment Anything Model (SAM) and applies background blurring (Blur Reverse Mask) to highlight the target while maintaining spatial coherence. CETP further enhances image-text alignment by prompting captions based on FGVP-processed images. As a result, FGVTP achieves superior zero-shot referring expression comprehension on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, outperforming previous state-of-the-art methods by 5.8% on average. Part detection experiments on the PACO dataset further confirm the advantage of FGVTP over existing work.
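The background-blurring idea behind Blur Reverse Mask can be illustrated with a small sketch. This is not the authors' implementation: it assumes a grayscale image stored as nested lists, a boolean mask that would in practice come from a segmenter such as SAM, and a plain box blur standing in for whatever blur kernel the paper uses. The names `box_blur` and `blur_reverse_mask` are hypothetical.

```python
from typing import List

Image = List[List[float]]  # grayscale pixels, row-major (assumed layout)
Mask = List[List[bool]]    # True marks pixels that belong to the target


def box_blur(image: Image, radius: int) -> Image:
    """Mean-filter the image; the window is clipped at the image borders."""
    height, width = len(image), len(image[0])
    blurred = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            y0, y1 = max(0, y - radius), min(height, y + radius + 1)
            x0, x1 = max(0, x - radius), min(width, x + radius + 1)
            window = [image[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            blurred[y][x] = sum(window) / len(window)
    return blurred


def blur_reverse_mask(image: Image, mask: Mask, radius: int = 1) -> Image:
    """Keep masked (target) pixels sharp and blur everything outside the mask."""
    blurred = box_blur(image, radius)
    width = len(image[0])
    return [
        [image[y][x] if mask[y][x] else blurred[y][x] for x in range(width)]
        for y in range(len(image))
    ]


if __name__ == "__main__":
    # Toy 4x4 image with a bright 2x2 target in the centre.
    image = [
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 9.0, 9.0, 0.0],
        [0.0, 9.0, 9.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],
    ]
    mask = [[value > 0 for value in row] for row in image]
    for row in blur_reverse_mask(image, mask, radius=1):
        print(row)
```

Because only pixels outside the mask are replaced, the target keeps its original appearance and position while the surrounding context stays visible in softened form, which is one way to read the abstract's "maintaining spatial coherence".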