APOVIS: Automated pixel-level open-vocabulary instance segmentation through integration of pre-trained vision-language models and foundational segmentation models
Qiujie Ma, Shuqi Yang, Lijuan Zhang, Qing Lan, Dongdong Yang, Honghan Chen, Ying Tan
Image and Vision Computing, Volume 154, February 2025, Article 105384. DOI: 10.1016/j.imavis.2024.105384. Available at https://www.sciencedirect.com/science/article/pii/S026288562400489X
Citations: 0
Abstract
In recent years, substantial advancements have been achieved in vision-language integration and image segmentation, particularly through the use of pre-trained models like BERT and Vision Transformer (ViT). Within the domain of open-vocabulary instance segmentation (OVIS), accurately identifying an instance's positional information is critical, as it directly influences the precision of subsequent segmentation tasks. However, many existing methods rely on supplementary networks to generate pseudo-labels, such as multiple anchor boxes containing object positional information. While these pseudo-labels aid vision-language models in recognizing the absolute position of objects, they often compromise the overall efficiency and performance of the OVIS pipeline. In this study, we introduce a novel Automated Pixel-level OVIS (APOVIS) framework aimed at enhancing OVIS. Our approach automatically generates pixel-level annotations by leveraging the matching capabilities of pre-trained vision-language models for image-text pairs alongside a foundational segmentation model that accepts multiple prompts (e.g., points or anchor boxes) to guide the segmentation process. Specifically, our method first utilizes a pre-trained vision-language model to match instances within image-text pairs to identify relative positions. Next, we employ activation maps to visualize the instances, enabling us to extract instance location information and generate pseudo-label prompts that direct the segmentation process. These pseudo-labels then guide the segmentation model to execute pixel-level segmentation, enhancing both the accuracy and generalizability of object segmentation across images. Extensive experimental results demonstrate that our model significantly outperforms current state-of-the-art models in object detection accuracy and pixel-level instance segmentation on the COCO dataset. Additionally, the generalizability of our approach is validated through image-text pair data inference tasks on the Open Images, Pascal VOC 2012, Pascal Context, and ADE20K datasets. The code will be available at https://github.com/ijetma/APOVIS.
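For illustration only, the sketch below shows one way the prompt-generation step described above could look: a per-instance activation map produced by a vision-language model is normalized and thresholded to derive a point prompt and a box prompt that a promptable segmentation model could consume. This is not the authors' released code (see the GitHub link above); the map shape, the threshold value, and the activation_to_prompts helper are illustrative assumptions.

import numpy as np

def activation_to_prompts(act_map: np.ndarray, thresh: float = 0.5):
    """Convert an [H, W] activation map into (point, box) prompts.

    point: (x, y) location of the strongest activation.
    box:   (x_min, y_min, x_max, y_max) around activations >= `thresh`,
           or None if nothing exceeds the threshold.
    """
    # Normalize the map to [0, 1] so the threshold is scale-independent.
    act = (act_map - act_map.min()) / (act_map.max() - act_map.min() + 1e-8)

    # Peak activation -> point prompt.
    y, x = np.unravel_index(int(np.argmax(act)), act.shape)

    # Active region -> box prompt.
    ys, xs = np.nonzero(act >= thresh)
    box = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())) if xs.size else None
    return (int(x), int(y)), box

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_map = rng.random((64, 64))  # stand-in for a real activation map
    point, box = activation_to_prompts(fake_map, thresh=0.9)
    print("point prompt:", point, "box prompt:", box)

In the pipeline the abstract describes, prompts of this kind (points or boxes) would then be passed to the foundational segmentation model to produce the final pixel-level instance mask.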
About the journal:
The primary aim of Image and Vision Computing is to provide an effective medium of interchange for the results of high-quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real-world scenes. It seeks to deepen understanding within the discipline by encouraging quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, and image databases.