Exploring Vision-Language Foundation Model for Novel Object Captioning
Jianjie Luo; Yehao Li; Yingwei Pan; Ting Yao; Jianlin Feng; Hongyang Chao; Tao Mei
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 1, pp. 91-102. Published 2024-08-30. DOI: 10.1109/TCSVT.2024.3452437. https://ieeexplore.ieee.org/document/10659916/
Citations: 0
Abstract
It is widely believed that pre-trained vision-language foundation models (e.g., CLIP) would substantially facilitate vision-language tasks. Nevertheless, there has been little evidence supporting this idea for describing novel objects in images. In this paper, we propose the Novel Object Transformer with CLIP (NOTC), a Transformer-based model that innovatively exploits the powerful vision-language representation ability of CLIP to enhance the training and sentence decoding processes of a novel object captioning model. Technically, given the primary bag-of-objects extracted by Faster R-CNN, NOTC first capitalizes on an object distiller module to emphasize the most salient objects and infer the missing novel ones. The refined object words are additionally fed into the object-centric word predictor to generate the sentence word by word. During training, we design a CLIP-based self-critical sequence training paradigm that selects visually-grounded sampled sentences with higher CLIP score rewards, which enables joint training of the captioning model on out-of-domain training images containing novel objects. Moreover, at inference, a new CLIP beam search algorithm is devised to enforce the existence of novel objects and favor partial word sequences with higher CLIP scores, thereby decoding both visually-grounded and comprehensive sentences. Extensive experiments are conducted on the held-out COCO and nocaps datasets, and competitive performance is reported when compared to state-of-the-art approaches.
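The abstract describes two CLIP-driven components: a CLIP-score reward used for self-critical sequence training, and a CLIP-guided beam search at inference. The snippet below is a minimal sketch of how such a CLIP-based self-critical reward could be computed, not the authors' released code: the sampled caption's CLIP image-text similarity is baselined against the greedy caption's score, so the policy gradient pushes the captioner toward visually-grounded sentences. The Hugging Face checkpoint name, the function names, and the `sample_logprobs` interface are illustrative assumptions.

```python
# Sketch of a CLIP-score reward for self-critical sequence training (assumed setup).
import torch
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def clip_score(images, captions):
    """Cosine similarity between CLIP image and text embeddings, one score per pair."""
    inputs = clip_proc(text=captions, images=images,
                       return_tensors="pt", padding=True, truncation=True).to(device)
    img = clip_model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = clip_model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1)  # shape: (batch,)


def clip_scst_loss(images, sampled_caps, greedy_caps, sample_logprobs):
    """Self-critical loss with the CLIP score as reward.

    sample_logprobs: (batch,) sum of log-probabilities of the sampled captions
    under the captioning model, kept in the autograd graph.
    """
    reward = clip_score(images, sampled_caps) - clip_score(images, greedy_caps)
    return -(reward.detach() * sample_logprobs).mean()
```

In the same spirit, a `clip_score`-style routine could be used at decoding time to rescore partial beam hypotheses, which is roughly what the CLIP beam search above does; the exact scoring weights and the novel-object constraint are detailed in the paper itself.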
About the Journal
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.