{"title":"See or Guess: Counterfactually Regularized Image Captioning","authors":"Qian Cao, Xu Chen, Ruihua Song, Xiting Wang, Xinting Huang, Yuchen Ren","doi":"arxiv-2408.16809","DOIUrl":null,"url":null,"abstract":"Image captioning, which generates natural language descriptions of the visual\ninformation in an image, is a crucial task in vision-language research.\nPrevious models have typically addressed this task by aligning the generative\ncapabilities of machines with human intelligence through statistical fitting of\nexisting datasets. While effective for normal images, they may struggle to\naccurately describe those where certain parts of the image are obscured or\nedited, unlike humans who excel in such cases. These weaknesses they exhibit,\nincluding hallucinations and limited interpretability, often hinder performance\nin scenarios with shifted association patterns. In this paper, we present a\ngeneric image captioning framework that employs causal inference to make\nexisting models more capable of interventional tasks, and counterfactually\nexplainable. Our approach includes two variants leveraging either total effect\nor natural direct effect. Integrating them into the training process enables\nmodels to handle counterfactual scenarios, increasing their generalizability.\nExtensive experiments on various datasets show that our method effectively\nreduces hallucinations and improves the model's faithfulness to images,\ndemonstrating high portability across both small-scale and large-scale\nimage-to-text models. The code is available at\nhttps://github.com/Aman-4-Real/See-or-Guess.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"44 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.16809","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Image captioning, which generates natural language descriptions of the visual information in an image, is a crucial task in vision-language research. Previous models typically address this task by aligning the generative capabilities of machines with human intelligence through statistical fitting on existing datasets. While effective for ordinary images, they may struggle to accurately describe images in which certain regions have been obscured or edited, cases that humans handle well. The resulting weaknesses, including hallucinations and limited interpretability, often hinder performance in scenarios with shifted association patterns. In this paper, we present a generic image captioning framework that employs causal inference to make existing models more capable of interventional tasks and counterfactually explainable. Our approach includes two variants that leverage either the total effect or the natural direct effect. Integrating them into the training process enables models to handle counterfactual scenarios, increasing their generalizability. Extensive experiments on various datasets show that our method effectively reduces hallucinations and improves the model's faithfulness to images, demonstrating high portability across both small-scale and large-scale image-to-text models. The code is available at https://github.com/Aman-4-Real/See-or-Guess.
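For reference, the two causal quantities mentioned above have standard definitions in causal mediation analysis. The sketch below states them for a generic treatment X, mediator M, and outcome Y; this is the textbook formulation, not necessarily the exact way the paper instantiates these variables for image captioning.

% Standard causal-mediation definitions; a reference sketch only --
% the paper's instantiation of X, M, and Y for captioning may differ.
\begin{align}
  \mathrm{TE}  &= \mathbb{E}\big[\, Y(X{=}1) \,\big] - \mathbb{E}\big[\, Y(X{=}0) \,\big] \\
  \mathrm{NDE} &= \mathbb{E}\big[\, Y\big(X{=}1,\; M(X{=}0)\big) \,\big]
                - \mathbb{E}\big[\, Y\big(X{=}0,\; M(X{=}0)\big) \,\big]
\end{align}

Intuitively, the total effect measures the overall change in the outcome when the treatment is switched, whereas the natural direct effect holds the mediator at its untreated value and so isolates the change that does not flow through the mediator.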