StableNet: Distinguishing the hard samples to overcome language priors in visual question answering

IF 1.3; Zone 4 (Computer Science); Q4 (Computer Science, Artificial Intelligence); IET Computer Vision; Pub Date: 2023-10-28; DOI: 10.1049/cvi2.12249

Zhengtao Yu, Jia Zhao, Chenliang Guo, Ying Yang

Journal article. IET Computer Vision, vol. 18, no. 2, pp. 315-327. Article page: https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/cvi2.12249

Citations: 0

Abstract




With the rapid growth of computer vision and natural language processing, cross-modal tasks such as visual question answering (VQA) have become very popular. However, several studies have shown that many VQA models suffer from severe language prior problems. Through a series of experiments, the authors found that previous VQA models are unstable: when training is repeated several times on the same dataset, the distributions of the predicted answers differ significantly from run to run, and the models also perform unsatisfactorily in terms of accuracy. The authors attribute this instability to a subset of difficult samples that seriously interfere with model training. They therefore design a method to measure model stability quantitatively and further propose a method that alleviates both model imbalance and instability. Specifically, question types are classified as simple or difficult, and different weighting measures are applied to each. Imposing constraints on the training process for both types of questions improves the stability and accuracy of the model. Experimental results demonstrate the effectiveness of the method, which achieves 63.11% accuracy on VQA-CP v2 and 75.49% with the addition of a pre-trained model.
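The abstract does not say how stability is quantified, so the following is only an illustrative sketch of one way to compare predicted-answer distributions across repeated training runs. It assumes the mean pairwise Jensen-Shannon divergence as the instability score; that choice, the function names, and the toy predictions are all assumptions made here for illustration and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def answer_distribution(predictions):
    """Turn a list of predicted answers into a {answer: probability} dict."""
    counts = Counter(predictions)
    total = len(predictions)
    return {answer: n / total for answer, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two distributions.

    The value lies in [0, 1]: 0 for identical distributions,
    1 for distributions with no answers in common.
    """
    total = 0.0
    for answer in set(p) | set(q):
        pa, qa = p.get(answer, 0.0), q.get(answer, 0.0)
        ma = 0.5 * (pa + qa)  # mixture probability for this answer
        if pa > 0:
            total += 0.5 * pa * math.log2(pa / ma)
        if qa > 0:
            total += 0.5 * qa * math.log2(qa / ma)
    return total


def instability(runs):
    """Mean pairwise JS divergence over the runs' answer distributions.

    This is an assumed stand-in metric, not the one defined in the paper.
    """
    if len(runs) < 2:
        raise ValueError("need at least two runs to compare")
    dists = [answer_distribution(run) for run in runs]
    pairs = list(combinations(dists, 2))
    return sum(js_divergence(p, q) for p, q in pairs) / len(pairs)


# Hypothetical predictions from three training runs on the same four questions.
runs = [
    ["yes", "no", "2", "red"],
    ["yes", "yes", "2", "red"],
    ["no", "no", "2", "blue"],
]
score = instability(runs)
print(f"instability = {score:.3f}")
```

A score near 0 would indicate that repeated runs produce nearly the same answer distribution, while larger values indicate run-to-run drift. Because this compares only the overall answer frequencies, a per-question-type version (grouping predictions by question type before comparing) may be closer in spirit to the simple/difficult split described in the abstract.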


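Source journal: IET Computer Vision (Engineering & Technology; Engineering, Electrical & Electronic)

CiteScore: 3.30

Self-citation rate: 11.80%

Number of articles published:

Review time: 3.4 months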

About the journal: IET Computer Vision seeks original research papers in a wide range of areas of computer vision. The vision of the journal is to publish the highest quality research work that is relevant and topical to the field, without forgetting those works that aim to introduce new horizons and set the agenda for future avenues of research in computer vision.

IET Computer Vision welcomes submissions on the following topics: Biologically and perceptually motivated approaches to low level vision (feature detection, etc.); Perceptual grouping and organisation; Representation, analysis and matching of 2D and 3D shape; Shape-from-X; Object recognition; Image understanding; Learning with visual inputs; Motion analysis and object tracking; Multiview scene analysis; Cognitive approaches in low, mid and high level vision; Control in visual systems; Colour, reflectance and light; Statistical and probabilistic models; Face and gesture; Surveillance; Biometrics and security; Robotics; Vehicle guidance; Automatic model acquisition; Medical image analysis and understanding; Aerial scene analysis and remote sensing; Deep learning models in computer vision.

Both methodological and applications orientated papers are welcome. Manuscripts submitted are expected to include a detailed and analytical review of the literature and state-of-the-art exposition of the original proposed research and its methodology, its thorough experimental evaluation, and, last but not least, comparative evaluation against relevant and state-of-the-art methods. Submissions not abiding by these minimum requirements may be returned to authors without being sent to review.

Special Issues, Current Call for Papers:
Computer Vision for Smart Cameras and Camera Networks - https://digital-library.theiet.org/files/IET_CVI_SC.pdf
Computer Vision for the Creative Industries - https://digital-library.theiet.org/files/IET_CVI_CVCI.pdf