Journal: Neurocomputing (Q1, Computer Science, Artificial Intelligence; Impact Factor 5.5)
DOI: 10.1016/j.neucom.2024.128419
Publication date: 2024-08-22
Publication type: Journal Article
URL: https://www.sciencedirect.com/science/article/pii/S0925231224011901
Overcoming language priors in visual question answering with cumulative learning strategy
Visual question answering (VQA) has seen great progress over the last few years. However, many current VQA models rely on superficial linguistic correlations between questions and answers, often failing to learn sufficient multi-modal knowledge from both vision and language, and consequently suffer significant performance drops. To address this issue, the VQA-CP v2.0 dataset was developed to reduce language biases by greedily re-partitioning the training and test distributions of VQA v2.0. Given that achieving high performance on real-world datasets requires effective learning from minor classes, in this paper we analyze the skewed long-tail distribution of the VQA-CP v2.0 dataset and propose a new ensemble-based, parameter-insensitive framework. The framework is built on two representation learning branches and a joint learning block, which are designed to reduce language biases in VQA tasks. Specifically, the representation learning branches ensure that strong representations are learned from both the major and the minor classes. The joint learning block forces the model to concentrate initially on the major classes for robust representation, then gradually shift its focus toward the minor classes for classification as training progresses. Experimental results demonstrate that our approach outperforms state-of-the-art methods on the VQA-CP v2.0 dataset without requiring additional annotations. Notably, on the “num” question type, our framework exceeds the second-best method (without extra annotations) by 8.64%. Meanwhile, compared with the baseline model, our approach does not sacrifice accuracy on the VQA v2.0 dataset.
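The joint learning block's gradual shift from major to minor classes can be pictured as a scheduled blend of two branch losses. The sketch below is a hypothetical illustration only: the abstract does not specify the schedule, so the parabolic decay `alpha = 1 - (t/T)^2` is borrowed from the common cumulative-learning convention for long-tailed recognition, and the function names are invented for this example.

```python
def cumulative_weight(epoch: int, total_epochs: int) -> float:
    """Weight on the major-class (representation) branch.

    Decays from 1.0 at the start of training to 0.0 at the end,
    so focus shifts toward the minor-class (classification) branch.
    """
    t = min(max(epoch, 0), total_epochs)  # clamp to [0, total_epochs]
    return 1.0 - (t / total_epochs) ** 2


def joint_loss(loss_major: float, loss_minor: float,
               epoch: int, total_epochs: int) -> float:
    """Blend the two branch losses with the cumulative weight."""
    alpha = cumulative_weight(epoch, total_epochs)
    return alpha * loss_major + (1.0 - alpha) * loss_minor
```

Early in training the major-class loss dominates (`alpha` near 1), giving robust shared representations; late in training the minor-class loss dominates, which is what lets the model recover accuracy on the tail of the skewed answer distribution.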
Journal description:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing, covering its theory, practice, and applications.