Counting in Visual Question Answering: Methods, Datasets, and Future Work

International Journal of Image and Graphics (IF 0.8, Q4, Computer Science, Software Engineering). Pub Date: 2023-10-20. DOI: 10.1142/s0219467825500445

Tesfayee Meshu Welde, Lejian Liao



Abstract

Visual Question Answering (VQA) is a language-based method for analyzing images and is highly helpful in assisting people with visual impairment. Unlike task-specific models that simply classify objects into categories, a VQA system must demonstrate holistic image understanding and carry out basic reasoning about the image. VQA systems thus contribute to the growth of Artificial Intelligence (AI) technology by answering open-ended, arbitrary questions about a given image. VQA is also used to assess a system’s ability through the Visual Turing Test (VTT). However, because the essential datasets are difficult to generate and because flaws and bias hinder evaluation, a system’s overall efficiency cannot yet be assessed reliably. This is seen as a significant limitation of VQA, and it in turn slows the progress of VQA algorithms. Current VQA research addresses more specific sub-problems, one of which is counting. Counting is among the more demanding sub-problems, especially for complex counting questions that require object identification together with detection of object attributes and positional reasoning. The pooling operation used to implement the attention mechanism in VQA has been found to degrade counting performance, and a number of algorithms have been developed to address this issue. In this paper, we provide a comprehensive survey of counting techniques in VQA systems developed especially for answering “How many?” questions. However, the progress achieved so far is still not satisfactory, owing to dataset bias arising from the way questions are phrased and to weak evaluation metrics.
Future work calls for fully fledged architectures, large-scale datasets with complex counting questions and a detailed breakdown by category, and strong evaluation metrics that measure a system’s ability to answer complex counting questions, such as those requiring positional and comparative reasoning.
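
The abstract does not say why attention pooling hurts counting. One commonly given explanation is that softmax-normalized attention produces a weighted average of region features, so two identical attended objects pool to the same vector as one. The sketch below illustrates that effect under stated assumptions: the two-dimensional "cat" feature, the attention scores, and the helper names `attention_pool` and `sum_pool` are all hypothetical, and the unnormalized sum is shown only as a contrast, not as a method taken from the survey.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(features, scores):
    """Soft attention: softmax-weighted average of region features."""
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]

def sum_pool(features):
    """Unnormalized sum of region features (illustrative contrast only)."""
    dim = len(features[0])
    return [sum(f[d] for f in features) for d in range(dim)]

# Hypothetical feature vector for one detected cat region.
cat = [1.0, 0.0]

# The same score is assigned to every cat region (assumed values).
one_cat = attention_pool([cat], [5.0])
two_cats = attention_pool([cat, cat], [5.0, 5.0])

# The normalized weights sum to 1, so the pooled vectors match and the
# number of cats can no longer be read from the pooled feature.
print(one_cat, two_cats)

# An unnormalized sum keeps the count in the feature magnitude.
print(sum_pool([cat]), sum_pool([cat, cat]))
```

In this toy setting the two attention-pooled vectors are indistinguishable, while the summed vectors differ by the number of regions, which is the kind of information a downstream "How many?" classifier would need.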


Source journal: International Journal of Image and Graphics (Computer Science, Software Engineering). CiteScore: 2.40. Self-citation rate: 18.80%.