Counting in Visual Question Answering: Methods, Datasets, and Future Work

International Journal of Image and Graphics | Pub Date: 2023-10-20 | DOI: 10.1142/s0219467825500445 | IF 0.8 | JCR Q4, Computer Science, Software Engineering
Tesfayee Meshu Welde, Lejian Liao
Citations: 0

Abstract

Visual Question Answering (VQA) is a language-based method for analyzing images and is highly helpful in assisting people with visual impairment. A VQA system must demonstrate holistic image understanding and carry out basic reasoning about the image, in contrast to task-specific models that simply classify objects into categories. By answering open-ended, arbitrary questions about a given image, VQA systems thus contribute to the growth of Artificial Intelligence (AI) technology. VQA is also used to assess a system’s ability through the Visual Turing Test (VTT). However, because the essential datasets are difficult to generate and existing evaluation suffers from flaws and bias, VQA cannot yet assess a system’s overall efficiency. This is a significant limitation, and it in turn slows the performance progress observed in VQA algorithms. Current VQA research addresses more specific sub-problems, one of which is counting. Counting is among the more sophisticated sub-problems and is riddled with challenges, especially for complex counting questions that demand object identification together with detection of object attributes and positional reasoning. The pooling operation used to implement attention mechanisms in VQA has been found to degrade counting performance, and a number of algorithms have been developed to address this issue. In this paper, we provide a comprehensive survey of counting techniques in VQA developed specifically for answering “How many?” questions. However, the performance achieved so far is still not satisfactory, owing to dataset bias arising from the way questions are phrased and to weak evaluation metrics. Future work should deliver fully fledged architectures, large-scale datasets with complex counting questions and a detailed breakdown by category, and strong metrics for evaluating a system’s ability to answer complex counting questions, such as those involving positional and comparative reasoning.
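The abstract states that attention pooling degrades counting but does not say why. A minimal sketch of one common explanation, assumed here and not taken from the paper: softmax-normalized attention weights always sum to one, so the pooled feature is a weighted average and can look identical whether an image contains one object or two. The feature vectors, scores, and function names below are hypothetical toy values chosen for illustration only.

```python
import math


def softmax(scores):
    """Normalize raw attention scores so the weights sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, scores):
    """Soft attention: weighted average of region features."""
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


def sigmoid_sum_pool(features, scores):
    """Illustrative contrast only: unnormalized weights keep a trace of the count."""
    weights = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


# Toy 2-d region features (hypothetical): one "cat" direction, one "background".
CAT = [1.0, 0.0]
BACKGROUND = [0.0, 1.0]

# Image A holds one cat; image B holds two. Cats get the same high score.
image_a, scores_a = [CAT, BACKGROUND], [5.0, 0.0]
image_b, scores_b = [CAT, CAT, BACKGROUND, BACKGROUND], [5.0, 5.0, 0.0, 0.0]

pooled_a = attention_pool(image_a, scores_a)
pooled_b = attention_pool(image_b, scores_b)
summed_a = sigmoid_sum_pool(image_a, scores_a)
summed_b = sigmoid_sum_pool(image_b, scores_b)

print("softmax pooling, 1 cat :", pooled_a)
print("softmax pooling, 2 cats:", pooled_b)
print("sigmoid sum,     1 cat :", summed_a)
print("sigmoid sum,     2 cats:", summed_b)
```

In this toy setup the two softmax-pooled vectors coincide up to floating-point rounding, so a downstream classifier cannot separate "one" from "two", whereas the unnormalized sum doubles along the cat direction. The contrast is only meant to make the failure mode concrete; it is not a description of any specific algorithm covered by the survey.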