Find Someone Who: Visual Commonsense Understanding in Human-Centric Grounding

Haoxuan You, Rui Sun, Zhecan Wang, Kai-Wei Chang, Shih-Fu Chang
{"title":"Find Someone Who: Visual Commonsense Understanding in Human-Centric Grounding","authors":"Haoxuan You, Rui Sun, Zhecan Wang, Kai-Wei Chang, Shih-Fu Chang","doi":"10.48550/arXiv.2212.06971","DOIUrl":null,"url":null,"abstract":"From a visual scene containing multiple people, human is able to distinguish each individual given the context descriptions about what happened before, their mental/physical states or intentions, etc. Above ability heavily relies on human-centric commonsense knowledge and reasoning. For example, if asked to identify the\"person who needs healing\"in an image, we need to first know that they usually have injuries or suffering expressions, then find the corresponding visual clues before finally grounding the person. We present a new commonsense task, Human-centric Commonsense Grounding, that tests the models' ability to ground individuals given the context descriptions about what happened before, and their mental/physical states or intentions. We further create a benchmark, HumanCog, a dataset with 130k grounded commonsensical descriptions annotated on 67k images, covering diverse types of commonsense and visual scenes. We set up a context-object-aware method as a strong baseline that outperforms previous pre-trained and non-pretrained models. Further analysis demonstrates that rich visual commonsense and powerful integration of multi-modal commonsense are essential, which sheds light on future works. Data and code will be available https://github.com/Hxyou/HumanCog.","PeriodicalId":74540,"journal":{"name":"Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing","volume":"152 1","pages":"5444-5454"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2212.06971","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

From a visual scene containing multiple people, humans are able to distinguish each individual given context descriptions about what happened before, their mental/physical states, their intentions, etc. This ability relies heavily on human-centric commonsense knowledge and reasoning. For example, if asked to identify the "person who needs healing" in an image, we first need to know that such a person usually has injuries or a suffering expression, then find the corresponding visual clues, before finally grounding the person. We present a new commonsense task, Human-centric Commonsense Grounding, that tests a model's ability to ground individuals given context descriptions about what happened before, their mental/physical states, or their intentions. We further create a benchmark, HumanCog, a dataset with 130k grounded commonsensical descriptions annotated on 67k images, covering diverse types of commonsense and visual scenes. We set up a context-object-aware method as a strong baseline that outperforms previous pre-trained and non-pretrained models. Further analysis demonstrates that rich visual commonsense and powerful integration of multi-modal commonsense are essential, which sheds light on future work. Data and code will be available at https://github.com/Hxyou/HumanCog.
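To make the task setup concrete, here is a minimal sketch of how grounding could be framed: given a context description and the detected people in an image, a model scores each candidate and picks the best match, and accuracy is measured against the annotated person. The data layout, function names, and cosine-similarity scoring below are illustrative assumptions, not the actual HumanCog format or the paper's context-object-aware baseline.

```python
# Hypothetical illustration of Human-centric Commonsense Grounding.
# Embeddings are stand-ins for whatever encoder a real model would use.
import numpy as np


def ground_person(description_emb: np.ndarray, candidate_embs: np.ndarray) -> int:
    """Return the index of the candidate person whose embedding best
    matches the context description (cosine similarity)."""
    desc = description_emb / np.linalg.norm(description_emb)
    cands = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    return int(np.argmax(cands @ desc))


def grounding_accuracy(examples) -> float:
    """Fraction of examples where the predicted person matches the
    annotated ground-truth person index."""
    correct = sum(
        ground_person(ex["description_emb"], ex["candidate_embs"]) == ex["gt_index"]
        for ex in examples
    )
    return correct / len(examples)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy example: one image with 4 detected people; the description refers to person 2.
    toy = [{
        "description_emb": rng.normal(size=64),
        "candidate_embs": rng.normal(size=(4, 64)),
        "gt_index": 2,
    }]
    print(grounding_accuracy(toy))
```

In this framing, the interesting modeling work lies in producing description and candidate embeddings that encode the required commonsense (e.g., "needs healing" implies visible injuries or a pained expression), which is what the paper's analysis highlights as essential.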