Coreference resolution helps visual dialogs to focus

High-Confidence Computing · IF 3.2 · Q2 (Computer Science, Information Systems) · Pub Date: 2023-11-22 · DOI: 10.1016/j.hcc.2023.100184
Tianwei Yue, Wenping Wang, Chen Liang, Dachi Chen, Congrui Hetang, Xuewei Wang
{"title":"Coreference resolution helps visual dialogs to focus","authors":"Tianwei Yue ,&nbsp;Wenping Wang ,&nbsp;Chen Liang,&nbsp;Dachi Chen,&nbsp;Congrui Hetang,&nbsp;Xuewei Wang","doi":"10.1016/j.hcc.2023.100184","DOIUrl":null,"url":null,"abstract":"<div><p>Visual Dialog is a multi-modal task involving both computer vision and dialog systems. The goal is to answer multiple questions in conversation style, given an image as the context. Neural networks with attention modules are widely used for this task, because of their effectiveness in reasoning the relevance between the texts and images. In this work, we study how to further improve the quality of such reasoning, which is an open challenge. Our baseline is the Recursive Visual Attention (RVA) model, which refines the vision-text attention by iteratively visiting the dialog history. Building on top of that, we propose to improve the attention mechanism with contrastive learning. We train a Matching-Aware Attention Kernel (MAAK) by aligning the deep feature embeddings of an image and its caption, to provide better attention scores. Experiments show consistent improvements from MAAK. In addition, we study the effect of using Multimodal Compact Bilinear (MCB) pooling as a three-way feature fusion for the visual, textual and dialog history embeddings. We analyze the performance of both methods in the discussion section, and propose further ideas to resolve current limitations.</p></div>","PeriodicalId":100605,"journal":{"name":"High-Confidence Computing","volume":"4 2","pages":"Article 100184"},"PeriodicalIF":3.2000,"publicationDate":"2023-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S266729522300082X/pdfft?md5=ab949d922d5965a06641ae36f6129271&pid=1-s2.0-S266729522300082X-main.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"High-Confidence Computing","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S266729522300082X","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Visual Dialog is a multi-modal task that involves both computer vision and dialog systems. The goal is to answer a series of questions in conversational style, given an image as the context. Neural networks with attention modules are widely used for this task because of their effectiveness in reasoning about the relevance between text and images. In this work, we study how to further improve the quality of such reasoning, which remains an open challenge. Our baseline is the Recursive Visual Attention (RVA) model, which refines vision-text attention by iteratively revisiting the dialog history. Building on top of it, we propose to improve the attention mechanism with contrastive learning: we train a Matching-Aware Attention Kernel (MAAK) by aligning the deep feature embeddings of an image and its caption, which yields better attention scores. Experiments show consistent improvements from MAAK. In addition, we study the effect of Multimodal Compact Bilinear (MCB) pooling as a three-way fusion of the visual, textual, and dialog-history embeddings. We analyze the performance of both methods in the discussion section and propose further ideas for resolving current limitations.
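The abstract does not spell out the training objective behind MAAK. As a rough illustration only, the following PyTorch sketch shows one common way to align image and caption embeddings with a symmetric contrastive (InfoNCE-style) loss, so that the learned projections can serve as a matching-aware attention kernel. All names and dimensions here (MatchingAwareKernel, img_dim=2048, temperature=0.07, and so on) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: contrastive alignment of image and caption
# embeddings, in the spirit of the Matching-Aware Attention Kernel (MAAK).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatchingAwareKernel(nn.Module):
    """Projects image and caption features into a shared space; the dot
    product of the normalized projections serves as an attention score."""
    def __init__(self, img_dim=2048, txt_dim=512, joint_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, joint_dim)
        self.txt_proj = nn.Linear(txt_dim, joint_dim)

    def forward(self, img_feat, txt_feat):
        z_img = F.normalize(self.img_proj(img_feat), dim=-1)
        z_txt = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return z_img, z_txt

def contrastive_loss(z_img, z_txt, temperature=0.07):
    """Symmetric InfoNCE: matched (image, caption) pairs on the diagonal
    are pulled together; in-batch mismatched pairs are pushed apart."""
    logits = z_img @ z_txt.t() / temperature               # (B, B) similarities
    targets = torch.arange(z_img.size(0), device=z_img.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Toy usage with random features standing in for real encoders.
kernel = MatchingAwareKernel()
img = torch.randn(8, 2048)   # e.g. pooled CNN region features
cap = torch.randn(8, 512)    # e.g. caption encoder output
loss = contrastive_loss(*kernel(img, cap))
loss.backward()
```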
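MCB pooling, introduced for visual question answering by Fukui et al. (2016), approximates the outer product of feature vectors by combining Count Sketch projections with elementwise multiplication in the Fourier domain. The sketch below extends this to three inputs by multiplying three sketch spectra, which is one plausible reading of the three-way fusion mentioned above; the dimensions and the exact fusion scheme are assumptions, not the paper's configuration.

```python
# Hypothetical sketch: three-way Multimodal Compact Bilinear (MCB) pooling
# over visual, textual, and dialog-history embeddings.
import torch
import torch.nn as nn

class CountSketch(nn.Module):
    """Fixed random Count Sketch projection of an in_dim vector to dimension d."""
    def __init__(self, in_dim, d):
        super().__init__()
        self.d = d
        self.register_buffer("h", torch.randint(0, d, (in_dim,)))                  # hash buckets
        self.register_buffer("s", torch.randint(0, 2, (in_dim,)).float() * 2 - 1)  # random +/-1 signs

    def forward(self, x):                        # x: (B, in_dim)
        out = x.new_zeros(x.size(0), self.d)
        out.index_add_(1, self.h, x * self.s)    # scatter-add the signed features
        return out

def mcb_fuse(sketches):
    """Elementwise product of the sketch spectra approximates the
    (higher-order) outer product of the original inputs."""
    spec = torch.fft.rfft(sketches[0], dim=-1)
    for sk in sketches[1:]:
        spec = spec * torch.fft.rfft(sk, dim=-1)
    return torch.fft.irfft(spec, n=sketches[0].size(-1), dim=-1)

# Toy usage: fuse visual, textual, and dialog-history features into one vector.
d = 1024
cs_v, cs_t, cs_h = CountSketch(2048, d), CountSketch(512, d), CountSketch(512, d)
v, t, h = torch.randn(4, 2048), torch.randn(4, 512), torch.randn(4, 512)
fused = mcb_fuse([cs_v(v), cs_t(t), cs_h(h)])    # (4, d) joint embedding
```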
