Coreference resolution helps visual dialogs to focus

High-Confidence Computing · IF 3.2 · Q2 (Computer Science, Information Systems) · Pub Date: 2023-11-22 · DOI: 10.1016/j.hcc.2023.100184
Tianwei Yue, Wenping Wang, Chen Liang, Dachi Chen, Congrui Hetang, Xuewei Wang
{"title":"Coreference resolution helps visual dialogs to focus","authors":"Tianwei Yue ,&nbsp;Wenping Wang ,&nbsp;Chen Liang,&nbsp;Dachi Chen,&nbsp;Congrui Hetang,&nbsp;Xuewei Wang","doi":"10.1016/j.hcc.2023.100184","DOIUrl":null,"url":null,"abstract":"<div><p>Visual Dialog is a multi-modal task involving both computer vision and dialog systems. The goal is to answer multiple questions in conversation style, given an image as the context. Neural networks with attention modules are widely used for this task, because of their effectiveness in reasoning the relevance between the texts and images. In this work, we study how to further improve the quality of such reasoning, which is an open challenge. Our baseline is the Recursive Visual Attention (RVA) model, which refines the vision-text attention by iteratively visiting the dialog history. Building on top of that, we propose to improve the attention mechanism with contrastive learning. We train a Matching-Aware Attention Kernel (MAAK) by aligning the deep feature embeddings of an image and its caption, to provide better attention scores. Experiments show consistent improvements from MAAK. In addition, we study the effect of using Multimodal Compact Bilinear (MCB) pooling as a three-way feature fusion for the visual, textual and dialog history embeddings. We analyze the performance of both methods in the discussion section, and propose further ideas to resolve current limitations.</p></div>","PeriodicalId":100605,"journal":{"name":"High-Confidence Computing","volume":"4 2","pages":"Article 100184"},"PeriodicalIF":3.2000,"publicationDate":"2023-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S266729522300082X/pdfft?md5=ab949d922d5965a06641ae36f6129271&pid=1-s2.0-S266729522300082X-main.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"High-Confidence Computing","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S266729522300082X","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Visual Dialog is a multi-modal task that involves both computer vision and dialog systems. The goal is to answer a series of questions in conversational style, given an image as the context. Neural networks with attention modules are widely used for this task because of their effectiveness in reasoning about the relevance between text and images. In this work, we study how to further improve the quality of such reasoning, which remains an open challenge. Our baseline is the Recursive Visual Attention (RVA) model, which refines vision-text attention by iteratively revisiting the dialog history. Building on top of it, we propose to improve the attention mechanism with contrastive learning: we train a Matching-Aware Attention Kernel (MAAK) by aligning the deep feature embeddings of an image and its caption, which yields better attention scores. Experiments show consistent improvements from MAAK. In addition, we study the effect of Multimodal Compact Bilinear (MCB) pooling as a three-way fusion of the visual, textual, and dialog-history embeddings. We analyze the performance of both methods in the discussion section and propose further ideas for resolving current limitations.
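The abstract does not spell out the training objective behind MAAK. As a rough illustration only, the following PyTorch sketch shows one common way to align image and caption embeddings with a symmetric contrastive (InfoNCE-style) loss, so that the learned projections can serve as a matching-aware attention kernel. All names and dimensions here (MatchingAwareKernel, img_dim=2048, temperature=0.07, and so on) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: contrastive alignment of image and caption
# embeddings, in the spirit of the Matching-Aware Attention Kernel (MAAK).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatchingAwareKernel(nn.Module):
    """Projects image and caption features into a shared space; the dot
    product of the normalized projections serves as an attention score."""
    def __init__(self, img_dim=2048, txt_dim=512, joint_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, joint_dim)
        self.txt_proj = nn.Linear(txt_dim, joint_dim)

    def forward(self, img_feat, txt_feat):
        z_img = F.normalize(self.img_proj(img_feat), dim=-1)
        z_txt = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return z_img, z_txt

def contrastive_loss(z_img, z_txt, temperature=0.07):
    """Symmetric InfoNCE: matched (image, caption) pairs on the diagonal
    are pulled together; in-batch mismatched pairs are pushed apart."""
    logits = z_img @ z_txt.t() / temperature               # (B, B) similarities
    targets = torch.arange(z_img.size(0), device=z_img.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Toy usage with random features standing in for real encoders.
kernel = MatchingAwareKernel()
img = torch.randn(8, 2048)   # e.g. pooled CNN region features
cap = torch.randn(8, 512)    # e.g. caption encoder output
loss = contrastive_loss(*kernel(img, cap))
loss.backward()
```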
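MCB pooling, introduced for visual question answering by Fukui et al. (2016), approximates the outer product of feature vectors by combining Count Sketch projections with elementwise multiplication in the Fourier domain. The sketch below extends this to three inputs by multiplying three sketch spectra, which is one plausible reading of the three-way fusion mentioned above; the dimensions and the exact fusion scheme are assumptions, not the paper's configuration.

```python
# Hypothetical sketch: three-way Multimodal Compact Bilinear (MCB) pooling
# over visual, textual, and dialog-history embeddings.
import torch
import torch.nn as nn

class CountSketch(nn.Module):
    """Fixed random Count Sketch projection of an in_dim vector to dimension d."""
    def __init__(self, in_dim, d):
        super().__init__()
        self.d = d
        self.register_buffer("h", torch.randint(0, d, (in_dim,)))                  # hash buckets
        self.register_buffer("s", torch.randint(0, 2, (in_dim,)).float() * 2 - 1)  # random +/-1 signs

    def forward(self, x):                        # x: (B, in_dim)
        out = x.new_zeros(x.size(0), self.d)
        out.index_add_(1, self.h, x * self.s)    # scatter-add the signed features
        return out

def mcb_fuse(sketches):
    """Elementwise product of the sketch spectra approximates the
    (higher-order) outer product of the original inputs."""
    spec = torch.fft.rfft(sketches[0], dim=-1)
    for sk in sketches[1:]:
        spec = spec * torch.fft.rfft(sk, dim=-1)
    return torch.fft.irfft(spec, n=sketches[0].size(-1), dim=-1)

# Toy usage: fuse visual, textual, and dialog-history features into one vector.
d = 1024
cs_v, cs_t, cs_h = CountSketch(2048, d), CountSketch(512, d), CountSketch(512, d)
v, t, h = torch.randn(4, 2048), torch.randn(4, 512), torch.randn(4, 512)
fused = mcb_fuse([cs_v(v), cs_t(t), cs_h(h)])    # (4, d) joint embedding
```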
