Bypass network for semantics driven image paragraph captioning

IF 4.3 3区 计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Computer Vision and Image Understanding Pub Date : 2024-09-06 DOI:10.1016/j.cviu.2024.104154
Qi Zheng , Chaoyue Wang , Dadong Wang
{"title":"Bypass network for semantics driven image paragraph captioning","authors":"Qi Zheng ,&nbsp;Chaoyue Wang ,&nbsp;Dadong Wang","doi":"10.1016/j.cviu.2024.104154","DOIUrl":null,"url":null,"abstract":"<div><p>Image paragraph captioning aims to describe a given image with a sequence of coherent sentences. Most existing methods model the coherence through the topic transition that dynamically infers a topic vector from preceding sentences. However, these methods still suffer from immediate or delayed repetitions in generated paragraphs because (i) the entanglement of syntax and semantics distracts the topic vector from attending pertinent visual regions; (ii) there are few constraints or rewards for learning long-range transitions. In this paper, we propose a bypass network that separately models semantics and linguistic syntax of preceding sentences. Specifically, the proposed model consists of two main modules, i.e. a topic transition module and a sentence generation module. The former takes previous semantic vectors as queries and applies attention mechanism on regional features to acquire the next topic vector, which reduces immediate repetition by eliminating linguistics. The latter decodes the topic vector and the preceding syntax state to produce the following sentence. To further reduce delayed repetition in generated paragraphs, we devise a replacement-based reward for the REINFORCE training. Comprehensive experiments on the widely used benchmark demonstrate the superiority of the proposed model over the state of the art for coherence while maintaining high accuracy.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104154"},"PeriodicalIF":4.3000,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Vision and Image Understanding","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1077314224002352","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

Image paragraph captioning aims to describe a given image with a sequence of coherent sentences. Most existing methods model the coherence through the topic transition that dynamically infers a topic vector from preceding sentences. However, these methods still suffer from immediate or delayed repetitions in generated paragraphs because (i) the entanglement of syntax and semantics distracts the topic vector from attending pertinent visual regions; (ii) there are few constraints or rewards for learning long-range transitions. In this paper, we propose a bypass network that separately models semantics and linguistic syntax of preceding sentences. Specifically, the proposed model consists of two main modules, i.e. a topic transition module and a sentence generation module. The former takes previous semantic vectors as queries and applies attention mechanism on regional features to acquire the next topic vector, which reduces immediate repetition by eliminating linguistics. The latter decodes the topic vector and the preceding syntax state to produce the following sentence. To further reduce delayed repetition in generated paragraphs, we devise a replacement-based reward for the REINFORCE training. Comprehensive experiments on the widely used benchmark demonstrate the superiority of the proposed model over the state of the art for coherence while maintaining high accuracy.

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
语义驱动的图像段落标题旁路网络
图像段落标题旨在用一连串连贯的句子来描述给定图像。现有的大多数方法都是通过主题转换来建立连贯性模型的,主题转换可以动态地从前面的句子中推断出主题向量。然而,这些方法在生成的段落中仍然存在即时或延迟重复的问题,这是因为:(i) 句法和语义的纠缠分散了主题向量对相关视觉区域的注意力;(ii) 学习长距离转换的约束或奖励很少。在本文中,我们提出了一种旁路网络,它能分别对前一句的语义和语言句法进行建模。具体来说,我们提出的模型由两个主要模块组成,即主题转换模块和句子生成模块。前者将之前的语义向量作为查询,并对区域特征应用注意力机制,以获取下一个主题向量,通过消除语病来减少直接重复。后者对主题向量和前一个句法状态进行解码,生成下一个句子。为了进一步减少生成段落中的延迟重复,我们为 REINFORCE 训练设计了一种基于替换的奖励机制。在广泛使用的基准上进行的综合实验证明,所提出的模型在保持高准确性的同时,在连贯性方面优于现有技术。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
Computer Vision and Image Understanding
Computer Vision and Image Understanding 工程技术-工程:电子与电气
CiteScore
7.80
自引率
4.40%
发文量
112
审稿时长
79 days
期刊介绍: The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views. Research Areas Include: • Theory • Early vision • Data structures and representations • Shape • Range • Motion • Matching and recognition • Architecture and languages • Vision systems
期刊最新文献
Scene-cGAN: A GAN for underwater restoration and scene depth estimation 2S-SGCN: A two-stage stratified graph convolutional network model for facial landmark detection on 3D data Dual stage semantic information based generative adversarial network for image super-resolution Enhancing scene text detectors with realistic text image synthesis using diffusion models Unsupervised co-generation of foreground–background segmentation from Text-to-Image synthesis
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1