S2TD: A Tree-Structured Decoder for Image Paragraph Captioning

Yihui Shi, Yun Liu, Fangxiang Feng, Ruifan Li, Zhanyu Ma, Xiaojie Wang
{"title":"S2TD: A Tree-Structured Decoder for Image Paragraph Captioning","authors":"Yihui Shi, Yun Liu, Fangxiang Feng, Ruifan Li, Zhanyu Ma, Xiaojie Wang","doi":"10.1145/3469877.3490585","DOIUrl":null,"url":null,"abstract":"Image paragraph captioning, a task to generate the paragraph description for a given image, usually requires mining and organizing linguistic counterparts from abundant visual clues. Limited by sequential decoding perspective, previous methods have difficulty in organizing the visual clues holistically or capturing the structural nature of linguistic descriptions. In this paper, we propose a novel tree-structured visual paragraph decoder network, called Splitting to Tree Decoder (S2TD) to address this problem. The key idea is to model the paragraph decoding process as a top-down binary tree expansion. S2TD consists of three modules: a split module, a score module, and a word-level RNN. The split module iteratively splits ancestral visual representations into two parts through a gating mechanism. To determine the tree topology, the score module uses cosine similarity to evaluate the nodes splitting. A novel tree structure loss is proposed to enable end-to-end learning. After the tree expansion, the word-level RNN decodes leaf nodes into sentences forming a coherent paragraph. Extensive experiments are conducted on the Stanford benchmark dataset. The experimental results show promising performance of our proposed S2TD.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Multimedia Asia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3469877.3490585","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6

Abstract

Image paragraph captioning, a task to generate the paragraph description for a given image, usually requires mining and organizing linguistic counterparts from abundant visual clues. Limited by sequential decoding perspective, previous methods have difficulty in organizing the visual clues holistically or capturing the structural nature of linguistic descriptions. In this paper, we propose a novel tree-structured visual paragraph decoder network, called Splitting to Tree Decoder (S2TD) to address this problem. The key idea is to model the paragraph decoding process as a top-down binary tree expansion. S2TD consists of three modules: a split module, a score module, and a word-level RNN. The split module iteratively splits ancestral visual representations into two parts through a gating mechanism. To determine the tree topology, the score module uses cosine similarity to evaluate the nodes splitting. A novel tree structure loss is proposed to enable end-to-end learning. After the tree expansion, the word-level RNN decodes leaf nodes into sentences forming a coherent paragraph. Extensive experiments are conducted on the Stanford benchmark dataset. The experimental results show promising performance of our proposed S2TD.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
S2TD:一种用于图像段落标题的树结构解码器
图像段落字幕是一项为给定图像生成段落描述的任务,通常需要从丰富的视觉线索中挖掘和组织语言对应。以往的解码方法受顺序解码视角的限制,难以从整体上组织视觉线索或捕捉语言描述的结构本质。在本文中,我们提出了一种新的树状结构的视觉段落解码器网络,称为拆分到树解码器(S2TD)来解决这个问题。关键思想是将段落解码过程建模为自上而下的二叉树扩展。S2TD由三个模块组成:分割模块、评分模块和词级RNN。split模块通过一个门控机制将祖先的视觉表示迭代地分成两部分。为了确定树的拓扑结构,评分模块使用余弦相似度来评估节点的分裂。提出了一种新的树形结构损失算法实现端到端学习。在树展开后,词级RNN将叶节点解码成句子,形成连贯的段落。在斯坦福基准数据集上进行了广泛的实验。实验结果表明,我们提出的S2TD具有良好的性能。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Multi-Scale Graph Convolutional Network and Dynamic Iterative Class Loss for Ship Segmentation in Remote Sensing Images Structural Knowledge Organization and Transfer for Class-Incremental Learning Hard-Boundary Attention Network for Nuclei Instance Segmentation Score Transformer: Generating Musical Score from Note-level Representation CMRD-Net: An Improved Method for Underwater Image Enhancement
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1