Improving Diversity of Image Captioning Through Variational Autoencoders and Adversarial Learning

Li Ren, Guo-Jun Qi, K. Hua
Published in: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV)
Publication date: 2019-01-01
DOI: 10.1109/WACV.2019.00034 (https://doi.org/10.1109/WACV.2019.00034)
Citations: 3

Abstract

Learning to translate images into human-readable natural language has become a major challenge in computer vision research in recent years. Existing work explores the semantic correlation between the visual and language domains via encoder-to-decoder learning frameworks based on classifying visual features in the language domain. This approach, however, is criticized for its lack of naturalness and diversity. In this paper, we demonstrate a novel way to learn a semantic connection between visual information and natural language directly, based on a Variational Autoencoder (VAE) that is trained in an adversarial routine. Instead of using a classification-based discriminator, our method directly learns to estimate the diversity between a hidden vector embedded by a text encoder and an informative feature sampled from a learned distribution of the autoencoders. We show that the sentences learned from this matching contain accurate semantic meaning with high diversity in the image captioning task. Our experiments on the popular MSCOCO dataset indicate that our method learns to generate high-quality natural language with competitive scores on both correctness and diversity.
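The abstract does not give the model's equations, so the following is only a minimal sketch of the generic VAE machinery it refers to, not the authors' implementation. It assumes a diagonal-Gaussian latent distribution and shows the standard reparameterized sampling step and the closed-form KL term against a standard normal prior. The function names are hypothetical, and `squared_distance` is a plain stand-in for the learned, non-classification estimator that compares a text-encoder vector with a sampled feature; the paper's actual discriminator is not specified here.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def squared_distance(text_code, sampled_feature):
    """Illustrative stand-in (an assumption) for a learned real-valued estimator
    that compares a text-encoder vector with a sampled latent feature."""
    return sum((a - b) ** 2 for a, b in zip(text_code, sampled_feature))


# Hypothetical values: a 3-dimensional latent predicted from an image,
# and a 3-dimensional hidden vector produced by a text encoder.
rng = random.Random(0)
mu, log_var = [0.2, -0.1, 0.5], [-1.0, -1.0, -1.0]
z = sample_latent(mu, log_var, rng)
text_code = [0.1, 0.0, 0.4]
gap = squared_distance(text_code, z)
```

Drawing several values of `z` for the same image is what lets a decoder produce different captions for one input, which is the source of diversity in VAE-style captioning models in general.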