Partial Off-policy Learning: Balance Accuracy and Diversity for Human-Oriented Image Captioning

Jiahe Shi, Yali Li, Shengjin Wang
{"title":"Partial Off-policy Learning: Balance Accuracy and Diversity for Human-Oriented Image Captioning","authors":"Jiahe Shi, Yali Li, Shengjin Wang","doi":"10.1109/ICCV48922.2021.00219","DOIUrl":null,"url":null,"abstract":"Human-oriented image captioning with both high diversity and accuracy is a challenging task in vision+language modeling. The reinforcement learning (RL) based frameworks promote the accuracy of image captioning, yet seriously hurt the diversity. In contrast, other methods based on variational auto-encoder (VAE) or generative adversarial network (GAN) can produce diverse yet less accurate captions. In this work, we devote our attention to promote the diversity of RL-based image captioning. To be specific, we devise a partial off-policy learning scheme to balance accuracy and diversity. First, we keep the model exposed to varied candidate captions by sampling from the initial state before RL launched. Second, a novel criterion named max-CIDEr is proposed to serve as the reward for promoting diversity. We combine the above-mentioned offpolicy strategy with the on-policy one to moderate the exploration effect, further balancing the diversity and accuracy for human-like image captioning. 
Experiments show that our method locates the closest to human performance in the diversity-accuracy space, and achieves the highest Pearson correlation as 0.337 with human performance.","PeriodicalId":6820,"journal":{"name":"2021 IEEE/CVF International Conference on Computer Vision (ICCV)","volume":"42 1","pages":"2167-2176"},"PeriodicalIF":0.0000,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE/CVF International Conference on Computer Vision (ICCV)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCV48922.2021.00219","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

Human-oriented image captioning with both high diversity and high accuracy is a challenging task in vision-and-language modeling. Reinforcement learning (RL) based frameworks improve the accuracy of image captioning but seriously hurt its diversity. In contrast, methods based on variational auto-encoders (VAEs) or generative adversarial networks (GANs) produce diverse yet less accurate captions. In this work, we focus on promoting the diversity of RL-based image captioning. Specifically, we devise a partial off-policy learning scheme to balance accuracy and diversity. First, we keep the model exposed to varied candidate captions by sampling from its initial state, before RL is launched. Second, a novel criterion named max-CIDEr is proposed to serve as a diversity-promoting reward. We combine this off-policy strategy with an on-policy one to moderate the exploration effect, further balancing diversity and accuracy for human-like image captioning. Experiments show that our method lies closest to human performance in the diversity-accuracy space and achieves the highest Pearson correlation with human performance, at 0.337.
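The abstract describes mixing an off-policy policy-gradient term (captions sampled from the frozen pre-RL model) with the usual on-policy term. A minimal sketch of such a mixture, in the style of SCST-like REINFORCE training, is below. The function name, the `mix` weight, and the use of a single scalar baseline are illustrative assumptions, not the paper's exact formulation; the rewards would be CIDEr or the proposed max-CIDEr scores.

```python
def partial_offpolicy_loss(logp_on, reward_on, logp_off, reward_off,
                           baseline, mix=0.5):
    """Policy-gradient surrogate loss mixing on-policy caption samples
    (drawn from the current model) with off-policy samples (drawn from
    the frozen initial-state model).

    logp_on / logp_off   : per-caption summed log-probabilities (lists)
    reward_on / reward_off: per-caption rewards, e.g. CIDEr scores (lists)
    baseline             : scalar reward baseline (e.g. greedy-decode reward)
    mix                  : weight on the off-policy term (hypothetical)
    """
    def pg_term(logps, rewards):
        # REINFORCE surrogate: average of -(reward - baseline) * log p(caption)
        return -sum((r - baseline) * lp
                    for lp, r in zip(logps, rewards)) / len(logps)

    return (1.0 - mix) * pg_term(logp_on, reward_on) \
         + mix * pg_term(logp_off, reward_off)
```

With `mix=0` this reduces to standard on-policy SCST-style training; increasing `mix` shifts weight toward captions the initial model could produce, which is one plausible way to keep exploration (and hence diversity) from collapsing.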