Image Synthesis from Locally Related Texts

Tianrui Niu, Fangxiang Feng, Lingxuan Li, Xiaojie Wang
{"title":"Image Synthesis from Locally Related Texts","authors":"Tianrui Niu, Fangxiang Feng, Lingxuan Li, Xiaojie Wang","doi":"10.1145/3372278.3390684","DOIUrl":null,"url":null,"abstract":"Text-to-image synthesis refers to generating photo-realistic images from text descriptions. Recent works focus on generating images with complex scenes and multiple objects. However, the text inputs to these models are the only captions that always describe the most apparent object or feature of the image and detailed information (e.g. visual attributes) for regions and objects are often missing. Quantitative evaluation of generation performances is still an unsolved problem, where traditional image classification- or retrieval-based metrics fail at evaluating complex images. To address these problems, we propose to generate images conditioned on locally-related texts, i.e., descriptions of local image regions or objects instead of the whole image. Specifically, questions and answers (QAs) are chosen as locally-related texts, which makes it possible to use VQA accuracy as a new evaluation metric. The intuition is simple: higher image quality and image-text consistency (both globally and locally) can help a VQA model answer questions more correctly. We purposed VQA-GAN model with three key modules: hierarchical QA encoder, QA-conditional GAN and external VQA loss. These modules help leverage the new inputs effectively. Thorough experiments on two public VQA datasets demonstrate the effectiveness of the model and the newly proposed metric.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"24 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2020 International Conference on Multimedia Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3372278.3390684","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 10

Abstract

Text-to-image synthesis refers to generating photo-realistic images from text descriptions. Recent work focuses on generating images with complex scenes and multiple objects. However, the text inputs to these models are only captions, which usually describe the most apparent object or feature of an image, so detailed information (e.g., visual attributes) about regions and objects is often missing. Quantitative evaluation of generation performance also remains an unsolved problem: traditional image-classification- or retrieval-based metrics fail at evaluating complex images. To address these problems, we propose to generate images conditioned on locally related texts, i.e., descriptions of local image regions or objects rather than of the whole image. Specifically, questions and answers (QAs) are chosen as the locally related texts, which makes it possible to use VQA accuracy as a new evaluation metric. The intuition is simple: higher image quality and stronger image-text consistency (both global and local) help a VQA model answer questions more correctly. We propose the VQA-GAN model with three key modules: a hierarchical QA encoder, a QA-conditional GAN, and an external VQA loss, which together leverage the new inputs effectively. Thorough experiments on two public VQA datasets demonstrate the effectiveness of the model and the newly proposed metric.
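The evaluation idea is the easiest part to make concrete. Below is a minimal sketch of the proposed VQA-accuracy metric, assuming a generator conditioned on QA pairs and a pretrained VQA model; the names `generator`, `vqa_model`, and `predict` are hypothetical placeholders for illustration, not the paper's actual interfaces.

```python
# Hedged sketch of VQA accuracy as a generation metric.
# Assumptions (not from the paper's code): `generator(qa_pairs)` returns an
# image synthesized from a list of (question, answer) pairs, and
# `vqa_model.predict(image, question)` returns an answer string from a
# pretrained VQA model evaluated on that image.

def vqa_accuracy(generator, vqa_model, qa_dataset):
    """Fraction of conditioning questions a pretrained VQA model
    answers correctly on the generated images."""
    correct, total = 0, 0
    for qa_pairs in qa_dataset:          # each item: list of (question, answer)
        image = generator(qa_pairs)      # image conditioned on local QA texts
        for question, answer in qa_pairs:
            prediction = vqa_model.predict(image, question)
            correct += int(prediction == answer)
            total += 1
    return correct / total if total else 0.0
```

The same pretrained VQA network can also act as a training signal, which is presumably the role of the "external VQA loss" module. A hedged PyTorch sketch under that assumption, again with hypothetical interfaces: a frozen VQA model scores each generated image against its conditioning question, and the generator is penalized when the ground-truth answer is not recovered.

```python
import torch.nn.functional as F

def external_vqa_loss(frozen_vqa, fake_images, question_tokens, answer_labels):
    # `frozen_vqa` is assumed to be a pretrained VQA network with gradients
    # disabled; it returns logits over the answer vocabulary of shape
    # (batch, num_answers).
    logits = frozen_vqa(fake_images, question_tokens)
    # Cross-entropy against the ground-truth answers pushes the generator
    # toward images that are locally consistent with the QA texts.
    return F.cross_entropy(logits, answer_labels)
```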