Transformer-based Image Generation from Scene Graphs

Renato Sortino, S. Palazzo, C. Spampinato
{"title":"Transformer-based Image Generation from Scene Graphs","authors":"Renato Sortino, S. Palazzo, C. Spampinato","doi":"10.48550/arXiv.2303.04634","DOIUrl":null,"url":null,"abstract":"Graph-structured scene descriptions can be efficiently used in generative models to control the composition of the generated image. Previous approaches are based on the combination of graph convolutional networks and adversarial methods for layout prediction and image generation, respectively. In this work, we show how employing multi-head attention to encode the graph information, as well as using a transformer-based model in the latent space for image generation can improve the quality of the sampled data, without the need to employ adversarial models with the subsequent advantage in terms of training stability. The proposed approach, specifically, is entirely based on transformer architectures both for encoding scene graphs into intermediate object layouts and for decoding these layouts into images, passing through a lower dimensional space learned by a vector-quantized variational autoencoder. Our approach shows an improved image quality with respect to state-of-the-art methods as well as a higher degree of diversity among multiple generations from the same scene graph. We evaluate our approach on three public datasets: Visual Genome, COCO, and CLEVR. We achieve an Inception Score of 13.7 and 12.8, and an FID of 52.3 and 60.3, on COCO and Visual Genome, respectively. We perform ablation studies on our contributions to assess the impact of each component. Code is available at https://github.com/perceivelab/trf-sg2im","PeriodicalId":10549,"journal":{"name":"Comput. Vis. Image Underst.","volume":"37 1","pages":"103721"},"PeriodicalIF":0.0000,"publicationDate":"2023-03-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Comput. Vis. Image Underst.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2303.04634","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

Graph-structured scene descriptions can be used effectively in generative models to control the composition of the generated image. Previous approaches combine graph convolutional networks for layout prediction with adversarial methods for image generation. In this work, we show that employing multi-head attention to encode the graph information, together with a transformer-based model operating in the latent space for image generation, improves the quality of the sampled data without requiring adversarial models, with the consequent advantage in training stability. Specifically, the proposed approach is entirely based on transformer architectures, both for encoding scene graphs into intermediate object layouts and for decoding these layouts into images, passing through a lower-dimensional space learned by a vector-quantized variational autoencoder. Our approach yields improved image quality with respect to state-of-the-art methods, as well as a higher degree of diversity among multiple generations from the same scene graph. We evaluate our approach on three public datasets: Visual Genome, COCO, and CLEVR. We achieve an Inception Score of 13.7 and 12.8, and an FID of 52.3 and 60.3, on COCO and Visual Genome, respectively. We perform ablation studies on our contributions to assess the impact of each component. Code is available at https://github.com/perceivelab/trf-sg2im
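As a rough illustration of the pipeline the abstract describes, the sketch below (PyTorch) encodes scene-graph tokens with standard multi-head self-attention and regresses a bounding-box layout per object. This is a minimal sketch under assumed module names and dimensions, not the authors' implementation; for that, see the linked repository. In the full method, the predicted layout conditions an autoregressive transformer over discrete codes learned by a vector-quantized variational autoencoder (VQ-VAE), whose decoder produces the final image.

    # Minimal sketch (assumption, not the authors' code) of the first stage:
    # encoding scene-graph triplets with multi-head attention and predicting
    # per-object layouts. All names, sizes, and head counts are illustrative.
    import torch
    import torch.nn as nn

    class SceneGraphEncoder(nn.Module):
        def __init__(self, num_objects, num_predicates, dim=256, heads=8, layers=4):
            super().__init__()
            self.obj_emb = nn.Embedding(num_objects, dim)      # object-category tokens
            self.pred_emb = nn.Embedding(num_predicates, dim)  # relation tokens
            enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                   batch_first=True)
            self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
            self.to_box = nn.Linear(dim, 4)  # (x, y, w, h) per object

        def forward(self, objs, preds):
            # objs: (B, N) object-category ids; preds: (B, M) predicate ids.
            # Objects and relations exchange information through self-attention
            # over the concatenated token sequence.
            tokens = torch.cat([self.obj_emb(objs), self.pred_emb(preds)], dim=1)
            h = self.encoder(tokens)
            obj_h = h[:, : objs.size(1)]          # keep the object positions
            boxes = self.to_box(obj_h).sigmoid()  # normalized layout boxes
            return obj_h, boxes

    # In the full pipeline, the predicted layout would condition an
    # autoregressive transformer that generates discrete VQ-VAE codes,
    # which a frozen VQ-VAE decoder maps back to pixels.
    enc = SceneGraphEncoder(num_objects=200, num_predicates=50)
    objs = torch.randint(0, 200, (1, 5))
    preds = torch.randint(0, 50, (1, 4))
    feat, boxes = enc(objs, preds)
    print(boxes.shape)  # torch.Size([1, 5, 4])

Note that a faithful implementation would also exploit the graph's edge structure (e.g., restricting attention to connected subject-predicate-object triplets) rather than using unrestricted self-attention as this simplified sketch does.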