Disentangled Representations and Hierarchical Refinement of Multi-Granularity Features for Text-to-Image Synthesis

Proceedings of the 2022 International Conference on Multimedia Retrieval. Pub Date: 2022-06-27. DOI: 10.1145/3512527.3531389

Pei Dong, L. Wu, Lei Meng, Xiangxu Meng


Citations: 5

Abstract

In this paper, we focus on generating photo-realistic images from given text descriptions. Current methods first generate an initial image and then progressively refine it into a high-resolution one. These methods typically refine all granularity features output by the previous stage indiscriminately. However, each stage differs in its ability to express features of different granularity, and it is difficult to express precise semantics by further refining poor-quality features generated in the previous stage. Because current methods cannot refine features of different granularity independently, it is challenging to clearly express all semantic factors in the generated image, and some features even become worse. To address this issue, we propose Hierarchical Disentangled Representations Generative Adversarial Networks (HDR-GAN), which generate photo-realistic images by explicitly disentangling and individually modeling the semantic factors in the image. HDR-GAN introduces a novel component, a multi-granularity feature disentangled encoder, that represents image information comprehensively by explicitly disentangling multi-granularity features including pose, shape, and texture. Moreover, we develop a novel Multi-granularity Feature Refinement (MFR) containing a Coarse-grained Feature Refinement (CFR) model and a Fine-grained Feature Refinement (FFR) model. CFR utilizes coarse-grained disentangled representations (e.g., pose and shape) to clarify category information, while FFR employs fine-grained disentangled representations (e.g., texture) to reflect instance-level details. Extensive experiments on two well-studied, publicly available datasets (i.e., CUB-200 and CLEVR-SV) demonstrate the rationality and superiority of our method.
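The coarse-to-fine idea in the abstract can be illustrated with a toy sketch. This is not the authors' network: the abstract gives no architectural details, so every name here (`DisentangledFeatures`, `disentangle`, `blend`, `coarse_refine`, `fine_refine`, `generate`) is hypothetical, and simple list arithmetic stands in for learned neural modules. The only point it illustrates is that each refinement stage is conditioned on its own group of factors rather than on all features at once.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class DisentangledFeatures:
    """Hypothetical container for the three factors named in the abstract."""
    pose: Vector
    shape: Vector
    texture: Vector


def disentangle(features: Vector) -> DisentangledFeatures:
    """Toy stand-in for a disentangled encoder: split one vector into three equal chunks."""
    n = len(features) // 3
    return DisentangledFeatures(
        pose=features[:n],
        shape=features[n:2 * n],
        texture=features[2 * n:3 * n],
    )


def blend(image: Vector, condition: Vector, weight: float) -> Vector:
    """Toy refinement step: move every image value toward the mean of the condition."""
    target = sum(condition) / len(condition)
    return [(1 - weight) * value + weight * target for value in image]


def coarse_refine(image: Vector, f: DisentangledFeatures, weight: float = 0.5) -> Vector:
    """CFR stand-in (assumption): conditions only on the coarse factors, pose and shape."""
    return blend(image, f.pose + f.shape, weight)


def fine_refine(image: Vector, f: DisentangledFeatures, weight: float = 0.5) -> Vector:
    """FFR stand-in (assumption): conditions only on the fine factor, texture."""
    return blend(image, f.texture, weight)


def generate(initial_image: Vector, features: Vector) -> Vector:
    """Refine an initial image coarse-first, then fine, using separate factor groups."""
    f = disentangle(features)
    return fine_refine(coarse_refine(initial_image, f), f)
```

For example, `generate([0.0, 0.0], [1, 1, 3, 3, 10, 10])` first pulls the image toward the mean of the pose and shape values, then toward the mean of the texture values. In a real model each stand-in would be a trained generator block, and the order and conditioning of the stages would follow the paper rather than this sketch.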
