Text to image synthesis with multi-granularity feature aware enhancement Generative Adversarial Networks

IF 3.5 | Zone 3, Computer Science | Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Computer Vision and Image Understanding | Pub Date: 2024-08-01 | Epub Date: 2024-05-20 | DOI: 10.1016/j.cviu.2024.104042

Pei Dong, Lei Wu, Ruichen Li, Xiangxu Meng, Lei Meng

Computer Vision and Image Understanding, Volume 245, Article 104042. https://www.sciencedirect.com/science/article/pii/S1077314224001231

Citations: 0

Abstract

Synthesizing complex images from text is challenging. Compared with autoregressive and diffusion-based methods, Generative Adversarial Network (GAN)-based methods have significant advantages in computational cost and generation efficiency, yet two limitations remain: first, these methods often refine all features output from the previous stage indiscriminately, without considering that these features are initialized gradually during the generation process; second, the sparse semantic constraints provided by the text description are typically ineffective for refining fine-grained features. These issues complicate the balance between generation quality, computational cost, and inference speed. To address them, we propose a Multi-granularity Feature Aware Enhancement GAN (MFAE-GAN), which allows the refinement process to match the order in which features of different granularity are initialized. Specifically, MFAE-GAN (1) samples category-related coarse-grained features and instance-level, detail-related fine-grained features at different generation stages, using different attention mechanisms in Coarse-grained Feature Enhancement (CFE) and Fine-grained Feature Enhancement (FFE), to guide the generation process spatially; (2) provides denser semantic constraints than textual semantic information through Multi-granularity Features Adaptive Batch Normalization (MFA-BN) when refining fine-grained features; and (3) adopts Global Semantics Preservation (GSP) to avoid losing global semantics when features are sampled continuously. Extensive experimental results demonstrate that MFAE-GAN is competitive in both image generation quality and efficiency.
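The abstract does not say how MFA-BN is computed. As a rough illustration of the general idea behind adaptive (conditional) batch normalization, the sketch below normalizes each channel over the batch and then applies a scale and shift predicted from a conditioning feature vector. It is a minimal pure-Python sketch under stated assumptions, not the paper's formulation: the function names, the flat (batch, channels) layout, and the linear projections `w_gamma` and `w_beta` are all hypothetical.

```python
import math


def batch_norm_stats(x):
    """Per-channel mean and (biased) variance over the batch.

    x is a list of N rows, each a list of C floats.
    """
    n = len(x)
    channels = len(x[0])
    means = [sum(row[c] for row in x) / n for c in range(channels)]
    variances = [
        sum((row[c] - means[c]) ** 2 for row in x) / n for c in range(channels)
    ]
    return means, variances


def linear(v, weight):
    """Project a length-D vector with a D x C weight matrix (no bias)."""
    return [
        sum(v[d] * weight[d][c] for d in range(len(v)))
        for c in range(len(weight[0]))
    ]


def conditional_batch_norm(x, cond, w_gamma, w_beta, eps=1e-5):
    """Normalize x per channel, then modulate it per sample.

    Assumption (not from the paper): the scale offset and the shift are
    linear functions of the conditioning vector of each sample.
    """
    means, variances = batch_norm_stats(x)
    out = []
    for row, cond_row in zip(x, cond):
        gamma = linear(cond_row, w_gamma)  # per-sample scale offset
        beta = linear(cond_row, w_beta)    # per-sample shift
        normed = [
            (row[c] - means[c]) / math.sqrt(variances[c] + eps)
            for c in range(len(row))
        ]
        out.append(
            [(1.0 + gamma[c]) * normed[c] + beta[c] for c in range(len(row))]
        )
    return out


if __name__ == "__main__":
    features = [[1.0, 2.0], [3.0, 6.0]]   # hypothetical batch: 2 samples, 2 channels
    condition = [[1.0], [0.0]]            # hypothetical 1-D conditioning vectors
    print(conditional_batch_norm(features, condition, [[0.5, 0.5]], [[0.1, 0.1]]))
```

In this sketch the conditioning vector stands in for whatever richer features supply the "denser semantic constraints" mentioned above; with zero projection weights the function reduces to plain batch normalization.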


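Source journal

Computer Vision and Image Understanding (Engineering and Technology - Engineering: Electrical and Electronic)

CiteScore: 7.80

Self-citation rate: 4.40%

Articles published: 112

Review time: 79 days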

Journal description: The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views. Research Areas Include: • Theory • Early vision • Data structures and representations • Shape • Range • Motion • Matching and recognition • Architecture and languages • Vision systems