RenAIssance: A Survey Into AI Text-to-Image Generation in the Era of Large Model

IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 18.6) | Pub Date: 2024-12-27 | DOI: 10.1109/TPAMI.2024.3522305

Fengxiang Bie; Yibo Yang; Zhongzhu Zhou; Adam Ghanem; Minjia Zhang; Zhewei Yao; Xiaoxia Wu; Connor Holmes; Pareesa Golnari; David A. Clifton; Yuxiong He; Dacheng Tao; Shuaiwen Leon Song

Volume 47, Issue 3, pages 2212-2231. Full text: https://ieeexplore.ieee.org/document/10817489/

Citation count: 0

Abstract

Text-to-image (TTI) generation refers to the use of models that process text input and generate high-fidelity images from text descriptions. Neural text-to-image generation can be traced back to the emergence of the Generative Adversarial Network (GAN), followed by the autoregressive Transformer. Diffusion models are a prominent type of generative model that produce images through the systematic introduction of noise over repeated steps. Owing to their impressive results on image synthesis, diffusion models have been cemented as the major image decoder used by text-to-image models and have brought text-to-image generation to the forefront of machine learning (ML) research. In the era of large models, scaling up model size and integrating large language models have further improved the performance of TTI models, producing results nearly indistinguishable from real-world images and revolutionizing the way we retrieve images. Our exploratory study leads us to believe that text-to-image models can be scaled further by combining innovative model architectures with prediction enhancement techniques. We divide this survey into five main sections, in which we detail the frameworks of the major literature in order to examine the different types of text-to-image generation methods. We then provide a detailed comparison and critique of these methods and offer possible pathways of improvement for future work. Looking ahead, we argue that TTI development could yield impressive productivity improvements for creative work, particularly in the context of the AIGC era, and could be extended to more complex tasks such as video generation and 3D generation.
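
The abstract describes diffusion models only as introducing noise over repeated steps. As a rough illustration, the sketch below shows the forward noising process in its common DDPM form, which is an assumption here, since the abstract does not specify a formulation. The function names, the linear schedule, and its default values are hypothetical choices for illustration, and the learned reverse (denoising) process that actually generates images is not shown.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return per-step noise variances (illustrative defaults, not from the survey)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_signal_rates(betas):
    """alpha_bar_t is the running product of (1 - beta_s) for all s <= t."""
    rates, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        rates.append(running)
    return rates


def add_noise(pixels, t, alpha_bars, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    return [signal * p + noise * rng.gauss(0.0, 1.0) for p in pixels]


rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_signal_rates(betas)

image = [0.5, -0.25, 1.0, 0.0]  # a toy "image" of four pixel values
slightly_noisy = add_noise(image, 0, alpha_bars, rng)       # early step: mostly signal
almost_pure_noise = add_noise(image, 999, alpha_bars, rng)  # final step: mostly noise
```

Under these assumptions the signal weight shrinks toward zero as the step index grows, so the last step is close to pure Gaussian noise; a generative model is then trained to undo this corruption one step at a time.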




