RenAIssance: A Survey Into AI Text-to-Image Generation in the Era of Large Model

Fengxiang Bie;Yibo Yang;Zhongzhu Zhou;Adam Ghanem;Minjia Zhang;Zhewei Yao;Xiaoxia Wu;Connor Holmes;Pareesa Golnari;David A. Clifton;Yuxiong He;Dacheng Tao;Shuaiwen Leon Song
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 3, pp. 2212-2231
Impact factor: 18.6
Published: 2024-12-27
DOI: 10.1109/TPAMI.2024.3522305
URL: https://ieeexplore.ieee.org/document/10817489/

Abstract

Text-to-image (TTI) generation refers to models that process text input and generate high-fidelity images from text descriptions. Neural text-to-image generation can be traced back to the emergence of the Generative Adversarial Network (GAN), followed by the autoregressive Transformer. Diffusion models are a prominent type of generative model that produces images through the systematic introduction of noise over repeated steps. Owing to their impressive results on image synthesis, diffusion models have been cemented as the major image decoder used by text-to-image models and have brought text-to-image generation to the forefront of machine-learning (ML) research. In the era of large models, scaling up model size and integrating large language models have further improved the performance of TTI models, yielding generated results that are nearly indistinguishable from real-world images and revolutionizing the way we retrieve images. Our exploratory study leads us to believe that text-to-image models can be scaled further by combining innovative model architectures with prediction enhancement techniques. We divide this survey into five main sections, in which we detail the frameworks of the major literature in order to examine the different types of text-to-image generation methods. We then provide a detailed comparison and critique of these methods and offer possible pathways of improvement for future work. Looking ahead, we argue that TTI development could yield impressive productivity improvements for content creation, particularly in the context of the AIGC era, and could be extended to more complex tasks such as video generation and 3D generation.
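The abstract's description of diffusion models as introducing noise over repeated steps can be illustrated with a minimal sketch of a forward (noising) process. The abstract gives no formulas, so everything here is an assumption for illustration: the DDPM-style update rule, the linear variance schedule, its default values, and all function names. The sketch covers only the noising direction; the learned network that reverses it to generate an image is not shown.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, linearly spaced (assumed schedule and defaults)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_signal(betas):
    """alpha_bar_t = product of (1 - beta_s) for s <= t: signal variance kept after t steps."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise_stepwise(pixels, betas, rng):
    """Repeat the noising step: x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps."""
    x = list(pixels)
    for beta in betas:
        x = [
            math.sqrt(1.0 - beta) * value + math.sqrt(beta) * rng.gauss(0.0, 1.0)
            for value in x
        ]
    return x


def add_noise_closed_form(pixels, alpha_bar_t, rng):
    """Jump directly to step t: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    signal_scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    return [signal_scale * value + noise_scale * rng.gauss(0.0, 1.0) for value in pixels]


# Toy usage: a four-value "image" noised over 1000 steps (hypothetical numbers).
rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_signal(betas)
image = [0.5, -0.25, 1.0, 0.0]
noisy = add_noise_stepwise(image, betas, rng)
halfway = add_noise_closed_form(image, alpha_bars[499], rng)
```

Under these assumptions the retained signal fraction shrinks at every step, so after enough steps the sample is close to pure Gaussian noise. A generative model in this family is trained to undo the steps one at a time, conditioned on the text prompt in the TTI setting.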