{"title":"LLMs Meet Multimodal Generation and Editing: A Survey","authors":"Yingqing He, Zhaoyang Liu, Jingye Chen, Zeyue Tian, Hongyu Liu, Xiaowei Chi, Runtao Liu, Ruibin Yuan, Yazhou Xing, Wenhai Wang, Jifeng Dai, Yong Zhang, Wei Xue, Qifeng Liu, Yike Guo, Qifeng Chen","doi":"arxiv-2405.19334","DOIUrl":null,"url":null,"abstract":"With the recent advancement in large language models (LLMs), there is a\ngrowing interest in combining LLMs with multimodal learning. Previous surveys\nof multimodal large language models (MLLMs) mainly focus on understanding. This\nsurvey elaborates on multimodal generation across different domains, including\nimage, video, 3D, and audio, where we highlight the notable advancements with\nmilestone works in these fields. Specifically, we exhaustively investigate the\nkey technical components behind methods and multimodal datasets utilized in\nthese studies. Moreover, we dig into tool-augmented multimodal agents that can\nuse existing generative models for human-computer interaction. Lastly, we also\ncomprehensively discuss the advancement in AI safety and investigate emerging\napplications as well as future prospects. Our work provides a systematic and\ninsightful overview of multimodal generation, which is expected to advance the\ndevelopment of Artificial Intelligence for Generative Content (AIGC) and world\nmodels. A curated list of all related papers can be found at\nhttps://github.com/YingqingHe/Awesome-LLMs-meet-Multimodal-Generation","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"214 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Sound","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2405.19334","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
With recent advances in large language models (LLMs), there is growing interest in combining LLMs with multimodal learning. Previous surveys of multimodal large language models (MLLMs) focus mainly on understanding. This survey instead elaborates on multimodal generation across different domains, including image, video, 3D, and audio, highlighting milestone works and notable advances in each field. Specifically, we investigate in depth the key technical components behind these methods and the multimodal datasets they use. Moreover, we examine tool-augmented multimodal agents that can use existing generative models for human-computer interaction. Lastly, we discuss advances in AI safety and investigate emerging applications as well as future prospects. Our work provides a systematic and insightful overview of multimodal generation, which is expected to advance the development of Artificial Intelligence for Generative Content (AIGC) and world models. A curated list of all related papers can be found at https://github.com/YingqingHe/Awesome-LLMs-meet-Multimodal-Generation