Foundation Models Defining a New Era in Vision: A Survey and Outlook

Muhammad Awais;Muzammal Naseer;Salman Khan;Rao Muhammad Anwer;Hisham Cholakkal;Mubarak Shah;Ming-Hsuan Yang;Fahad Shahbaz Khan
{"title":"Foundation Models Defining a New Era in Vision: A Survey and Outlook","authors":"Muhammad Awais;Muzammal Naseer;Salman Khan;Rao Muhammad Anwer;Hisham Cholakkal;Mubarak Shah;Ming-Hsuan Yang;Fahad Shahbaz Khan","doi":"10.1109/TPAMI.2024.3506283","DOIUrl":null,"url":null,"abstract":"Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world. The complex relations between objects and their locations, ambiguities, and variations in the real-world environment can be better described in human language, naturally governed by grammatical rules and other modalities such as audio and depth. The models learned to bridge the gap between such modalities and large-scale training data facilitate contextual reasoning, generalization, and prompt capabilities at test time. These models are referred to as <italic>foundation models</i>. The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene or manipulating the robot's behavior through language instructions. In this survey, we provide a comprehensive review of such emerging foundation models, including typical architecture designs to combine different modalities (vision, text, audio, etc.), training objectives (contrastive, generative), pre-training datasets, fine-tuning mechanisms, and the common prompting patterns; textual, visual, and heterogeneous. We discuss the open challenges and research directions for foundation models in computer vision, including difficulties in their evaluations and benchmarking, gaps in their real-world understanding, limitations of contextual understanding, biases, vulnerability to adversarial attacks, and interpretability issues. We review recent developments in this field, covering a wide range of applications of foundation models systematically and comprehensively.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"2245-2264"},"PeriodicalIF":18.6000,"publicationDate":"2025-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10834497/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world. The complex relations between objects and their locations, ambiguities, and variations in real-world environments can be better described in human language, which is naturally governed by grammatical rules, and in other modalities such as audio and depth. Models trained on large-scale data to bridge the gap between such modalities facilitate contextual reasoning, generalization, and prompting capabilities at test time. These models are referred to as foundation models. The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, holding interactive dialogues by asking questions about an image or video scene, or steering a robot's behavior through language instructions. In this survey, we provide a comprehensive review of such emerging foundation models, including typical architecture designs that combine different modalities (vision, text, audio, etc.), training objectives (contrastive, generative), pre-training datasets, fine-tuning mechanisms, and the common prompting patterns: textual, visual, and heterogeneous. We discuss open challenges and research directions for foundation models in computer vision, including difficulties in their evaluation and benchmarking, gaps in their real-world understanding, limitations of contextual understanding, biases, vulnerability to adversarial attacks, and interpretability issues. We review recent developments in this field, systematically and comprehensively covering a wide range of applications of foundation models.
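
To make the contrastive training objective mentioned in the abstract concrete, the sketch below implements a CLIP-style symmetric InfoNCE loss over paired image and text embeddings. This is a minimal illustration under assumed conventions, not code from the survey: the function name, embedding dimension, and temperature value are all illustrative choices.

```python
# Minimal sketch of a CLIP-style contrastive (InfoNCE) objective,
# one of the training objectives surveyed. All names and values here
# are illustrative assumptions, not the API of any specific model.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss for a batch of paired image/text embeddings."""
    # L2-normalize so that the dot product equals cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity logits between every image and every text, scaled
    # by the temperature; matching pairs lie on the diagonal.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Usage: embeddings would come from any pair of encoders (e.g., a vision
# Transformer and a text Transformer); random tensors stand in here.
img = torch.randn(8, 512)   # batch of 8 image embeddings
txt = torch.randn(8, 512)   # batch of 8 matching text embeddings
print(contrastive_loss(img, txt))
```

Training on this objective pulls matched image-text pairs together and pushes mismatched pairs apart in a shared embedding space, which is what enables the zero-shot, prompt-driven behavior at test time that the abstract describes.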