Visual language integration: A survey and open challenges

Computer Science Review (IF 12.7; Zone 1, Computer Science; JCR Q1, Computer Science, Information Systems), vol. 48, Article 100548. Pub Date: 2023-05-01. DOI: 10.1016/j.cosrev.2023.100548

Sang-Min Park, Young-Gab Kim


Citations: 1

Abstract

With the recent development of deep learning technology, artificial intelligence (AI) models have come into wide use in various domains. AI performs well on definite-purpose tasks, such as image recognition and text classification. Recognition performance on individual tasks now exceeds that of feature-engineering approaches, enabling work that could not be done before. In addition, with the development of generation technology (e.g., GPT-3), AI models are showing stable performance in individual recognition and generation tasks. However, few studies have focused on how to integrate these models efficiently to achieve comprehensive human interaction. Each model grows in size as its performance improves, and consequently requires more computing power and more complicated designs to train than before. This growth increases the complexity of each model and the amount of paired data it requires, making model integration difficult. This study surveys visual language integration, using a hierarchical approach to review recent research trends in AI models that serve as interaction components. We also compare the strengths of existing AI models and integration approaches and the limitations they face. Furthermore, we discuss current related issues and the research needed for visual language integration. More specifically, we identify four aspects of visual language integration models: multimodal learning, multi-task learning, end-to-end learning, and embodiment for embodied visual language interaction. Finally, we discuss some current open issues and challenges and conclude our survey by giving possible future directions.




Source journal

Computer Science Review (Computer Science: General Computer Science)

CiteScore

32.70

Self-citation rate

0.00%

Articles published

Review time

51 days

About the journal: Computer Science Review, a publication dedicated to research surveys and expository overviews of open problems in computer science, targets a broad audience within the field seeking comprehensive insights into the latest developments. The journal welcomes articles from various fields as long as their content impacts the advancement of computer science. In particular, articles that review the application of well-known Computer Science methods to other areas are in scope only if these articles advance the fundamental understanding of those methods.