视觉+X：数据背景下的多模态学习调查

IF 18.6 IEEE transactions on pattern analysis and machine intelligence Pub Date : 2024-06-28 DOI:10.1109/TPAMI.2024.3420239

Ye Zhu;Yu Wu;Nicu Sebe;Yan Yan

{"title":"视觉+X：数据背景下的多模态学习调查","authors":"Ye Zhu;Yu Wu;Nicu Sebe;Yan Yan","doi":"10.1109/TPAMI.2024.3420239","DOIUrl":null,"url":null,"abstract":"We are perceiving and communicating with the world in a multisensory manner, where different information sources are sophisticatedly processed and interpreted by separate parts of the human brain to constitute a complex, yet harmonious and unified sensing system. To endow the machines with true intelligence, multimodal machine learning that incorporates data from various sources has become an increasingly popular research area with emerging technical advances in recent years. In this paper, we present a survey on multimodal machine learning from a novel perspective considering not only the purely technical aspects but also the intrinsic nature of different data modalities. We analyze the commonness and uniqueness of each data format mainly ranging from vision, audio, text, and motions, and then present the methodological advancements categorized by the combination of data modalities, such as \n<italic>Vision+Text</i>\n, with slightly inclined emphasis on the visual data. We investigate the existing literature on multimodal learning from both the representation learning and downstream application levels, and provide an additional comparison in the light of their technical connections with the data nature, e.g., the semantic consistency between image objects and textual descriptions, and the rhythm correspondence between video dance moves and musical beats. We hope that the exploitation of the alignment as well as the existing gap between the intrinsic nature of data modality and the technical designs, will benefit future research studies to better address a specific challenge related to the concrete multimodal task, prompting a unified multimodal machine learning framework closer to a real human intelligence system.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"9102-9122"},"PeriodicalIF":18.6000,"publicationDate":"2024-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Vision + X: A Survey on Multimodal Learning in the Light of Data\",\"authors\":\"Ye Zhu;Yu Wu;Nicu Sebe;Yan Yan\",\"doi\":\"10.1109/TPAMI.2024.3420239\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We are perceiving and communicating with the world in a multisensory manner, where different information sources are sophisticatedly processed and interpreted by separate parts of the human brain to constitute a complex, yet harmonious and unified sensing system. To endow the machines with true intelligence, multimodal machine learning that incorporates data from various sources has become an increasingly popular research area with emerging technical advances in recent years. In this paper, we present a survey on multimodal machine learning from a novel perspective considering not only the purely technical aspects but also the intrinsic nature of different data modalities. We analyze the commonness and uniqueness of each data format mainly ranging from vision, audio, text, and motions, and then present the methodological advancements categorized by the combination of data modalities, such as \\n<italic>Vision+Text</i>\\n, with slightly inclined emphasis on the visual data. We investigate the existing literature on multimodal learning from both the representation learning and downstream application levels, and provide an additional comparison in the light of their technical connections with the data nature, e.g., the semantic consistency between image objects and textual descriptions, and the rhythm correspondence between video dance moves and musical beats. We hope that the exploitation of the alignment as well as the existing gap between the intrinsic nature of data modality and the technical designs, will benefit future research studies to better address a specific challenge related to the concrete multimodal task, prompting a unified multimodal machine learning framework closer to a real human intelligence system.\",\"PeriodicalId\":94034,\"journal\":{\"name\":\"IEEE transactions on pattern analysis and machine intelligence\",\"volume\":\"46 12\",\"pages\":\"9102-9122\"},\"PeriodicalIF\":18.6000,\"publicationDate\":\"2024-06-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on pattern analysis and machine intelligence\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10577498/\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10577498/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

我们以多感官的方式感知世界并与之交流，不同的信息源经过人脑不同部分的精密处理和解读，构成了一个复杂而又和谐统一的感知系统。为了赋予机器真正的智能，近年来，随着技术的不断进步，融合各种来源数据的多模态机器学习已成为一个日益热门的研究领域。在本文中，我们从一个新颖的角度对多模态机器学习进行了研究，不仅考虑了纯粹的技术层面，还考虑了不同数据模态的内在本质。我们分析了每种数据格式的共性和独特性，主要包括视觉、音频、文本和动作，然后介绍了按数据模式组合分类的方法论进展，如视觉+文本，并略微偏重于视觉数据。我们从表征学习和下游应用两个层面研究了现有的多模态学习文献，并根据其与数据性质的技术联系进行了补充比较，例如图像对象与文本描述之间的语义一致性，以及视频舞蹈动作与音乐节拍之间的节奏对应。我们希望，利用对齐以及数据模态的内在性质与技术设计之间的现有差距，将有利于未来的研究，从而更好地应对与具体多模态任务相关的特定挑战，促使统一的多模态机器学习框架更接近真实的人类智能系统。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Vision + X: A Survey on Multimodal Learning in the Light of Data

We are perceiving and communicating with the world in a multisensory manner, where different information sources are sophisticatedly processed and interpreted by separate parts of the human brain to constitute a complex, yet harmonious and unified sensing system. To endow the machines with true intelligence, multimodal machine learning that incorporates data from various sources has become an increasingly popular research area with emerging technical advances in recent years. In this paper, we present a survey on multimodal machine learning from a novel perspective considering not only the purely technical aspects but also the intrinsic nature of different data modalities. We analyze the commonness and uniqueness of each data format mainly ranging from vision, audio, text, and motions, and then present the methodological advancements categorized by the combination of data modalities, such as Vision+Text , with slightly inclined emphasis on the visual data. We investigate the existing literature on multimodal learning from both the representation learning and downstream application levels, and provide an additional comparison in the light of their technical connections with the data nature, e.g., the semantic consistency between image objects and textual descriptions, and the rhythm correspondence between video dance moves and musical beats. We hope that the exploitation of the alignment as well as the existing gap between the intrinsic nature of data modality and the technical designs, will benefit future research studies to better address a specific challenge related to the concrete multimodal task, prompting a unified multimodal machine learning framework closer to a real human intelligence system.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IEEE transactions on pattern analysis and machine intelligence

自引率

0.00%

发文量