{"title":"多模态数据融合与量子启发","authors":"Qiuchi Li","doi":"10.1145/3331184.3331419","DOIUrl":null,"url":null,"abstract":"Language understanding is multimodal. During human communication, messages are conveyed not only by words in textual form, but also through speech patterns, gestures or facial emotions of the speakers. Therefore, it is crucial to fuse information from different modalities to achieve a joint comprehension. With the rapid progress in the deep learning field, neural networks have emerged as the most popular approach for addressing multimodal data fusion [1, 6, 7, 12]. While these models can effectively combine multimodal features by learning from data, they nevertheless lack an explicit exhibition of how different modalities are related to each other, due to the inherent low interpretability of neural networks [2]. In the meantime, Quantum Theory (QT) has given rise to principled approaches for incorporating interactions between textual features into a holistic textual representation [3, 5, 8, 10], where the concepts of superposition andentanglement have been universally exploited to formulate interactions. The advantages of those models in capturing complicated correlations between textual features have been observed. We hereby propose the research on quantum-inspired multimodal data fusion, claiming that the limitation of multimodal data fusion can be tackled by quantum-driven models. In particular, we propose to employ superposition to formulate intra-modal interactions while the interplay between different modalities is expected to be captured by entanglement measures. By doing so, the interactions within multimodal data may be rendered explicitly in a unified quantum formalism, increasing the performance and interpretability for concrete multimodal tasks. It will also expand the application domains of quantum theory to multimodal tasks where only preliminary efforts have been made [11]. We therefore aim at answering the following research question: RQ. Can we fuse multimodal data with quantum-inspired models? To answer this question, we propose to fuse multimodal data with complex-valued neural networks, motivated by the theoretical link between neural networks and quantum theory [4] and advances in complex-valued neural networks [9]. Our model begins with a separate complex-valued embedding learned for each unimodal data based on the existing works [5, 10] which inherently assumes superposition between intra-modal features. Then we construct a many-body system in entangled state for multimodal data, where cross-modality interactions are naturally reflected by entanglement measures. Quantum measurement operators are applied to the entanglement state to address a concrete multimodal task at hand. The whole process is instrumented by a complex-valued neural network, which is able to learn how multimodal features are combined from data, and at the same time explain the combination by means of quantum superposition and entanglement measures. We plan to examine our proposed models on CMU-MOSI [12] and CMU-MOSEI [1] which are benchmarking multimodal sentiment analysis datasets. The dataset targets at classifying sentiment into 2, 5 or 7 classes with the input of textual, visual and acoustic features. 
We expect to see comparable effectiveness to state-of-the-art models, and we will explore superposition and entanglement measures to better understand the inter-modal interactions.","PeriodicalId":20700,"journal":{"name":"Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"21 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2019-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Multimodal Data Fusion with Quantum Inspiration\",\"authors\":\"Qiuchi Li\",\"doi\":\"10.1145/3331184.3331419\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Language understanding is multimodal. During human communication, messages are conveyed not only by words in textual form, but also through speech patterns, gestures or facial emotions of the speakers. Therefore, it is crucial to fuse information from different modalities to achieve a joint comprehension. With the rapid progress in the deep learning field, neural networks have emerged as the most popular approach for addressing multimodal data fusion [1, 6, 7, 12]. While these models can effectively combine multimodal features by learning from data, they nevertheless lack an explicit exhibition of how different modalities are related to each other, due to the inherent low interpretability of neural networks [2]. In the meantime, Quantum Theory (QT) has given rise to principled approaches for incorporating interactions between textual features into a holistic textual representation [3, 5, 8, 10], where the concepts of superposition andentanglement have been universally exploited to formulate interactions. The advantages of those models in capturing complicated correlations between textual features have been observed. We hereby propose the research on quantum-inspired multimodal data fusion, claiming that the limitation of multimodal data fusion can be tackled by quantum-driven models. In particular, we propose to employ superposition to formulate intra-modal interactions while the interplay between different modalities is expected to be captured by entanglement measures. By doing so, the interactions within multimodal data may be rendered explicitly in a unified quantum formalism, increasing the performance and interpretability for concrete multimodal tasks. It will also expand the application domains of quantum theory to multimodal tasks where only preliminary efforts have been made [11]. We therefore aim at answering the following research question: RQ. Can we fuse multimodal data with quantum-inspired models? To answer this question, we propose to fuse multimodal data with complex-valued neural networks, motivated by the theoretical link between neural networks and quantum theory [4] and advances in complex-valued neural networks [9]. Our model begins with a separate complex-valued embedding learned for each unimodal data based on the existing works [5, 10] which inherently assumes superposition between intra-modal features. Then we construct a many-body system in entangled state for multimodal data, where cross-modality interactions are naturally reflected by entanglement measures. Quantum measurement operators are applied to the entanglement state to address a concrete multimodal task at hand. 
The whole process is instrumented by a complex-valued neural network, which is able to learn how multimodal features are combined from data, and at the same time explain the combination by means of quantum superposition and entanglement measures. We plan to examine our proposed models on CMU-MOSI [12] and CMU-MOSEI [1] which are benchmarking multimodal sentiment analysis datasets. The dataset targets at classifying sentiment into 2, 5 or 7 classes with the input of textual, visual and acoustic features. We expect to see comparable effectiveness to state-of-the-art models, and we will explore superposition and entanglement measures to better understand the inter-modal interactions.\",\"PeriodicalId\":20700,\"journal\":{\"name\":\"Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval\",\"volume\":\"21 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-07-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3331184.3331419\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3331184.3331419","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

Language understanding is multimodal. During human communication, messages are conveyed not only by words in textual form, but also through the speech patterns, gestures and facial expressions of the speakers. It is therefore crucial to fuse information from different modalities to achieve a joint comprehension. With the rapid progress of deep learning, neural networks have emerged as the most popular approach to multimodal data fusion [1, 6, 7, 12]. While these models can effectively combine multimodal features by learning from data, they do not show explicitly how the different modalities relate to each other, due to the inherently low interpretability of neural networks [2]. In the meantime, Quantum Theory (QT) has given rise to principled approaches for incorporating interactions between textual features into a holistic textual representation [3, 5, 8, 10], where the concepts of superposition and entanglement have been widely exploited to formulate such interactions. These models have been observed to capture complicated correlations between textual features well.

We therefore propose research on quantum-inspired multimodal data fusion, claiming that this limitation of multimodal data fusion can be tackled by quantum-driven models. In particular, we propose to use superposition to formulate intra-modal interactions, while the interplay between different modalities is expected to be captured by entanglement measures. In this way, the interactions within multimodal data can be rendered explicitly in a unified quantum formalism, improving both performance and interpretability on concrete multimodal tasks. It would also expand the application of quantum theory to multimodal tasks, where only preliminary efforts have been made [11]. We thus aim to answer the following research question: RQ. Can we fuse multimodal data with quantum-inspired models?

To answer this question, we propose to fuse multimodal data with complex-valued neural networks, motivated by the theoretical link between neural networks and quantum theory [4] and by advances in complex-valued neural networks [9]. Our model begins with a separate complex-valued embedding learned for each modality, following existing works [5, 10], which inherently assumes superposition among intra-modal features. We then construct a many-body system in an entangled state for the multimodal data, so that cross-modal interactions are naturally reflected by entanglement measures. Quantum measurement operators are applied to the entangled state to address the concrete multimodal task at hand. The whole process is implemented by a complex-valued neural network, which learns from data how multimodal features are combined and, at the same time, explains the combination in terms of quantum superposition and entanglement measures.

We plan to evaluate the proposed models on CMU-MOSI [12] and CMU-MOSEI [1], two benchmark multimodal sentiment analysis datasets in which sentiment is classified into 2, 5 or 7 classes from textual, visual and acoustic features. We expect effectiveness comparable to state-of-the-art models, and we will explore superposition and entanglement measures to better understand the inter-modal interactions.
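To make the proposed formulation more concrete, the following is a minimal NumPy sketch, not the authors' implementation, of how complex-valued unimodal states can be superposed into a joint bimodal state, and how an entanglement measure (the von Neumann entropy of a reduced density matrix) can quantify cross-modal interaction. The two-modality setting, the dimensions and all variable names are illustrative assumptions.

```python
# Hedged sketch of quantum-inspired bimodal fusion: superposition of product states
# plus an entanglement measure. Illustrative only; dimensions and names are assumed.
import numpy as np

def to_state(z):
    """L2-normalize a complex vector so it can be read as a quantum state |psi>."""
    return z / np.linalg.norm(z)

def joint_state(text_states, audio_states, weights):
    """Superpose per-time-step product states |t_i> (x) |a_i> into one bimodal state.

    A single tensor product is a separable product state; a weighted superposition of
    several product states is, in general, entangled, which is how cross-modal
    interaction enters the representation.
    """
    psi = sum(w * np.kron(t, a) for w, t, a in zip(weights, text_states, audio_states))
    return psi / np.linalg.norm(psi)

def entanglement_entropy(psi, dim_text, dim_audio):
    """Von Neumann entropy of the reduced (text-side) density matrix.

    Zero for a pure product state (no cross-modal entanglement); larger values
    indicate stronger inter-modal interaction.
    """
    rho = np.outer(psi, psi.conj())                 # joint density matrix |psi><psi|
    rho = rho.reshape(dim_text, dim_audio, dim_text, dim_audio)
    rho_text = np.trace(rho, axis1=1, axis2=3)      # partial trace over the audio part
    eigvals = np.linalg.eigvalsh(rho_text).clip(min=1e-12)
    return float(-(eigvals * np.log2(eigvals)).sum())

# Toy example: 3 time steps, 4-dim text states, 2-dim acoustic states.
rng = np.random.default_rng(0)
texts = [to_state(rng.normal(size=4) + 1j * rng.normal(size=4)) for _ in range(3)]
audios = [to_state(rng.normal(size=2) + 1j * rng.normal(size=2)) for _ in range(3)]
weights = np.array([0.5, 0.3, 0.2])

psi = joint_state(texts, audios, weights)
print("entanglement entropy:", entanglement_entropy(psi, 4, 2))
```

In a trainable model, the unimodal states and the superposition weights would be produced by complex-valued network layers rather than sampled at random; the entropy above is the kind of entanglement measure the abstract refers to when speaking of interpreting inter-modal interactions.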
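The measurement step can be sketched in the same spirit: unit-norm learned states act as rank-1 measurement operators |m_k><m_k|, and the resulting probabilities serve as interpretable, non-negative features for a sentiment classifier. This is a hypothetical illustration; the number of operators and the linear read-out are assumptions, not the paper's actual architecture.

```python
# Hedged sketch of quantum measurement for classification: probabilities from learned
# rank-1 projectors feed a linear read-out. Illustrative only.
import numpy as np

def measure(psi, measurement_states):
    """Return p_k = |<m_k|psi>|^2 for each unit-norm measurement state |m_k>."""
    return np.array([np.abs(np.vdot(m, psi)) ** 2 for m in measurement_states])

# Toy usage: an 8-dim joint bimodal state (text (x) acoustic), 5 measurement
# operators, and a stand-in read-out to, e.g., 2 sentiment classes.
rng = np.random.default_rng(1)
psi = rng.normal(size=8) + 1j * rng.normal(size=8)
psi /= np.linalg.norm(psi)

ms = [m / np.linalg.norm(m)
      for m in (rng.normal(size=(5, 8)) + 1j * rng.normal(size=(5, 8)))]
probs = measure(psi, ms)                  # interpretable measurement probabilities
logits = rng.normal(size=(2, 5)) @ probs  # stand-in for a trained read-out layer
print("class scores:", logits)
```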