Multi Stage Common Vector Space for Multimodal Embeddings

Sabarish Gopalakrishnan, Premkumar Udaiyar, Shagan Sah, R. Ptucha
{"title":"多模态嵌入的多阶段公共向量空间","authors":"Sabarish Gopalakrishnan, Premkumar Udaiyar, Shagan Sah, R. Ptucha","doi":"10.1109/AIPR47015.2019.9174583","DOIUrl":null,"url":null,"abstract":"Deep learning frameworks have proven to be very effective at tasks like classification, segmentation, detection, and translation. Before being processed by a deep learning model, objects are first encoded into a suitable vector representation. For example, images are typically encoded using convolutional neural networks whereas texts typically use recurrent neural networks. Similarly, other modalities of data like 3D point clouds, audio signals, and videos can be transformed into vectors using appropriate encoders. Although deep learning architectures do a good job of learning these vector representations in isolation, learning a single common representation across multiple modalities is a challenging task. In this work, we develop a Multi Stage Common Vector Space (M-CVS) that is suitable for encoding multiple modalities. The M-CVS is an efficient low-dimensional vector representation in which the contextual similarity of data is preserved across all modalities through the use of contrastive loss functions. Our vector space can perform tasks like multimodal retrieval, searching and generation, where for example, images can be retrieved from text or audio input. The addition of a new modality would generally mean resetting and training the entire network. However, we introduce a stagewise learning technique where each modality is compared to a reference modality before being projected to the M-CVS. Our method ensures that a new modality can be mapped into the MCVS without changing existing encodings, allowing the extension to any number of modalities. We build and evaluate M-CVS on the XMedia and XMedianet multimodal dataset. Extensive ablation experiments using images, text, audio, video, and 3D point cloud modalities demonstrate the complexity vs. accuracy tradeoff under a wide variety of real-world use cases.","PeriodicalId":167075,"journal":{"name":"2019 IEEE Applied Imagery Pattern Recognition Workshop (AIPR)","volume":"100 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multi Stage Common Vector Space for Multimodal Embeddings\",\"authors\":\"Sabarish Gopalakrishnan, Premkumar Udaiyar, Shagan Sah, R. Ptucha\",\"doi\":\"10.1109/AIPR47015.2019.9174583\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Deep learning frameworks have proven to be very effective at tasks like classification, segmentation, detection, and translation. Before being processed by a deep learning model, objects are first encoded into a suitable vector representation. For example, images are typically encoded using convolutional neural networks whereas texts typically use recurrent neural networks. Similarly, other modalities of data like 3D point clouds, audio signals, and videos can be transformed into vectors using appropriate encoders. Although deep learning architectures do a good job of learning these vector representations in isolation, learning a single common representation across multiple modalities is a challenging task. In this work, we develop a Multi Stage Common Vector Space (M-CVS) that is suitable for encoding multiple modalities. 
The M-CVS is an efficient low-dimensional vector representation in which the contextual similarity of data is preserved across all modalities through the use of contrastive loss functions. Our vector space can perform tasks like multimodal retrieval, searching and generation, where for example, images can be retrieved from text or audio input. The addition of a new modality would generally mean resetting and training the entire network. However, we introduce a stagewise learning technique where each modality is compared to a reference modality before being projected to the M-CVS. Our method ensures that a new modality can be mapped into the MCVS without changing existing encodings, allowing the extension to any number of modalities. We build and evaluate M-CVS on the XMedia and XMedianet multimodal dataset. Extensive ablation experiments using images, text, audio, video, and 3D point cloud modalities demonstrate the complexity vs. accuracy tradeoff under a wide variety of real-world use cases.\",\"PeriodicalId\":167075,\"journal\":{\"name\":\"2019 IEEE Applied Imagery Pattern Recognition Workshop (AIPR)\",\"volume\":\"100 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 IEEE Applied Imagery Pattern Recognition Workshop (AIPR)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/AIPR47015.2019.9174583\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE Applied Imagery Pattern Recognition Workshop (AIPR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/AIPR47015.2019.9174583","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

Deep learning frameworks have proven to be very effective at tasks like classification, segmentation, detection, and translation. Before being processed by a deep learning model, objects are first encoded into a suitable vector representation. For example, images are typically encoded using convolutional neural networks, whereas texts typically use recurrent neural networks. Similarly, other modalities of data, such as 3D point clouds, audio signals, and videos, can be transformed into vectors using appropriate encoders. Although deep learning architectures do a good job of learning these vector representations in isolation, learning a single common representation across multiple modalities is a challenging task. In this work, we develop a Multi Stage Common Vector Space (M-CVS) that is suitable for encoding multiple modalities. The M-CVS is an efficient low-dimensional vector representation in which the contextual similarity of data is preserved across all modalities through the use of contrastive loss functions. Our vector space supports tasks like multimodal retrieval, search, and generation, where, for example, images can be retrieved from text or audio input. Adding a new modality would generally mean resetting and retraining the entire network. However, we introduce a stagewise learning technique in which each modality is compared to a reference modality before being projected into the M-CVS. Our method ensures that a new modality can be mapped into the M-CVS without changing existing encodings, allowing extension to any number of modalities. We build and evaluate M-CVS on the XMedia and XMedianet multimodal datasets. Extensive ablation experiments using image, text, audio, video, and 3D point cloud modalities demonstrate the complexity vs. accuracy tradeoff under a wide variety of real-world use cases.
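The abstract ships no implementation, but the core mechanism it describes, projecting each modality's encoder output into a shared low-dimensional space and training with a contrastive loss that preserves cross-modal similarity, can be sketched. Below is a minimal illustrative sketch in PyTorch; the `ProjectionHead` architecture, the 256-dimensional space, and the margin value are assumptions, not the authors' published configuration.

```python
# Minimal sketch (assumptions, not the authors' code): modality-specific
# features are projected into a shared low-dimensional space, and a
# margin-based contrastive loss preserves cross-modal similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Maps one modality's encoder output into the common vector space.

    The two-layer MLP and the 256-d target space are illustrative choices.
    """
    def __init__(self, in_dim: int, cvs_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512),
            nn.ReLU(),
            nn.Linear(512, cvs_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalization keeps distances comparable across modalities.
        return F.normalize(self.net(x), dim=-1)

def contrastive_loss(a: torch.Tensor, b: torch.Tensor,
                     match: torch.Tensor, margin: float = 0.5) -> torch.Tensor:
    """Pulls matching cross-modal pairs together and pushes mismatched
    pairs at least `margin` apart (on squared Euclidean distance)."""
    d = (a - b).pow(2).sum(dim=-1)                      # squared distance
    return (match * d + (1 - match) * F.relu(margin - d)).mean()

# Toy usage: CNN image features (2048-d) vs. RNN text features (1024-d).
img_proj, txt_proj = ProjectionHead(2048), ProjectionHead(1024)
img_emb = img_proj(torch.randn(8, 2048))
txt_emb = txt_proj(torch.randn(8, 1024))
match = torch.tensor([1., 1., 0., 1., 0., 0., 1., 0.])  # 1 = paired sample
loss = contrastive_loss(img_emb, txt_emb, match)
```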
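The stagewise extension, where a new modality is aligned against a reference modality without disturbing encodings already in the space, might then look like the sketch below. Freezing the trained reference projection is one reading of "without changing existing encodings"; the choice of audio as the new modality, the feature dimensions, and the optimizer settings are hypothetical.

```python
# Stagewise extension sketch (same assumptions as above): the reference
# projection is frozen, so training the new modality cannot move the
# embeddings already in the common space.
ref_proj = ProjectionHead(1024)               # previously trained reference projection
for p in ref_proj.parameters():
    p.requires_grad = False                   # existing encodings stay fixed

audio_proj = ProjectionHead(128)              # new modality, trained from scratch
opt = torch.optim.Adam(audio_proj.parameters(), lr=1e-4)

ref_feats = torch.randn(8, 1024)              # stand-ins for real encoder outputs
audio_feats = torch.randn(8, 128)
match = torch.ones(8)                         # paired reference/new-modality samples

opt.zero_grad()
loss = contrastive_loss(ref_proj(ref_feats), audio_proj(audio_feats), match)
loss.backward()                               # gradients flow only into audio_proj
opt.step()
```

Because only the new projection head is optimized, each added modality costs one stage of training rather than a full retrain of every encoder.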