Sabarish Gopalakrishnan, Premkumar Udaiyar, Shagan Sah, R. Ptucha
{"title":"多模态嵌入的多阶段公共向量空间","authors":"Sabarish Gopalakrishnan, Premkumar Udaiyar, Shagan Sah, R. Ptucha","doi":"10.1109/AIPR47015.2019.9174583","DOIUrl":null,"url":null,"abstract":"Deep learning frameworks have proven to be very effective at tasks like classification, segmentation, detection, and translation. Before being processed by a deep learning model, objects are first encoded into a suitable vector representation. For example, images are typically encoded using convolutional neural networks whereas texts typically use recurrent neural networks. Similarly, other modalities of data like 3D point clouds, audio signals, and videos can be transformed into vectors using appropriate encoders. Although deep learning architectures do a good job of learning these vector representations in isolation, learning a single common representation across multiple modalities is a challenging task. In this work, we develop a Multi Stage Common Vector Space (M-CVS) that is suitable for encoding multiple modalities. The M-CVS is an efficient low-dimensional vector representation in which the contextual similarity of data is preserved across all modalities through the use of contrastive loss functions. Our vector space can perform tasks like multimodal retrieval, searching and generation, where for example, images can be retrieved from text or audio input. The addition of a new modality would generally mean resetting and training the entire network. However, we introduce a stagewise learning technique where each modality is compared to a reference modality before being projected to the M-CVS. Our method ensures that a new modality can be mapped into the MCVS without changing existing encodings, allowing the extension to any number of modalities. We build and evaluate M-CVS on the XMedia and XMedianet multimodal dataset. Extensive ablation experiments using images, text, audio, video, and 3D point cloud modalities demonstrate the complexity vs. accuracy tradeoff under a wide variety of real-world use cases.","PeriodicalId":167075,"journal":{"name":"2019 IEEE Applied Imagery Pattern Recognition Workshop (AIPR)","volume":"100 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multi Stage Common Vector Space for Multimodal Embeddings\",\"authors\":\"Sabarish Gopalakrishnan, Premkumar Udaiyar, Shagan Sah, R. Ptucha\",\"doi\":\"10.1109/AIPR47015.2019.9174583\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Deep learning frameworks have proven to be very effective at tasks like classification, segmentation, detection, and translation. Before being processed by a deep learning model, objects are first encoded into a suitable vector representation. For example, images are typically encoded using convolutional neural networks whereas texts typically use recurrent neural networks. Similarly, other modalities of data like 3D point clouds, audio signals, and videos can be transformed into vectors using appropriate encoders. Although deep learning architectures do a good job of learning these vector representations in isolation, learning a single common representation across multiple modalities is a challenging task. In this work, we develop a Multi Stage Common Vector Space (M-CVS) that is suitable for encoding multiple modalities. 
The M-CVS is an efficient low-dimensional vector representation in which the contextual similarity of data is preserved across all modalities through the use of contrastive loss functions. Our vector space can perform tasks like multimodal retrieval, searching and generation, where for example, images can be retrieved from text or audio input. The addition of a new modality would generally mean resetting and training the entire network. However, we introduce a stagewise learning technique where each modality is compared to a reference modality before being projected to the M-CVS. Our method ensures that a new modality can be mapped into the MCVS without changing existing encodings, allowing the extension to any number of modalities. We build and evaluate M-CVS on the XMedia and XMedianet multimodal dataset. Extensive ablation experiments using images, text, audio, video, and 3D point cloud modalities demonstrate the complexity vs. accuracy tradeoff under a wide variety of real-world use cases.\",\"PeriodicalId\":167075,\"journal\":{\"name\":\"2019 IEEE Applied Imagery Pattern Recognition Workshop (AIPR)\",\"volume\":\"100 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 IEEE Applied Imagery Pattern Recognition Workshop (AIPR)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/AIPR47015.2019.9174583\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE Applied Imagery Pattern Recognition Workshop (AIPR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/AIPR47015.2019.9174583","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Multi Stage Common Vector Space for Multimodal Embeddings
Deep learning frameworks have proven to be very effective at tasks like classification, segmentation, detection, and translation. Before being processed by a deep learning model, objects are first encoded into a suitable vector representation. For example, images are typically encoded using convolutional neural networks, whereas text is typically encoded using recurrent neural networks. Similarly, other modalities of data such as 3D point clouds, audio signals, and videos can be transformed into vectors using appropriate encoders. Although deep learning architectures do a good job of learning these vector representations in isolation, learning a single common representation across multiple modalities is a challenging task. In this work, we develop a Multi Stage Common Vector Space (M-CVS) that is suitable for encoding multiple modalities. The M-CVS is an efficient low-dimensional vector representation in which the contextual similarity of data is preserved across all modalities through the use of contrastive loss functions. Our vector space can perform tasks like multimodal retrieval, search, and generation, where, for example, images can be retrieved from text or audio inputs. Adding a new modality would generally require resetting and retraining the entire network. However, we introduce a stagewise learning technique in which each modality is compared to a reference modality before being projected into the M-CVS. Our method ensures that a new modality can be mapped into the M-CVS without changing existing encodings, allowing extension to any number of modalities. We build and evaluate M-CVS on the XMedia and XMediaNet multimodal datasets. Extensive ablation experiments using image, text, audio, video, and 3D point cloud modalities demonstrate the complexity vs. accuracy tradeoff under a wide variety of real-world use cases.
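The abstract does not include code, so the following is a minimal PyTorch sketch of the idea it describes: each modality gets a projection head into a shared common vector space, a contrastive loss pulls paired items from different modalities together, and a new modality is trained against a frozen reference modality so existing encodings stay unchanged. The class names, dimensions, InfoNCE-style loss, and choice of images as the reference modality are assumptions for illustration, not the authors' implementation.

# Illustrative sketch only: encoder sizes, loss formulation, and training
# procedure are assumptions; the paper does not publish this code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Maps a modality-specific feature vector into the shared vector space."""
    def __init__(self, in_dim: int, cvs_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, cvs_dim)
        )

    def forward(self, x):
        # L2-normalize so cosine similarity reduces to a dot product.
        return F.normalize(self.net(x), dim=-1)

def contrastive_loss(anchor, positive, temperature=0.07):
    """InfoNCE-style loss (an assumed form of the paper's contrastive loss):
    paired items across modalities are pulled together, unpaired items in
    the batch act as negatives."""
    logits = anchor @ positive.t() / temperature          # (B, B) similarities
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)

def train_new_modality_step(ref_proj, new_proj, ref_feats, new_feats, optimizer):
    """Stagewise extension (assumed procedure): the reference modality's
    projection is frozen, so existing encodings in the common space do not
    move; only the new modality's projection head is updated."""
    ref_proj.eval()
    with torch.no_grad():
        ref_emb = ref_proj(ref_feats)                     # fixed reference anchors
    new_emb = new_proj(new_feats)                         # trainable new modality
    loss = contrastive_loss(new_emb, ref_emb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

For example, with images as the (hypothetical) reference modality, an audio projection head could be added later by calling train_new_modality_step with frozen image projections and paired audio features, without retraining the image or text branches.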