Unpaired Image-Text Matching via Multimodal Aligned Conceptual Knowledge

IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2024-07-23. DOI: 10.1109/TPAMI.2024.3432552

Yan Huang, Yuming Wang, Yunan Zeng, Junshi Huang, Zhenhua Chai, Liang Wang


Citations: 0



Unpaired Image-text Matching via Multimodal Aligned Conceptual Knowledge.

Recently, the accuracy of image-text matching has been greatly improved by multimodal pretrained models, all of which use millions or billions of paired images and texts for supervised model learning. In contrast, the human brain can match images with texts well using its stored multimodal knowledge. Inspired by this, the paper studies a new scenario, unpaired image-text matching, in which paired images and texts are assumed to be unavailable during model learning. To address it, we propose a simple yet effective method called Multimodal Aligned Conceptual Knowledge (MACK). First, we collect a set of words and their related image regions from publicly available datasets, and compute prototypical region representations to obtain pretrained general knowledge. To make this knowledge better suited to particular datasets, we refine it using unpaired images and texts in a self-supervised manner to obtain fine-tuned domain knowledge. Then, to match given images with texts based on the knowledge, we represent the parsed words of each text by their prototypical region representations and compute region-word similarity scores. Finally, the scores are aggregated by bidirectional similarity pooling into an image-text similarity score, which can be used directly for unpaired image-text matching. The proposed MACK is complementary to existing models and can be easily extended into a re-ranking method that substantially improves their performance on zero-shot and cross-dataset image-text matching.
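The matching pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: all function names and the toy feature vectors are hypothetical, the prototype of a word is assumed to be the mean of its region features, and bidirectional similarity pooling is assumed to be a max-then-mean in each direction followed by an average. The abstract does not specify these details, and the self-supervised refinement step is omitted.

```python
import math
from collections import defaultdict


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_prototypes(word_region_pairs):
    """Map each word to the mean of its region features (assumed prototype)."""
    grouped = defaultdict(list)
    for word, feature in word_region_pairs:
        grouped[word].append(feature)
    return {word: mean_vector(feats) for word, feats in grouped.items()}


def image_text_score(regions, words, prototypes):
    """Score an image (list of region features) against parsed caption words."""
    # Words without a prototype are skipped: the knowledge base cannot ground them.
    protos = [prototypes[w] for w in words if w in prototypes]
    if not regions or not protos:
        return 0.0
    # Region-word similarity matrix: rows are regions, columns are words.
    sim = [[cosine(r, p) for p in protos] for r in regions]
    # Text-to-image direction: best region for each word, averaged over words.
    word_to_image = sum(max(row[j] for row in sim) for j in range(len(protos))) / len(protos)
    # Image-to-text direction: best word for each region, averaged over regions.
    image_to_word = sum(max(row) for row in sim) / len(regions)
    # Assumed pooling: simple average of the two directions.
    return 0.5 * (word_to_image + image_to_word)


# Toy knowledge base: (word, region feature) pairs with made-up 2-D features.
pairs = [("dog", [1.0, 0.0]), ("dog", [0.8, 0.2]), ("ball", [0.0, 1.0])]
prototypes = build_prototypes(pairs)

image = [[0.9, 0.1], [0.1, 0.9]]   # one dog-like and one ball-like region
ball_only_image = [[0.0, 1.0]]     # a single ball-like region

matching = image_text_score(image, ["dog", "ball"], prototypes)
mismatched = image_text_score(ball_only_image, ["dog"], prototypes)
```

In this toy setting the caption whose words are all grounded in the image scores higher than the mismatched pair. Used for re-ranking, such a score could be combined with the similarity produced by an existing model, although the combination rule is not given in the abstract.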

