事件密集文本和图像之间的跨模态检索

Proceedings of the 2022 International Conference on Multimedia Retrieval Pub Date : 2022-06-27 DOI:10.1145/3512527.3531374

Zhongwei Xie, Lin Li, Luo Zhong, Jianquan Liu, Ling Liu

{"title":"事件密集文本和图像之间的跨模态检索","authors":"Zhongwei Xie, Lin Li, Luo Zhong, Jianquan Liu, Ling Liu","doi":"10.1145/3512527.3531374","DOIUrl":null,"url":null,"abstract":"This paper presents a novel approach to the problem of event-dense text and image cross-modal retrieval where the text contains the descriptions of numerous events. It is known that modality alignment is crucial for retrieval performance. However, due to the lack of event sequence information in the image, it is challenging to perform the fine-grain alignment of the event-dense text with the image. Our proposed approach incorporates the event-oriented features to enhance the cross-modal alignment, and applies the event-dense text-image retrieval to the food domain for empirical validation. Specifically, we capture the significance of each event by Transformer, and combine it with the identified key event elements, to enhance the discriminative ability of the learned text embedding that summarizes all the events. Next, we produce the image embedding by combining the event tag jointly shared by the text and image with the visual embedding of the event-related image regions, which describes the eventual consequence of all the events and facilitates the event-based cross-modal alignment. Finally, we integrate text embedding and image embedding with the loss optimization empowered with the event tag by iteratively regulating the joint embedding learning for cross-modal retrieval. Extensive experiments demonstrate that our proposed event-oriented modality alignment approach significantly outperforms the state-of-the-art approach with a 23.3% improvement on top-1 Recall for image-to-recipe retrieval on Recipe1M 10k test set.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"8 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Cross-Modal Retrieval between Event-Dense Text and Image\",\"authors\":\"Zhongwei Xie, Lin Li, Luo Zhong, Jianquan Liu, Ling Liu\",\"doi\":\"10.1145/3512527.3531374\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper presents a novel approach to the problem of event-dense text and image cross-modal retrieval where the text contains the descriptions of numerous events. It is known that modality alignment is crucial for retrieval performance. However, due to the lack of event sequence information in the image, it is challenging to perform the fine-grain alignment of the event-dense text with the image. Our proposed approach incorporates the event-oriented features to enhance the cross-modal alignment, and applies the event-dense text-image retrieval to the food domain for empirical validation. Specifically, we capture the significance of each event by Transformer, and combine it with the identified key event elements, to enhance the discriminative ability of the learned text embedding that summarizes all the events. Next, we produce the image embedding by combining the event tag jointly shared by the text and image with the visual embedding of the event-related image regions, which describes the eventual consequence of all the events and facilitates the event-based cross-modal alignment. Finally, we integrate text embedding and image embedding with the loss optimization empowered with the event tag by iteratively regulating the joint embedding learning for cross-modal retrieval. Extensive experiments demonstrate that our proposed event-oriented modality alignment approach significantly outperforms the state-of-the-art approach with a 23.3% improvement on top-1 Recall for image-to-recipe retrieval on Recipe1M 10k test set.\",\"PeriodicalId\":179895,\"journal\":{\"name\":\"Proceedings of the 2022 International Conference on Multimedia Retrieval\",\"volume\":\"8 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-06-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2022 International Conference on Multimedia Retrieval\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3512527.3531374\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2022 International Conference on Multimedia Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3512527.3531374","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 4

摘要

本文提出了一种新的方法来解决事件密集文本和图像跨模态检索问题，其中文本包含许多事件的描述。众所周知，模态对齐对检索性能至关重要。然而，由于图像中缺乏事件序列信息，对事件密集文本与图像进行细粒度对齐是具有挑战性的。我们提出的方法结合了面向事件的特征来增强跨模态对齐，并将事件密集的文本图像检索应用于食品领域进行经验验证。具体而言，我们通过Transformer捕获每个事件的意义，并将其与识别出的关键事件元素结合起来，以增强学习文本嵌入对所有事件的识别能力。接下来，我们通过将文本和图像共同共享的事件标签与事件相关图像区域的视觉嵌入相结合来生成图像嵌入，该嵌入描述了所有事件的最终结果，并促进了基于事件的跨模态对齐。最后，我们通过迭代调节联合嵌入学习进行跨模态检索，将文本嵌入和图像嵌入与事件标签授权的损失优化相结合。大量的实验表明，我们提出的面向事件的模态对齐方法在Recipe1M 10k测试集上的图像到食谱检索的top-1召回率提高了23.3%，显著优于最先进的方法。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Cross-Modal Retrieval between Event-Dense Text and Image

This paper presents a novel approach to the problem of event-dense text and image cross-modal retrieval where the text contains the descriptions of numerous events. It is known that modality alignment is crucial for retrieval performance. However, due to the lack of event sequence information in the image, it is challenging to perform the fine-grain alignment of the event-dense text with the image. Our proposed approach incorporates the event-oriented features to enhance the cross-modal alignment, and applies the event-dense text-image retrieval to the food domain for empirical validation. Specifically, we capture the significance of each event by Transformer, and combine it with the identified key event elements, to enhance the discriminative ability of the learned text embedding that summarizes all the events. Next, we produce the image embedding by combining the event tag jointly shared by the text and image with the visual embedding of the event-related image regions, which describes the eventual consequence of all the events and facilitates the event-based cross-modal alignment. Finally, we integrate text embedding and image embedding with the loss optimization empowered with the event tag by iteratively regulating the joint embedding learning for cross-modal retrieval. Extensive experiments demonstrate that our proposed event-oriented modality alignment approach significantly outperforms the state-of-the-art approach with a 23.3% improvement on top-1 Recall for image-to-recipe retrieval on Recipe1M 10k test set.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 2022 International Conference on Multimedia Retrieval

自引率

0.00%

发文量