Cross-Modal Retrieval between Event-Dense Text and Image

Zhongwei Xie, Lin Li, Luo Zhong, Jianquan Liu, Ling Liu
{"title":"事件密集文本和图像之间的跨模态检索","authors":"Zhongwei Xie, Lin Li, Luo Zhong, Jianquan Liu, Ling Liu","doi":"10.1145/3512527.3531374","DOIUrl":null,"url":null,"abstract":"This paper presents a novel approach to the problem of event-dense text and image cross-modal retrieval where the text contains the descriptions of numerous events. It is known that modality alignment is crucial for retrieval performance. However, due to the lack of event sequence information in the image, it is challenging to perform the fine-grain alignment of the event-dense text with the image. Our proposed approach incorporates the event-oriented features to enhance the cross-modal alignment, and applies the event-dense text-image retrieval to the food domain for empirical validation. Specifically, we capture the significance of each event by Transformer, and combine it with the identified key event elements, to enhance the discriminative ability of the learned text embedding that summarizes all the events. Next, we produce the image embedding by combining the event tag jointly shared by the text and image with the visual embedding of the event-related image regions, which describes the eventual consequence of all the events and facilitates the event-based cross-modal alignment. Finally, we integrate text embedding and image embedding with the loss optimization empowered with the event tag by iteratively regulating the joint embedding learning for cross-modal retrieval. Extensive experiments demonstrate that our proposed event-oriented modality alignment approach significantly outperforms the state-of-the-art approach with a 23.3% improvement on top-1 Recall for image-to-recipe retrieval on Recipe1M 10k test set.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"8 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Cross-Modal Retrieval between Event-Dense Text and Image\",\"authors\":\"Zhongwei Xie, Lin Li, Luo Zhong, Jianquan Liu, Ling Liu\",\"doi\":\"10.1145/3512527.3531374\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper presents a novel approach to the problem of event-dense text and image cross-modal retrieval where the text contains the descriptions of numerous events. It is known that modality alignment is crucial for retrieval performance. However, due to the lack of event sequence information in the image, it is challenging to perform the fine-grain alignment of the event-dense text with the image. Our proposed approach incorporates the event-oriented features to enhance the cross-modal alignment, and applies the event-dense text-image retrieval to the food domain for empirical validation. Specifically, we capture the significance of each event by Transformer, and combine it with the identified key event elements, to enhance the discriminative ability of the learned text embedding that summarizes all the events. Next, we produce the image embedding by combining the event tag jointly shared by the text and image with the visual embedding of the event-related image regions, which describes the eventual consequence of all the events and facilitates the event-based cross-modal alignment. 
Finally, we integrate text embedding and image embedding with the loss optimization empowered with the event tag by iteratively regulating the joint embedding learning for cross-modal retrieval. Extensive experiments demonstrate that our proposed event-oriented modality alignment approach significantly outperforms the state-of-the-art approach with a 23.3% improvement on top-1 Recall for image-to-recipe retrieval on Recipe1M 10k test set.\",\"PeriodicalId\":179895,\"journal\":{\"name\":\"Proceedings of the 2022 International Conference on Multimedia Retrieval\",\"volume\":\"8 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-06-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2022 International Conference on Multimedia Retrieval\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3512527.3531374\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2022 International Conference on Multimedia Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3512527.3531374","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

This paper presents a novel approach to cross-modal retrieval between event-dense text and images, where the text contains descriptions of numerous events. Modality alignment is known to be crucial for retrieval performance; however, because the image lacks event-sequence information, fine-grained alignment of the event-dense text with the image is challenging. Our proposed approach incorporates event-oriented features to enhance cross-modal alignment, and applies event-dense text-image retrieval to the food domain for empirical validation. Specifically, we capture the significance of each event with a Transformer and combine it with the identified key event elements to enhance the discriminative ability of the learned text embedding that summarizes all the events. Next, we produce the image embedding by combining the event tag jointly shared by the text and image with the visual embedding of the event-related image regions; since the image describes the eventual consequence of all the events, this facilitates event-based cross-modal alignment. Finally, we integrate the text and image embeddings through event-tag-empowered loss optimization that iteratively regulates joint embedding learning for cross-modal retrieval. Extensive experiments demonstrate that our event-oriented modality alignment approach significantly outperforms the state-of-the-art approach, with a 23.3% improvement in top-1 Recall for image-to-recipe retrieval on the Recipe1M 10k test set.
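The abstract outlines three components: a Transformer-based text encoder that summarizes the events, an image encoder that fuses a shared event tag with region-level visual features, and event-tag-guided loss optimization over the joint embedding space. The paper's own code is not reproduced here; the following is a minimal PyTorch sketch of that general architecture, in which all module names, dimensions, the mean-pooling choices, and the bidirectional triplet loss are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of an event-oriented joint-embedding model in the spirit of
# the abstract. NOT the authors' implementation: every name, dimension,
# pooling choice, and the triplet loss below are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EventTextEncoder(nn.Module):
    """Encodes a sequence of per-event vectors into one text embedding.
    A Transformer weighs the significance of each event; mean pooling then
    summarizes all events into a single vector."""
    def __init__(self, event_dim=300, embed_dim=512, num_layers=2, num_heads=4):
        super().__init__()
        self.proj = nn.Linear(event_dim, embed_dim)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, events):                 # (batch, num_events, event_dim)
        h = self.encoder(self.proj(events))    # (batch, num_events, embed_dim)
        return F.normalize(h.mean(dim=1), dim=-1)

class EventImageEncoder(nn.Module):
    """Fuses pooled region features with an embedding of the event tag
    jointly shared by the text and the image."""
    def __init__(self, region_dim=2048, num_tags=1000, embed_dim=512):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, embed_dim)
        self.tag_embed = nn.Embedding(num_tags, embed_dim)

    def forward(self, regions, tag_ids):       # regions: (batch, n_regions, region_dim)
        visual = self.region_proj(regions).mean(dim=1)
        return F.normalize(visual + self.tag_embed(tag_ids), dim=-1)

def bidirectional_triplet_loss(txt, img, margin=0.3):
    """In-batch hard-negative triplet loss applied in both retrieval directions."""
    sim = txt @ img.t()                        # cosine similarities (batch, batch)
    pos = sim.diag().unsqueeze(1)              # matched-pair similarities
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    off_diag = sim.masked_fill(mask, -1.0)     # exclude the positive pairs
    hard_t2i = off_diag.max(dim=1, keepdim=True).values   # hardest image per text
    hard_i2t = off_diag.max(dim=0).values.unsqueeze(1)    # hardest text per image
    return (F.relu(margin + hard_t2i - pos)
            + F.relu(margin + hard_i2t - pos)).mean()

# Toy usage: 4 text-image pairs, 6 events of 300-d, 5 regions of 2048-d each.
txt = EventTextEncoder()(torch.randn(4, 6, 300))
img = EventImageEncoder()(torch.randn(4, 5, 2048), torch.randint(0, 1000, (4,)))
loss = bidirectional_triplet_loss(txt, img)
loss.backward()
```

Under this setup, the top-1 Recall reported in the abstract would be computed at test time by ranking all image embeddings by similarity to each text query (and vice versa) and counting how often the matched item is ranked first.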