Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control

European Conference on Information Retrieval. Pub Date: 2024-02-27. DOI: 10.48550/arXiv.2402.17535

Thong Nguyen, Mariya Hendriksen, Andrew Yates, M. de Rijke


Citations: 1

Abstract

Learned sparse retrieval (LSR) is a family of neural methods that encode queries and documents into sparse lexical vectors that can be indexed and retrieved efficiently with an inverted index. We explore the application of LSR to the multimodal domain, with a focus on text-image retrieval. While LSR has seen success in text retrieval, its application in multimodal retrieval remains underexplored. Current approaches like LexLIP and STAIR require complex multi-step training on massive datasets. Our proposed approach efficiently transforms dense vectors from a frozen dense model into sparse lexical vectors. We address issues of high dimension co-activation and semantic deviation through a new training algorithm, using Bernoulli random variables to control query expansion. Experiments with two dense models (BLIP, ALBEF) and two datasets (MSCOCO, Flickr30k) show that our proposed algorithm effectively reduces co-activation and semantic deviation. Our best-performing sparsified model outperforms state-of-the-art text-image LSR models with a shorter training time and lower GPU memory requirements. Our approach offers an effective solution for training LSR retrieval models in multimodal settings. Our code and model checkpoints are available at github.com/thongnt99/lsr-multimodal.
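To make the two ideas in the abstract concrete, here is a minimal sketch in Python. It is not the authors' implementation (that lives in the repository named above). It only illustrates, under stated assumptions, how sparse lexical vectors can be scored through an inverted index and how a Bernoulli draw could gate query expansion terms. The function names, the toy term weights, and the single `keep_prob` parameter are all hypothetical; the paper's training algorithm may differ substantially.

```python
import random
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a list of (doc_id, weight) postings."""
    index = defaultdict(list)
    for doc_id, vec in docs.items():
        for term, weight in vec.items():
            if weight > 0.0:
                index[term].append((doc_id, weight))
    return index


def search(index, query_vec):
    """Score documents by the dot product of sparse query and document vectors."""
    scores = defaultdict(float)
    for term, q_weight in query_vec.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def gate_expansion(vec, original_terms, keep_prob, rng):
    """Keep original terms; keep each expansion term with probability keep_prob.

    Assumption: one shared Bernoulli probability for all expansion terms.
    """
    gated = {}
    for term, weight in vec.items():
        if term in original_terms or rng.random() < keep_prob:
            gated[term] = weight
    return gated


# Toy sparse vectors (hypothetical weights) standing in for encoded images.
docs = {
    "img1": {"dog": 1.2, "grass": 0.7},
    "img2": {"cat": 1.1, "sofa": 0.9},
}
index = build_inverted_index(docs)

# Query "dog" plus two hypothetical expansion terms proposed by an encoder.
query = {"dog": 1.0, "grass": 0.4, "sofa": 0.3}
rng = random.Random(0)
no_expansion = gate_expansion(query, {"dog"}, keep_prob=0.0, rng=rng)
full_expansion = gate_expansion(query, {"dog"}, keep_prob=1.0, rng=rng)

print(search(index, no_expansion))
print(search(index, full_expansion))
```

With `keep_prob=0.0` the query matches only through its original term, so only `img1` is retrieved. With `keep_prob=1.0` every expansion term is kept and `img2` is also reached through the expanded term `sofa`. Intermediate probabilities trade off between these two extremes.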
