Realizing Efficient On-Device Language-based Image Retrieval

Impact factor: 5.2 · Zone 3 (Computer Science) · JCR Q1 (Computer Science, Information Systems) · ACM Transactions on Multimedia Computing Communications and Applications · Pub Date: 2024-03-15 · DOI: 10.1145/3649896
Zhiming Hu, Mete Kemertas, Lan Xiao, Caleb Phillips, Iqbal Mohomed, Afsaneh Fazly
Citation count: 0

Abstract

Advances in deep learning have enabled accurate language-based search and retrieval in the cloud, e.g., over user photos. Many users, however, prefer to store their photos at home due to privacy concerns. As such, a need arises for models that can perform cross-modal search on resource-limited devices. State-of-the-art cross-modal retrieval models achieve high accuracy by learning entangled representations that enable fine-grained similarity calculation between a language query and an image, but at the expense of prohibitively high retrieval latency. Alternatively, a newer class of methods exhibits good performance with low latency but requires far more computational resources and an order of magnitude more training data (i.e., large web-scraped datasets consisting of millions of image-caption pairs), making these methods infeasible to use in a commercial context. From a pragmatic perspective, none of the existing methods is suitable for developing commercial applications for low-latency cross-modal retrieval on low-resource devices. We propose CrispSearch, a cascaded approach that greatly reduces retrieval latency with minimal loss in ranking accuracy for on-device language-based image retrieval. The idea behind our approach is to combine a lightweight, runtime-efficient coarse model with a fine re-ranking stage. Given a language query, the coarse model filters out many of the irrelevant image candidates. After this filtering, only a handful of strong candidates are selected and sent to a fine model for re-ranking. Extensive experiments on standard benchmark datasets, with two SOTA models for the fine re-ranking stage, show that CrispSearch achieves a speedup of up to 38 times over the SOTA fine methods with negligible performance degradation. Moreover, our method does not require millions of training instances, making it a pragmatic solution to on-device search and retrieval.
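
The coarse-then-fine cascade described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how either stage scores images, so the dot-product coarse scorer over precomputed embeddings, the names `cascaded_search` and `toy_fine_score`, and all numbers below are assumptions chosen only to show the control flow.

```python
import heapq
from typing import Callable, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Cheap similarity used by the (assumed) coarse stage."""
    return sum(x * y for x, y in zip(a, b))


def cascaded_search(
    query_vec: Vector,
    image_vecs: Sequence[Vector],
    fine_score: Callable[[int], float],
    shortlist_size: int,
) -> List[int]:
    """Return image indices ranked by the fine scorer, restricted to a coarse shortlist."""
    # Coarse stage: one cheap dot product per image, keep the top candidates.
    shortlist = heapq.nlargest(
        shortlist_size,
        range(len(image_vecs)),
        key=lambda i: dot(query_vec, image_vecs[i]),
    )
    # Fine stage: the expensive scorer runs only on the shortlist.
    return sorted(shortlist, key=fine_score, reverse=True)


# Toy data (hypothetical): four image embeddings and a stand-in fine scorer.
image_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
toy_fine_scores = {0: 0.2, 1: 0.8, 2: 0.9, 3: 0.1}
fine_calls: List[int] = []


def toy_fine_score(image_index: int) -> float:
    # Record each call so the saving from the cascade is visible.
    fine_calls.append(image_index)
    return toy_fine_scores[image_index]


ranking = cascaded_search([1.0, 0.0], image_vecs, toy_fine_score, shortlist_size=2)
print(ranking, len(fine_calls))  # prints: [1, 0] 2
```

In this toy run the fine scorer is called on 2 of the 4 images instead of all of them, which is where a cascade would save latency. Image 2 has the highest fine score but never reaches the fine stage, which shows the trade-off a shortlist size has to balance: a smaller shortlist is faster but can drop candidates the fine model would have ranked highly.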

Source journal
CiteScore: 8.50
Self-citation rate: 5.90%
Articles published: 285
Review time: 7.5 months
About the journal: The ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) is the flagship publication of the ACM Special Interest Group in Multimedia (SIGMM). It solicits paper submissions on all aspects of multimedia. Papers on single media (for instance, audio, video, animation) and their processing are also welcome. TOMM is a peer-reviewed, archival journal, available in both print and digital form. The journal is published quarterly, with roughly seven 23-page articles in each issue. In addition, all Special Issues are published online-only to ensure timely publication. The Transactions consists primarily of research papers. As an archival journal, it is intended that its papers will have lasting importance and value over time. In general, papers whose primary focus is on particular multimedia products or the current state of the industry will not be included.