A Robust Approach to Open Vocabulary Image Retrieval with Deep Convolutional Neural Networks and Transfer Learning

Vishakh Padmakumar, Rishab Ranga, Srivalya Elluru, S. Kamath S
{"title":"A Robust Approach to Open Vocabulary Image Retrieval with Deep Convolutional Neural Networks and Transfer Learning","authors":"Vishakh Padmakumar, Rishab Ranga, Srivalya Elluru, S. Kamath S","doi":"10.23919/PNC.2018.8579473","DOIUrl":null,"url":null,"abstract":"Enabling computer systems to respond to conversational human language is a challenging problem with wideranging applications in the field of robotics and human computer interaction. Specifically, in image searches, humans tend to describe objects in fine-grained detail like color or company, for which conventional retrieval algorithms have shown poor performance. In this paper, a novel approach for open vocabulary image retrieval, capable of selecting the correct candidate image from among a set of distractions given a query in natural language form, is presented. Our methodology focuses on generating a robust set of image-text projections capable of accurately representing any image, with an objective of achieving high recall. To this end, an ensemble of classifiers is trained on ImageNet for representing high-resolution objects, Cifar 100 for smaller resolution images of objects and Caltech 256 for challenging views of everyday objects, for generating category-based projections. In addition to category based projections, we also make use of an image captioning model trained on MS COCO and Google Image Search (GISS) to capture additional semantic/latent information about the candidate images. To facilitate image retrieval, the natural language query and projection results are converted to a common vector representation using word embeddings, with which query-image similarity is computed. The proposed model when benchmarked on the RefCoco dataset, achieved an accuracy of 68.8%, while retrieving semantically meaningful candidate images.","PeriodicalId":409931,"journal":{"name":"2018 Pacific Neighborhood Consortium Annual Conference and Joint Meetings (PNC)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 Pacific Neighborhood Consortium Annual Conference and Joint Meetings (PNC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.23919/PNC.2018.8579473","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Enabling computer systems to respond to conversational human language is a challenging problem with wide-ranging applications in robotics and human-computer interaction. In image search specifically, humans tend to describe objects in fine-grained detail, such as color or brand, on which conventional retrieval algorithms perform poorly. In this paper, a novel approach to open vocabulary image retrieval is presented, capable of selecting the correct candidate image from among a set of distractors given a query in natural language form. Our methodology focuses on generating a robust set of image-text projections capable of accurately representing any image, with the objective of achieving high recall. To this end, an ensemble of classifiers is trained to generate category-based projections: on ImageNet for high-resolution objects, on CIFAR-100 for lower-resolution images of objects, and on Caltech-256 for challenging views of everyday objects. In addition to these category-based projections, we also use an image captioning model trained on MS COCO and Google Image Search (GISS) to capture additional semantic/latent information about the candidate images. To facilitate retrieval, the natural language query and the projection results are converted to a common vector representation using word embeddings, with which query-image similarity is computed. When benchmarked on the RefCOCO dataset, the proposed model achieved an accuracy of 68.8% while retrieving semantically meaningful candidate images.
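As a concrete illustration of the retrieval step described in the abstract, the following is a minimal sketch (not the authors' released code) of matching a natural language query against per-image text projections via averaged word embeddings and cosine similarity. The embedding table, the function names, and the max-over-projections aggregation are illustrative assumptions; in the paper the projections would come from the classifier ensemble and captioning model, and the embeddings from a pretrained model such as word2vec or GloVe.

```python
# Sketch of the query-image matching stage: each candidate image is
# represented by the text projections generated for it (category predictions
# plus a caption), both query and projections are embedded as averaged word
# vectors, and the highest-scoring image is returned.
# NOTE: `embeddings` is a stand-in for a pretrained word-embedding lookup
# (e.g., GloVe); scoring an image by the max over its projections is an
# assumption, not necessarily the exact rule used in the paper.
from typing import Dict, List
import numpy as np

def embed_text(text: str, embeddings: Dict[str, np.ndarray], dim: int) -> np.ndarray:
    """Average the embeddings of in-vocabulary words; zeros if none match."""
    vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    denom = float(np.linalg.norm(a) * np.linalg.norm(b))
    return float(a @ b) / denom if denom > 0.0 else 0.0

def retrieve(query: str,
             projections: Dict[str, List[str]],
             embeddings: Dict[str, np.ndarray],
             dim: int = 300) -> str:
    """Return the id of the candidate image whose projections best match the query."""
    q = embed_text(query, embeddings, dim)
    scores = {
        image_id: max((cosine(q, embed_text(p, embeddings, dim)) for p in texts),
                      default=0.0)
        for image_id, texts in projections.items()
    }
    return max(scores, key=scores.get)
```

For example, with projections {"img1": ["a red car parked on the street"], "img2": ["a dog on a couch"]} and reasonable embeddings, the query "the red car" would select img1.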
Latest articles in this journal:
- Human Rights Components of an eBook Portfolio
- Current Movement of "Digital Archives in Japan" and "khirin (Knowledgebase of Historical Resources in Institutes)"
- The Conflict Between Privacy and Scientific Research in the GDPR
- An Effective CNN Approach for Vertebrae Segmentation from 3D CT Images
- Robot Indoor Navigation Using Visible Light Communication