Adaptive CLIP for open-domain 3D model retrieval

IF 7.4 · CAS Tier 1 (Management Science) · JCR Q1 (Computer Science, Information Systems) · Information Processing & Management · Pub Date: 2024-11-29 · DOI: 10.1016/j.ipm.2024.103989
Dan Song, Zekai Qiang, Chumeng Zhang, Lanjun Wang, Qiong Liu, You Yang, An-An Liu
{"title":"开放域三维模型检索的自适应CLIP","authors":"Dan Song ,&nbsp;Zekai Qiang ,&nbsp;Chumeng Zhang ,&nbsp;Lanjun Wang ,&nbsp;Qiong Liu ,&nbsp;You Yang ,&nbsp;An-An Liu","doi":"10.1016/j.ipm.2024.103989","DOIUrl":null,"url":null,"abstract":"<div><div>In order to effectively enhance the practicality of 3D model retrieval, we adopt a single real image as the query sample for retrieving 3D models. However, the significant differences between 2D images and 3D models in terms of lighting conditions, textures and backgrounds, posing a great challenge for accurate retrieval. Existing work on 3D model retrieval mainly focuses on closed-domain research, while the open-domain condition where the category relationship between the query image and the 3D model is unknown is more in line with the needs of real scenarios. CLIP shows significant promise in comprehending open-world visual concepts, facilitating effective zero-shot image recognition. Based on this multimodal pre-training large language model, we introduce Adaptive Open-domain Semantic Nearest-neighbor Contrast (AOSNC), a method for learning and aligning multi-modal text, image, and 3D model. In order to solve the issue of inconsistent cross-domain categories and difficult sample correlation in open-domain, we construct a cross-modal bridge using CLIP. This model utilizes textual features to bridge the gap between 2D images and 3D model views. Additionally, we design an adaptive network layer to address the limitations of the pre-training model for 3D model views and enhance cross-modal alignment. We propose a mutual nearest-neighbor semantic alignment loss to address the challenge of aligning features from disparate modalities (text, images, and 3D models). This loss function enhances cross-modal learning by effectively associating and distinguishing features, improving retrieval accuracy. We conducted comprehensive experiments using the image-based 3D model retrieval dataset MI3DOR and the cross-domain 3D model retrieval dataset NTU-PSB to validate the superiority of the proposed method. Our results show significant improvements in several evaluation metrics, underscoring the efficacy of our method in augmenting cross-modal feature alignment and retrieval performance.</div></div>","PeriodicalId":50365,"journal":{"name":"Information Processing & Management","volume":"62 2","pages":"Article 103989"},"PeriodicalIF":7.4000,"publicationDate":"2024-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Adaptive CLIP for open-domain 3D model retrieval\",\"authors\":\"Dan Song ,&nbsp;Zekai Qiang ,&nbsp;Chumeng Zhang ,&nbsp;Lanjun Wang ,&nbsp;Qiong Liu ,&nbsp;You Yang ,&nbsp;An-An Liu\",\"doi\":\"10.1016/j.ipm.2024.103989\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>In order to effectively enhance the practicality of 3D model retrieval, we adopt a single real image as the query sample for retrieving 3D models. However, the significant differences between 2D images and 3D models in terms of lighting conditions, textures and backgrounds, posing a great challenge for accurate retrieval. Existing work on 3D model retrieval mainly focuses on closed-domain research, while the open-domain condition where the category relationship between the query image and the 3D model is unknown is more in line with the needs of real scenarios. CLIP shows significant promise in comprehending open-world visual concepts, facilitating effective zero-shot image recognition. 
Based on this multimodal pre-training large language model, we introduce Adaptive Open-domain Semantic Nearest-neighbor Contrast (AOSNC), a method for learning and aligning multi-modal text, image, and 3D model. In order to solve the issue of inconsistent cross-domain categories and difficult sample correlation in open-domain, we construct a cross-modal bridge using CLIP. This model utilizes textual features to bridge the gap between 2D images and 3D model views. Additionally, we design an adaptive network layer to address the limitations of the pre-training model for 3D model views and enhance cross-modal alignment. We propose a mutual nearest-neighbor semantic alignment loss to address the challenge of aligning features from disparate modalities (text, images, and 3D models). This loss function enhances cross-modal learning by effectively associating and distinguishing features, improving retrieval accuracy. We conducted comprehensive experiments using the image-based 3D model retrieval dataset MI3DOR and the cross-domain 3D model retrieval dataset NTU-PSB to validate the superiority of the proposed method. Our results show significant improvements in several evaluation metrics, underscoring the efficacy of our method in augmenting cross-modal feature alignment and retrieval performance.</div></div>\",\"PeriodicalId\":50365,\"journal\":{\"name\":\"Information Processing & Management\",\"volume\":\"62 2\",\"pages\":\"Article 103989\"},\"PeriodicalIF\":7.4000,\"publicationDate\":\"2024-11-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information Processing & Management\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0306457324003480\",\"RegionNum\":1,\"RegionCategory\":\"管理学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Processing & Management","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0306457324003480","RegionNum":1,"RegionCategory":"管理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

In order to effectively enhance the practicality of 3D model retrieval, we adopt a single real image as the query sample for retrieving 3D models. However, 2D images and 3D models differ significantly in lighting conditions, textures, and backgrounds, which poses a great challenge for accurate retrieval. Existing work on 3D model retrieval focuses mainly on the closed-domain setting, whereas the open-domain setting, in which the category relationship between the query image and the 3D models is unknown, better matches the needs of real scenarios. CLIP shows significant promise in comprehending open-world visual concepts, enabling effective zero-shot image recognition. Building on this multimodal pre-trained model, we introduce Adaptive Open-domain Semantic Nearest-neighbor Contrast (AOSNC), a method for learning and aligning multi-modal text, image, and 3D model features. To address inconsistent categories across domains and the difficulty of correlating samples in the open domain, we construct a cross-modal bridge using CLIP, which uses textual features to close the gap between 2D images and 3D model views. Additionally, we design an adaptive network layer that mitigates the pre-trained model's limitations on 3D model views and strengthens cross-modal alignment. We further propose a mutual nearest-neighbor semantic alignment loss to address the challenge of aligning features from disparate modalities (text, images, and 3D models); this loss enhances cross-modal learning by effectively associating and distinguishing features, improving retrieval accuracy. We conducted comprehensive experiments on the image-based 3D model retrieval dataset MI3DOR and the cross-domain 3D model retrieval dataset NTU-PSB to validate the superiority of the proposed method. Our results show significant improvements on several evaluation metrics, underscoring the efficacy of our method in strengthening cross-modal feature alignment and retrieval performance.
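To make the cross-modal bridge concrete: CLIP's text encoder maps category prompts into the same embedding space as its image encoder, so both a real query photo and a rendered 3D model view can be projected onto shared text anchors and compared through their text-similarity profiles rather than raw pixels. The sketch below is a minimal illustration assuming OpenAI's `clip` package; the prompt template, the category list, and the profile-matching step are our assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch: CLIP text features as a bridge between a 2D query
# image and rendered views of candidate 3D models.
# Assumes OpenAI's `clip` package (pip install git+https://github.com/openai/CLIP.git).
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Illustrative category prompts; the real vocabulary is an assumption.
categories = ["chair", "table", "airplane", "guitar"]
text_tokens = clip.tokenize([f"a photo of a {c}" for c in categories]).to(device)

@torch.no_grad()
def text_similarity_profile(image: Image.Image) -> torch.Tensor:
    """Project an image onto the shared text anchors, yielding a category
    distribution that is comparable across the 2D and 3D-view domains."""
    x = preprocess(image).unsqueeze(0).to(device)
    img_feat = model.encode_image(x)
    txt_feat = model.encode_text(text_tokens)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (100.0 * img_feat @ txt_feat.T).softmax(dim=-1).squeeze(0)

# A query photo and a rendered model view are matched by comparing their
# text-similarity profiles instead of their domain-specific appearance.
query_profile = text_similarity_profile(Image.open("query.jpg"))
view_profile = text_similarity_profile(Image.open("model_view_01.png"))
score = torch.nn.functional.cosine_similarity(query_profile, view_profile, dim=0)
```

Because both profiles live over the same text anchors, differences in lighting, texture, and background between photos and renders matter less than the semantics the text encoder captures.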
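The adaptive network layer targets the gap between natural photos, which CLIP was pre-trained on, and rendered 3D views, which it handles less well. A common recipe for this, shown here as an assumption about the design rather than the paper's actual architecture, is a small residual adapter in the style of CLIP-Adapter: the frozen CLIP feature passes through a bottleneck MLP and is blended back with the original feature.

```python
import torch
import torch.nn as nn

class ViewAdapter(nn.Module):
    """Hypothetical residual adapter for 3D-view features.
    The bottleneck ratio and blending weight alpha are illustrative choices."""
    def __init__(self, dim: int = 512, bottleneck: int = 128, alpha: float = 0.2):
        super().__init__()
        self.alpha = alpha  # how strongly the adapted feature overrides the frozen one
        self.net = nn.Sequential(
            nn.Linear(dim, bottleneck),
            nn.ReLU(inplace=True),
            nn.Linear(bottleneck, dim),
        )

    def forward(self, clip_feat: torch.Tensor) -> torch.Tensor:
        adapted = self.net(clip_feat)
        # Residual blend keeps adapted features close to CLIP's original space,
        # so the text anchors remain meaningful for the 3D-view domain.
        out = self.alpha * adapted + (1 - self.alpha) * clip_feat
        return out / out.norm(dim=-1, keepdim=True)
```

Only the adapter would be trained; the CLIP backbone stays frozen, so query images and text prompts keep their original embeddings while the 3D-view branch is pulled toward them.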
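The abstract describes the mutual nearest-neighbor semantic alignment loss only at a high level; the sketch below is one plausible reading, not the paper's definition: treat a cross-modal pair as a positive only when each sample is the other's nearest neighbor, then apply an InfoNCE-style contrast over the batch. The function name, the temperature, and the fallback for batches without mutual pairs are all our assumptions.

```python
import torch
import torch.nn.functional as F

def mutual_nn_alignment_loss(img_feats: torch.Tensor,
                             view_feats: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """Hypothetical mutual nearest-neighbor contrastive loss.
    img_feats:  (N, D) L2-normalized image features.
    view_feats: (M, D) L2-normalized 3D-view features."""
    sim = img_feats @ view_feats.T  # (N, M) cosine similarities

    # i <-> j is a positive only if j is i's nearest view AND i is j's nearest image.
    nn_img_to_view = sim.argmax(dim=1)  # (N,) nearest view for each image
    nn_view_to_img = sim.argmax(dim=0)  # (M,) nearest image for each view
    mutual = nn_view_to_img[nn_img_to_view] == torch.arange(
        sim.size(0), device=sim.device)

    if not mutual.any():
        # No reliable cross-modal pairs in this batch; contribute nothing.
        return sim.new_zeros(())

    # Contrast each mutually-matched image against all views in the batch.
    logits = sim[mutual] / temperature
    targets = nn_img_to_view[mutual]
    return F.cross_entropy(logits, targets)
```

Restricting positives to mutual nearest neighbors is a standard way to suppress noisy pairings when category correspondence across domains is unknown, which matches the open-domain motivation stated in the abstract.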
Source Journal
Information Processing & Management
CAS category: Engineering & Technology, Computer Science: Information Systems
CiteScore: 17.00
Self-citation rate: 11.60%
Annual articles: 276
Review time: 39 days
Journal introduction: Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology marketing, and social computing. We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.
Latest articles in this journal
Few-shot multi-hop reasoning via reinforcement learning and path search strategy over temporal knowledge graphs
Basis is also explanation: Interpretable Legal Judgment Reasoning prompted by multi-source knowledge
Extracting key insights from earnings call transcript via information-theoretic contrastive learning
Advancing rule learning in knowledge graphs with structure-aware graph transformer
DCIB: Dual contrastive information bottleneck for knowledge-aware recommendation