Adaptive CLIP for open-domain 3D model retrieval

IF 7.4 · CAS Tier 1 (Management Science) · JCR Q1 (Computer Science, Information Systems) · Information Processing & Management · Pub Date: 2024-11-29 · DOI: 10.1016/j.ipm.2024.103989
Dan Song, Zekai Qiang, Chumeng Zhang, Lanjun Wang, Qiong Liu, You Yang, An-An Liu
{"title":"开放域三维模型检索的自适应CLIP","authors":"Dan Song ,&nbsp;Zekai Qiang ,&nbsp;Chumeng Zhang ,&nbsp;Lanjun Wang ,&nbsp;Qiong Liu ,&nbsp;You Yang ,&nbsp;An-An Liu","doi":"10.1016/j.ipm.2024.103989","DOIUrl":null,"url":null,"abstract":"<div><div>In order to effectively enhance the practicality of 3D model retrieval, we adopt a single real image as the query sample for retrieving 3D models. However, the significant differences between 2D images and 3D models in terms of lighting conditions, textures and backgrounds, posing a great challenge for accurate retrieval. Existing work on 3D model retrieval mainly focuses on closed-domain research, while the open-domain condition where the category relationship between the query image and the 3D model is unknown is more in line with the needs of real scenarios. CLIP shows significant promise in comprehending open-world visual concepts, facilitating effective zero-shot image recognition. Based on this multimodal pre-training large language model, we introduce Adaptive Open-domain Semantic Nearest-neighbor Contrast (AOSNC), a method for learning and aligning multi-modal text, image, and 3D model. In order to solve the issue of inconsistent cross-domain categories and difficult sample correlation in open-domain, we construct a cross-modal bridge using CLIP. This model utilizes textual features to bridge the gap between 2D images and 3D model views. Additionally, we design an adaptive network layer to address the limitations of the pre-training model for 3D model views and enhance cross-modal alignment. We propose a mutual nearest-neighbor semantic alignment loss to address the challenge of aligning features from disparate modalities (text, images, and 3D models). This loss function enhances cross-modal learning by effectively associating and distinguishing features, improving retrieval accuracy. We conducted comprehensive experiments using the image-based 3D model retrieval dataset MI3DOR and the cross-domain 3D model retrieval dataset NTU-PSB to validate the superiority of the proposed method. Our results show significant improvements in several evaluation metrics, underscoring the efficacy of our method in augmenting cross-modal feature alignment and retrieval performance.</div></div>","PeriodicalId":50365,"journal":{"name":"Information Processing & Management","volume":"62 2","pages":"Article 103989"},"PeriodicalIF":7.4000,"publicationDate":"2024-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Adaptive CLIP for open-domain 3D model retrieval\",\"authors\":\"Dan Song ,&nbsp;Zekai Qiang ,&nbsp;Chumeng Zhang ,&nbsp;Lanjun Wang ,&nbsp;Qiong Liu ,&nbsp;You Yang ,&nbsp;An-An Liu\",\"doi\":\"10.1016/j.ipm.2024.103989\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>In order to effectively enhance the practicality of 3D model retrieval, we adopt a single real image as the query sample for retrieving 3D models. However, the significant differences between 2D images and 3D models in terms of lighting conditions, textures and backgrounds, posing a great challenge for accurate retrieval. Existing work on 3D model retrieval mainly focuses on closed-domain research, while the open-domain condition where the category relationship between the query image and the 3D model is unknown is more in line with the needs of real scenarios. CLIP shows significant promise in comprehending open-world visual concepts, facilitating effective zero-shot image recognition. 
Based on this multimodal pre-training large language model, we introduce Adaptive Open-domain Semantic Nearest-neighbor Contrast (AOSNC), a method for learning and aligning multi-modal text, image, and 3D model. In order to solve the issue of inconsistent cross-domain categories and difficult sample correlation in open-domain, we construct a cross-modal bridge using CLIP. This model utilizes textual features to bridge the gap between 2D images and 3D model views. Additionally, we design an adaptive network layer to address the limitations of the pre-training model for 3D model views and enhance cross-modal alignment. We propose a mutual nearest-neighbor semantic alignment loss to address the challenge of aligning features from disparate modalities (text, images, and 3D models). This loss function enhances cross-modal learning by effectively associating and distinguishing features, improving retrieval accuracy. We conducted comprehensive experiments using the image-based 3D model retrieval dataset MI3DOR and the cross-domain 3D model retrieval dataset NTU-PSB to validate the superiority of the proposed method. Our results show significant improvements in several evaluation metrics, underscoring the efficacy of our method in augmenting cross-modal feature alignment and retrieval performance.</div></div>\",\"PeriodicalId\":50365,\"journal\":{\"name\":\"Information Processing & Management\",\"volume\":\"62 2\",\"pages\":\"Article 103989\"},\"PeriodicalIF\":7.4000,\"publicationDate\":\"2024-11-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information Processing & Management\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0306457324003480\",\"RegionNum\":1,\"RegionCategory\":\"管理学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Processing & Management","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0306457324003480","RegionNum":1,"RegionCategory":"管理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

In order to effectively enhance the practicality of 3D model retrieval, we adopt a single real image as the query sample for retrieving 3D models. However, 2D images and 3D models differ significantly in lighting conditions, textures, and backgrounds, which poses a great challenge for accurate retrieval. Existing work on 3D model retrieval focuses mainly on the closed-domain setting, whereas the open-domain setting, in which the category relationship between the query image and the 3D models is unknown, better matches the needs of real scenarios. CLIP shows significant promise in comprehending open-world visual concepts, enabling effective zero-shot image recognition. Building on this multimodal pre-trained model, we introduce Adaptive Open-domain Semantic Nearest-neighbor Contrast (AOSNC), a method for learning and aligning multi-modal text, image, and 3D model features. To address inconsistent categories across domains and the difficulty of correlating samples in the open domain, we construct a cross-modal bridge using CLIP, which uses textual features to close the gap between 2D images and 3D model views. Additionally, we design an adaptive network layer that mitigates the pre-trained model's limitations on 3D model views and strengthens cross-modal alignment. We further propose a mutual nearest-neighbor semantic alignment loss to address the challenge of aligning features from disparate modalities (text, images, and 3D models); this loss enhances cross-modal learning by effectively associating and distinguishing features, improving retrieval accuracy. We conducted comprehensive experiments on the image-based 3D model retrieval dataset MI3DOR and the cross-domain 3D model retrieval dataset NTU-PSB to validate the superiority of the proposed method. Our results show significant improvements on several evaluation metrics, underscoring the efficacy of our method in strengthening cross-modal feature alignment and retrieval performance.
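To make the cross-modal bridge concrete: CLIP's text encoder maps category prompts into the same embedding space as its image encoder, so both a real query photo and a rendered 3D model view can be projected onto shared text anchors and compared through their text-similarity profiles rather than raw pixels. The sketch below is a minimal illustration assuming OpenAI's `clip` package; the prompt template, the category list, and the profile-matching step are our assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch: CLIP text features as a bridge between a 2D query
# image and rendered views of candidate 3D models.
# Assumes OpenAI's `clip` package (pip install git+https://github.com/openai/CLIP.git).
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Illustrative category prompts; the real vocabulary is an assumption.
categories = ["chair", "table", "airplane", "guitar"]
text_tokens = clip.tokenize([f"a photo of a {c}" for c in categories]).to(device)

@torch.no_grad()
def text_similarity_profile(image: Image.Image) -> torch.Tensor:
    """Project an image onto the shared text anchors, yielding a category
    distribution that is comparable across the 2D and 3D-view domains."""
    x = preprocess(image).unsqueeze(0).to(device)
    img_feat = model.encode_image(x)
    txt_feat = model.encode_text(text_tokens)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (100.0 * img_feat @ txt_feat.T).softmax(dim=-1).squeeze(0)

# A query photo and a rendered model view are matched by comparing their
# text-similarity profiles instead of their domain-specific appearance.
query_profile = text_similarity_profile(Image.open("query.jpg"))
view_profile = text_similarity_profile(Image.open("model_view_01.png"))
score = torch.nn.functional.cosine_similarity(query_profile, view_profile, dim=0)
```

Because both profiles live over the same text anchors, differences in lighting, texture, and background between photos and renders matter less than the semantics the text encoder captures.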
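The adaptive network layer targets the gap between natural photos, which CLIP was pre-trained on, and rendered 3D views, which it handles less well. A common recipe for this, shown here as an assumption about the design rather than the paper's actual architecture, is a small residual adapter in the style of CLIP-Adapter: the frozen CLIP feature passes through a bottleneck MLP and is blended back with the original feature.

```python
import torch
import torch.nn as nn

class ViewAdapter(nn.Module):
    """Hypothetical residual adapter for 3D-view features.
    The bottleneck ratio and blending weight alpha are illustrative choices."""
    def __init__(self, dim: int = 512, bottleneck: int = 128, alpha: float = 0.2):
        super().__init__()
        self.alpha = alpha  # how strongly the adapted feature overrides the frozen one
        self.net = nn.Sequential(
            nn.Linear(dim, bottleneck),
            nn.ReLU(inplace=True),
            nn.Linear(bottleneck, dim),
        )

    def forward(self, clip_feat: torch.Tensor) -> torch.Tensor:
        adapted = self.net(clip_feat)
        # Residual blend keeps adapted features close to CLIP's original space,
        # so the text anchors remain meaningful for the 3D-view domain.
        out = self.alpha * adapted + (1 - self.alpha) * clip_feat
        return out / out.norm(dim=-1, keepdim=True)
```

Only the adapter would be trained; the CLIP backbone stays frozen, so query images and text prompts keep their original embeddings while the 3D-view branch is pulled toward them.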
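The abstract describes the mutual nearest-neighbor semantic alignment loss only at a high level; the sketch below is one plausible reading, not the paper's definition: treat a cross-modal pair as a positive only when each sample is the other's nearest neighbor, then apply an InfoNCE-style contrast over the batch. The function name, the temperature, and the fallback for batches without mutual pairs are all our assumptions.

```python
import torch
import torch.nn.functional as F

def mutual_nn_alignment_loss(img_feats: torch.Tensor,
                             view_feats: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """Hypothetical mutual nearest-neighbor contrastive loss.
    img_feats:  (N, D) L2-normalized image features.
    view_feats: (M, D) L2-normalized 3D-view features."""
    sim = img_feats @ view_feats.T  # (N, M) cosine similarities

    # i <-> j is a positive only if j is i's nearest view AND i is j's nearest image.
    nn_img_to_view = sim.argmax(dim=1)  # (N,) nearest view for each image
    nn_view_to_img = sim.argmax(dim=0)  # (M,) nearest image for each view
    mutual = nn_view_to_img[nn_img_to_view] == torch.arange(
        sim.size(0), device=sim.device)

    if not mutual.any():
        # No reliable cross-modal pairs in this batch; contribute nothing.
        return sim.new_zeros(())

    # Contrast each mutually-matched image against all views in the batch.
    logits = sim[mutual] / temperature
    targets = nn_img_to_view[mutual]
    return F.cross_entropy(logits, targets)
```

Restricting positives to mutual nearest neighbors is a standard way to suppress noisy pairings when category correspondence across domains is unknown, which matches the open-domain motivation stated in the abstract.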
Source Journal
Information Processing & Management
CAS category: Engineering & Technology, Computer Science: Information Systems
CiteScore: 17.00
Self-citation rate: 11.60%
Annual articles: 276
Review time: 39 days
Journal introduction: Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology marketing, and social computing. We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.
Latest articles in this journal
Few-shot multi-hop reasoning via reinforcement learning and path search strategy over temporal knowledge graphs
Basis is also explanation: Interpretable Legal Judgment Reasoning prompted by multi-source knowledge
Extracting key insights from earnings call transcript via information-theoretic contrastive learning
Advancing rule learning in knowledge graphs with structure-aware graph transformer
DCIB: Dual contrastive information bottleneck for knowledge-aware recommendation