{"title":"3D scene generation for zero-shot learning using ChatGPT guided language prompts","authors":"Sahar Ahmadi , Ali Cheraghian , Townim Faisal Chowdhury , Morteza Saberi , Shafin Rahman","doi":"10.1016/j.cviu.2024.104211","DOIUrl":null,"url":null,"abstract":"<div><div>Zero-shot learning in the realm of 3D point cloud data remains relatively unexplored compared to its 2D image counterpart. This domain introduces fresh challenges due to the absence of robust pre-trained feature extraction models. To tackle this, we introduce a prompt-guided method for 3D scene generation and supervision, enhancing the network’s ability to comprehend the intricate relationships between seen and unseen objects. Initially, we utilize basic prompts resembling scene annotations generated from one or two point cloud objects. Recognizing the limited diversity of basic prompts, we employ ChatGPT to expand them, enriching the contextual information within the descriptions. Subsequently, leveraging these descriptions, we arrange point cloud objects’ coordinates to fabricate augmented 3D scenes. Lastly, employing contrastive learning, we train our proposed architecture end-to-end, utilizing pairs of 3D scenes and prompt-based captions. We posit that 3D scenes facilitate more efficient object relationships than individual objects, as demonstrated by the effectiveness of language models like BERT in contextual understanding. Our prompt-guided scene generation method amalgamates data augmentation and prompt-based annotation, thereby enhancing 3D ZSL performance. We present ZSL and generalized ZSL results on both synthetic (ModelNet40, ModelNet10, and ShapeNet) and real-scanned (ScanOjbectNN) 3D object datasets. Furthermore, we challenge the model by training with synthetic data and testing with real-scanned data, achieving state-of-the-art performance compared to existing 2D and 3D ZSL methods in the literature. Codes and models are available at: <span><span>https://github.com/saharahmadisohraviyeh/ChatGPT_ZSL_3D</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104211"},"PeriodicalIF":4.3000,"publicationDate":"2024-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Vision and Image Understanding","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1077314224002923","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Zero-shot learning on 3D point cloud data remains relatively unexplored compared to its 2D image counterpart. This domain introduces fresh challenges due to the absence of robust pre-trained feature extraction models. To tackle this, we introduce a prompt-guided method for 3D scene generation and supervision that enhances the network's ability to comprehend the intricate relationships between seen and unseen objects. Initially, we utilize basic prompts resembling scene annotations generated from one or two point cloud objects. Recognizing the limited diversity of basic prompts, we employ ChatGPT to expand them, enriching the contextual information within the descriptions. Subsequently, leveraging these descriptions, we arrange the coordinates of point cloud objects to fabricate augmented 3D scenes. Lastly, employing contrastive learning, we train our proposed architecture end-to-end on pairs of 3D scenes and prompt-based captions. We posit that 3D scenes convey object relationships more effectively than individual objects, much as language models like BERT demonstrate the value of context for understanding. Our prompt-guided scene generation method combines data augmentation with prompt-based annotation, thereby enhancing 3D ZSL performance. We present ZSL and generalized ZSL results on both synthetic (ModelNet40, ModelNet10, and ShapeNet) and real-scanned (ScanObjectNN) 3D object datasets. Furthermore, we challenge the model by training on synthetic data and testing on real-scanned data, achieving state-of-the-art performance compared to existing 2D and 3D ZSL methods in the literature. Code and models are available at: https://github.com/saharahmadisohraviyeh/ChatGPT_ZSL_3D.
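To make the pipeline described in the abstract concrete, below is a minimal, hypothetical sketch of its two core ideas: composing an augmented 3D scene from a pair of point cloud objects, and training a scene encoder against caption embeddings with a symmetric contrastive (InfoNCE-style) loss. This is not the authors' released code (see the GitHub link above); all names (compose_scene, SceneEncoder, contrastive_loss), the side-by-side placement rule, and the PointNet-style backbone are illustrative assumptions.

```python
# Illustrative sketch only: scene composition + contrastive scene/caption training.
import torch
import torch.nn as nn
import torch.nn.functional as F

def compose_scene(obj_a: torch.Tensor, obj_b: torch.Tensor, gap: float = 1.0) -> torch.Tensor:
    """Place two point clouds of shape (N, 3) side by side along x to form one scene."""
    a = obj_a - obj_a.mean(dim=0)   # center each object at the origin
    b = obj_b - obj_b.mean(dim=0)
    a[:, 0] -= gap / 2              # shift one object left, the other right
    b[:, 0] += gap / 2
    return torch.cat([a, b], dim=0)  # scene of shape (2N, 3)

class SceneEncoder(nn.Module):
    """Toy per-point MLP with max pooling (PointNet-like); a stand-in
    for whatever 3D backbone the paper actually uses."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, dim))
    def forward(self, pts: torch.Tensor) -> torch.Tensor:  # pts: (B, P, 3)
        return self.mlp(pts).max(dim=1).values             # (B, dim)

def contrastive_loss(scene_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over matched (scene, caption) pairs in a batch."""
    s = F.normalize(scene_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = s @ t.T / temperature                          # (B, B) similarities
    labels = torch.arange(s.size(0), device=s.device)       # diagonal = positives
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

# Example: two toy scenes paired with two caption embeddings.
obj = lambda: torch.randn(256, 3)
scenes = torch.stack([compose_scene(obj(), obj()) for _ in range(2)])  # (2, 512, 3)
scene_emb = SceneEncoder()(scenes)   # (2, 128)
text_emb = torch.randn(2, 128)       # stand-in for a text encoder (e.g. BERT) output
loss = contrastive_loss(scene_emb, text_emb)
```

In the actual method, each composed scene would be paired with its ChatGPT-expanded caption, the caption encoded by a language model, and both sides trained end-to-end with a loss of this kind.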
About the journal:
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems