JM3D & JM3D-LLM: Elevating 3D Representation With Joint Multi-Modal Cues

IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 18.6), Pub Date: 2024-12-30, DOI: 10.1109/TPAMI.2024.3523675

Jiayi Ji;Haowei Wang;Changli Wu;Yiwei Ma;Xiaoshuai Sun;Rongrong Ji

IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 4, pp. 2475-2492. Journal Article. https://ieeexplore.ieee.org/document/10817587/

Citations: 0

Abstract

3D representation learning is increasingly important in computer vision, autonomous driving, and robotics. However, the prevailing approach of directly transferring 2D alignment strategies to the 3D domain encounters three distinct challenges: (1) Information Degradation: 3D data is aligned with only single-view 2D images and generic texts, neglecting the need for multi-view images and detailed subcategory texts. (2) Insufficient Synergy: these strategies align 3D representations to image and text features individually, hampering the overall optimization of 3D models. (3) Underutilization: the fine-grained information inherent in the learned representations is often not fully exploited, indicating a potential loss of detail. To address these issues, we introduce JM3D, a comprehensive approach integrating point cloud, text, and image. Key contributions include the Structured Multimodal Organizer (SMO), which enriches vision-language representation with multiple views and hierarchical text, and the Joint Multi-modal Alignment (JMA), which combines language understanding with visual representation. Our advanced model, JM3D-LLM, marries 3D representation with large language models via efficient fine-tuning. Evaluations on ModelNet40 and ScanObjectNN establish JM3D's superiority. The superior performance of JM3D-LLM further underscores the effectiveness of our representation transfer approach.
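The contrast the abstract draws between aligning 3D features to image and text features individually and aligning them to a joint target can be illustrated with a small contrastive-loss sketch. This is not the paper's implementation: the symmetric InfoNCE loss, the mean-pooling over views, the additive image-text fusion, the temperature value, and all function names below are assumptions chosen only to make the idea concrete.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two row-aligned batches of unit vectors.

    Row i of `a` and row i of `b` form the positive pair; all other
    rows in the batch act as negatives.
    """
    logits = a @ b.T / temperature  # (B, B) similarity matrix

    def diag_cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))


def separate_alignment_loss(pc, img, txt):
    """Baseline: align point-cloud features to image and text independently."""
    return info_nce(pc, img) + info_nce(pc, txt)


def joint_alignment_loss(pc, img_views, txt):
    """Assumed joint variant: align point-cloud features to one fused target.

    img_views has shape (B, V, D): V rendered views per object.
    The fusion (mean over views, then add the text feature) is a
    hypothetical stand-in, not the fusion rule used by JMA.
    """
    img = l2_normalize(img_views.mean(axis=1))
    joint = l2_normalize(img + txt)
    return info_nce(pc, joint)


# Toy usage with random stand-in embeddings (no real encoders involved).
rng = np.random.default_rng(0)
B, V, D = 8, 4, 16
pc = l2_normalize(rng.normal(size=(B, D)))
img_views = l2_normalize(rng.normal(size=(B, V, D)))
txt = l2_normalize(rng.normal(size=(B, D)))

print("separate:", separate_alignment_loss(pc, img_views[:, 0], txt))
print("joint:   ", joint_alignment_loss(pc, img_views, txt))
```

In the separate variant the 3D encoder receives two independent objectives that use a single view each, whereas the joint variant gives it one target that already combines multiple views with the text feature.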

