
IEEE Transactions on Pattern Analysis and Machine Intelligence: Latest Publications

A Novel and Effective Method to Directly Solve Spectral Clustering
Pub Date : 2024-08-21 DOI: 10.1109/TPAMI.2024.3447287
Feiping Nie;Chaodie Liu;Rong Wang;Xuelong Li
Spectral clustering has been attracting increasing attention due to its well-defined framework and excellent performance. However, most traditional spectral clustering methods consist of two separate steps: 1) solving a relaxed optimization problem to learn the continuous clustering labels, and 2) rounding the continuous clustering labels into discrete ones. This relax-and-discretize strategy inevitably results in information loss and unsatisfactory clustering performance. Moreover, the similarity matrix constructed from the original data may not be optimal for clustering since data usually contain noise and redundancy. To address these problems, we propose a novel and effective algorithm to directly optimize the original spectral clustering model, called Direct Spectral Clustering (DSC). We theoretically prove that the original spectral clustering model can be solved by simultaneously learning a weighted discrete indicator matrix and a structured similarity matrix whose number of connected components equals the number of clusters. Both can be used to directly obtain the final clustering results without any post-processing. Further, an effective iterative optimization algorithm is developed to solve the proposed model. Extensive experiments performed on synthetic and real-world datasets demonstrate the superiority and effectiveness of the proposed method compared to state-of-the-art algorithms.
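To make the criticized two-step baseline concrete, here is a minimal Python sketch of the conventional relax-and-discretize pipeline that DSC is designed to avoid (not the paper's method): eigenvectors of the normalized Laplacian provide continuous labels, and k-means rounds them into discrete clusters, which is exactly the lossy post-processing step described above. The Gaussian similarity and all parameter choices are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def relax_and_discretize_sc(X, n_clusters, sigma=1.0):
    # Step 0: Gaussian similarity from raw data (possibly noisy/redundant).
    W = np.exp(-cdist(X, X, 'sqeuclidean') / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Normalized Laplacian L = I - D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-12))
    L = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt
    # Step 1 (relax): continuous labels are the eigenvectors of L for the
    # n_clusters smallest eigenvalues (np.linalg.eigh sorts ascending).
    _, vecs = np.linalg.eigh(L)
    F = vecs[:, :n_clusters]
    # Step 2 (discretize): round continuous labels with k-means, the
    # post-processing step where information is lost.
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(F)
```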
{"title":"A Novel and Effective Method to Directly Solve Spectral Clustering","authors":"Feiping Nie;Chaodie Liu;Rong Wang;Xuelong Li","doi":"10.1109/TPAMI.2024.3447287","DOIUrl":"10.1109/TPAMI.2024.3447287","url":null,"abstract":"Spectral clustering has been attracting increasing attention due to its well-defined framework and excellent performance. However, most traditional spectral clustering methods consist of two separate steps: 1) Solving a relaxed optimization problem to learn the continuous clustering labels, and 2) Rounding the continuous clustering labels into discrete ones. The clustering results of the relax-and-discretize strategy inevitably result in information loss and unsatisfactory clustering performance. Moreover, the similarity matrix constructed from original data may not be optimal for clustering since data usually have noise and redundancy. To address these problems, we propose a novel and effective algorithm to directly optimize the original spectral clustering model, called Direct Spectral Clustering (DSC). We theoretically prove that the original spectral clustering model can be solved by simultaneously learning a weighted discrete indicator matrix and a structured similarity matrix whose connected components are equal to the number of clusters. Both of them can be used to directly obtain the final clustering results without any post-processing. Further, an effective iterative optimization algorithm is exploited to solve the proposed method. Extensive experiments performed on synthetic and real-world datasets demonstrate the superiority and effectiveness of the proposed method compared to the state-of-the-art algorithms.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10863-10875"},"PeriodicalIF":0.0,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142019957","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
CO-Net++: A Cohesive Network for Multiple Point Cloud Tasks at Once With Two-Stage Feature Rectification
Pub Date : 2024-08-21 DOI: 10.1109/TPAMI.2024.3447008
Tao Xie;Kun Dai;Qihao Sun;Zhiqiang Jiang;Chuqing Cao;Lijun Zhao;Ke Wang;Ruifeng Li
We present CO-Net++, a cohesive framework that optimizes multiple point cloud tasks collectively across heterogeneous dataset domains with a two-stage feature rectification strategy. The core of CO-Net++ lies in optimizing task-shared parameters to capture universal features across various tasks while discerning task-specific parameters tailored to encapsulate the unique characteristics of each task. Specifically, CO-Net++ develops a two-stage feature rectification strategy (TFRS) that distinctly separates the optimization processes for task-shared and task-specific parameters. In the first stage, TFRS configures all parameters in the backbone as task-shared, which encourages CO-Net++ to thoroughly assimilate universal attributes pertinent to all tasks. In addition, TFRS introduces a sign-based gradient surgery to facilitate the optimization of task-shared parameters, thus alleviating conflicting gradients induced by various dataset domains. In the second stage, TFRS freezes the task-shared parameters and flexibly integrates task-specific parameters into the network for encoding the specific characteristics of each dataset domain. CO-Net++ prominently mitigates the conflicting optimization caused by parameter entanglement, ensuring sufficient identification of universal and specific features. Extensive experiments reveal that CO-Net++ achieves exceptional performance on both 3D object detection and 3D semantic segmentation tasks. Moreover, CO-Net++ delivers an impressive incremental learning capability and prevents catastrophic amnesia when generalizing to new point cloud tasks.
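The sign-based gradient surgery is only named, not specified, in the abstract. The sketch below shows one plausible reading under our own assumptions: per-parameter gradient components whose signs disagree across task or domain gradients are zeroed before updating the task-shared backbone; the paper's exact rule may differ.

```python
import numpy as np

def sign_surgery(task_grads):
    """task_grads: list of 1-D arrays, one flattened gradient per task/domain."""
    G = np.stack(task_grads)                     # (n_tasks, n_params)
    signs = np.sign(G)
    # Keep a component only where every task agrees on its sign (assumed rule).
    agree = np.abs(signs.sum(axis=0)) == len(task_grads)
    return np.where(agree, G.mean(axis=0), 0.0)  # conflicting dims are zeroed

# Example: the second component conflicts across tasks and is dropped.
g = sign_surgery([np.array([0.5, -1.0, 2.0]), np.array([0.3, 0.8, 1.0])])
print(g)  # [0.4 0.  1.5]
```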
{"title":"CO-Net++: A Cohesive Network for Multiple Point Cloud Tasks at Once With Two-Stage Feature Rectification","authors":"Tao Xie;Kun Dai;Qihao Sun;Zhiqiang Jiang;Chuqing Cao;Lijun Zhao;Ke Wang;Ruifeng Li","doi":"10.1109/TPAMI.2024.3447008","DOIUrl":"10.1109/TPAMI.2024.3447008","url":null,"abstract":"We present CO-Net++, a cohesive framework that optimizes multiple point cloud tasks collectively across heterogeneous dataset domains with a two-stage feature rectification strategy. The core of CO-Net++ lies in optimizing task-shared parameters to capture universal features across various tasks while discerning task-specific parameters tailored to encapsulate the unique characteristics of each task. Specifically, CO-Net++ develops a two-stage feature rectification strategy (TFRS) that distinctly separates the optimization processes for task-shared and task-specific parameters. At the first stage, TFRS configures all parameters in backbone as task-shared, which encourages CO-Net++ to thoroughly assimilate universal attributes pertinent to all tasks. In addition, TFRS introduces a sign-based gradient surgery to facilitate the optimization of task-shared parameters, thus alleviating conflicting gradients induced by various dataset domains. In the second stage, TFRS freezes task-shared parameters and flexibly integrates task-specific parameters into the network for encoding specific characteristics of each dataset domain. CO-Net++ prominently mitigates conflicting optimization caused by parameter entanglement, ensuring the sufficient identification of universal and specific features. Extensive experiments reveal that CO-Net++ realizes exceptional performances on both 3D object detection and 3D semantic segmentation tasks. Moreover, CO-Net++ delivers an impressive incremental learning capability and prevents catastrophic amnesia when generalizing to new point cloud tasks.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10911-10928"},"PeriodicalIF":0.0,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142019960","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Q-Bench+: A Benchmark for Multi-Modal Foundation Models on Low-Level Vision From Single Images to Pairs
Pub Date : 2024-08-21 DOI: 10.1109/TPAMI.2024.3445770
Zicheng Zhang;Haoning Wu;Erli Zhang;Guangtao Zhai;Weisi Lin
The rapid development of Multi-modality Large Language Models (MLLMs) has navigated a paradigm shift in computer vision, moving towards versatile foundational models. However, evaluating MLLMs in low-level visual perception and understanding remains a yet-to-explore domain. To this end, we design benchmark settings to emulate human language responses related to low-level vision: the low-level visual perception (A1) via visual question answering related to low-level attributes (e.g., clarity, lighting); and the low-level visual description (A2), on evaluating MLLMs for low-level text descriptions. Furthermore, given that pairwise comparison can better avoid ambiguity of responses and has been adopted by many human experiments, we further extend the low-level perception-related question-answering and description evaluations of MLLMs from single images to image pairs. Specifically, for perception (A1), we construct the LLVisionQA+ dataset, comprising 2,990 single images and 1,999 image pairs, each accompanied by an open-ended question about its low-level features; for description (A2), we propose the LLDescribe+ dataset, evaluating MLLMs for low-level descriptions on 499 single images and 450 pairs. Additionally, we evaluate MLLMs on assessment (A3) ability, i.e., predicting scores, by employing a softmax-based approach to enable all MLLMs to generate quantifiable quality ratings, tested against human opinions in 7 image quality assessment (IQA) datasets. With 24 MLLMs under evaluation, we demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than on single-image evaluations (like humans). We hope that our benchmark will motivate further research into uncovering and enhancing these nascent capabilities of MLLMs. The datasets will be released at https://github.com/Q-Future/Q-Bench.
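A minimal sketch of the softmax-based scoring idea for assessment (A3): rather than decoding a word, one reads the logits the MLLM assigns to a pair of antonym answer tokens (e.g., "good" vs. "poor") and converts them into a continuous, quantifiable rating. The two-token setup is an illustrative assumption.

```python
import numpy as np

def softmax_quality_score(logit_good, logit_poor):
    """Map two answer-token logits to a quantifiable rating in [0, 1]."""
    z = np.array([logit_good, logit_poor])
    p = np.exp(z - z.max())  # numerically stable softmax
    p /= p.sum()
    return p[0]              # probability mass on the positive token

print(softmax_quality_score(3.2, 1.1))  # ~0.89
```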
{"title":"Q-Bench$^+$+: A Benchmark for Multi-Modal Foundation Models on Low-Level Vision From Single Images to Pairs","authors":"Zicheng Zhang;Haoning Wu;Erli Zhang;Guangtao Zhai;Weisi Lin","doi":"10.1109/TPAMI.2024.3445770","DOIUrl":"10.1109/TPAMI.2024.3445770","url":null,"abstract":"The rapid development of Multi-modality Large Language Models (MLLMs) has navigated a paradigm shift in computer vision, moving towards versatile foundational models. However, evaluating MLLMs in \u0000<i>low-level visual perception and understanding</i>\u0000 remains a yet-to-explore domain. To this end, we design benchmark settings to \u0000<i>emulate human language responses</i>\u0000 related to low-level vision: the low-level visual \u0000<i>perception</i>\u0000 (\u0000<u>A1</u>\u0000) \u0000<i>via</i>\u0000 visual question answering related to low-level attributes (\u0000<i>e.g. clarity, lighting</i>\u0000); and the low-level visual \u0000<i>description</i>\u0000 (\u0000<u>A2</u>\u0000), on evaluating MLLMs for low-level text descriptions. Furthermore, given that pairwise comparison can better avoid ambiguity of responses and has been adopted by many human experiments, we further extend the low-level perception-related question-answering and description evaluations of MLLMs from single images to \u0000<i>image pairs</i>\u0000. Specifically, for \u0000<i>perception</i>\u0000 (A1), we carry out the LLVisionQA\u0000<inline-formula><tex-math>$^{+}$</tex-math></inline-formula>\u0000 dataset, comprising 2,990 single images and 1,999 image pairs each accompanied by an open-ended question about its low-level features; for \u0000<bold/>\u0000<i>description</i>\u0000<bold/>\u0000 (A2), we propose the LLDescribe\u0000<inline-formula><tex-math>$^{+}$</tex-math></inline-formula>\u0000 dataset, evaluating MLLMs for low-level descriptions on 499 single images and 450 pairs. Additionally, we evaluate MLLMs on \u0000<bold/>\u0000<i>assessment</i>\u0000<bold/>\u0000 (A3) ability, \u0000<i>i.e.</i>\u0000 predicting score, by employing a softmax-based approach to enable all MLLMs to generate \u0000<i>quantifiable</i>\u0000 quality ratings, tested against human opinions in 7 image quality assessment (IQA) datasets. With 24 MLLMs under evaluation, we demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than single image evaluations (\u0000<i>like humans</i>\u0000). We hope that our benchmark will motivate further research into uncovering and enhancing these nascent capabilities of MLLMs.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10404-10418"},"PeriodicalIF":0.0,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142019987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Tensorized and Compressed Multi-View Subspace Clustering via Structured Constraint
Pub Date : 2024-08-20 DOI: 10.1109/TPAMI.2024.3446537
Wei Chang;Huimin Chen;Feiping Nie;Rong Wang;Xuelong Li
Multi-view learning has attracted increasing attention in recent years. However, traditional approaches only focus on the differences among views while ignoring their consistency. This may render views containing abnormal or noisy data ineffective during view learning. Besides, current datasets have gradually become high-dimensional and large-scale. Therefore, this paper proposes a novel multi-view compressed subspace learning method via a low-rank tensor constraint, which incorporates the clustering process and multi-view learning into a unified framework. First, for each view, we take a subset of the samples to build a small dictionary, which greatly reduces both redundant information and computation cost. Then, to find the consistency and differences among views, we impose a low-rank tensor constraint on these representations and further design an auto-weighted mechanism to learn the optimal representation. Last, because the learned representation is non-square, a bipartite graph is introduced, and under the structured constraint, the clustering results can be obtained directly from this graph without any post-processing. Extensive experiments on synthetic and real-world benchmark datasets demonstrate the efficacy and efficiency of our method, especially for views with noise or outliers.
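A hedged sketch of the small-dictionary idea: anchor samples shared across views act as a per-view dictionary, each view is coded against it (ridge coding here is our own stand-in), and the fused n-by-m representation defines a bipartite graph. The k-means step below replaces the paper's structured constraint, which obtains clusters directly from the graph's connected components without post-processing.

```python
import numpy as np
from sklearn.cluster import KMeans

def multiview_anchor_graph(views, m=50, lam=1e-2, seed=0):
    """views: list of (n, d_v) arrays. Returns a fused (n, m) bipartite graph."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(views[0]), size=m, replace=False)      # shared anchors
    Zs = []
    for X in views:
        A = X[idx]                                              # small dictionary
        Z = X @ A.T @ np.linalg.inv(A @ A.T + lam * np.eye(m))  # ridge coding
        Zs.append(np.clip(Z, 0.0, None))                        # nonnegative weights
    return np.mean(Zs, axis=0)                                  # fuse views

def bipartite_cluster(views, k, m=50):
    Z = multiview_anchor_graph(views, m)
    d1 = Z.sum(axis=1, keepdims=True) + 1e-12                   # sample degrees
    d2 = Z.sum(axis=0, keepdims=True) + 1e-12                   # anchor degrees
    U, _, _ = np.linalg.svd(Z / np.sqrt(d1) / np.sqrt(d2), full_matrices=False)
    return KMeans(n_clusters=k, n_init=10).fit_predict(U[:, :k])
```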
{"title":"Tensorized and Compressed Multi-View Subspace Clustering via Structured Constraint","authors":"Wei Chang;Huimin Chen;Feiping Nie;Rong Wang;Xuelong Li","doi":"10.1109/TPAMI.2024.3446537","DOIUrl":"10.1109/TPAMI.2024.3446537","url":null,"abstract":"Multi-view learning has raised more and more attention in recent years. However, traditional approaches only focus on the difference while ignoring the consistency among views. It may make some views, with the situation of data abnormality or noise, ineffective in the progress of view learning. Besides, the current datasets have become high-dimensional and large-scale gradually. Therefore, this paper proposes a novel multi-view compressed subspace learning method via low-rank tensor constraint, which incorporates the clustering progress and multi-view learning into a unified framework. First, for each view, we take the partial samples to build a small-size dictionary, which can reduce the effect of both redundancy information and computation cost greatly. Then, to find the consistency and difference among views, we impose a low-rank tensor constraint on these representations and further design an auto-weighted mechanism to learn the optimal representation. Last, due to the non-square of the learned representation, the bipartite graph has been introduced, and under the structured constraint, the clustering results can be obtained directly from this graph without any post-processing. Extensive experiments on synthetic and real-world benchmark datasets demonstrate the efficacy and efficiency of our method, especially for the views with noise or outliers.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10434-10451"},"PeriodicalIF":0.0,"publicationDate":"2024-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142010127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
A Survey on Graph Neural Networks and Graph Transformers in Computer Vision: A Task-Oriented Perspective
Pub Date : 2024-08-19 DOI: 10.1109/TPAMI.2024.3445463
Chaoqi Chen;Yushuang Wu;Qiyuan Dai;Hong-Yu Zhou;Mutian Xu;Sibei Yang;Xiaoguang Han;Yizhou Yu
Graph Neural Networks (GNNs) have gained momentum in graph representation learning and boosted the state of the art in a variety of areas, such as data mining (e.g., social network analysis and recommender systems), computer vision (e.g., object detection and point cloud learning), and natural language processing (e.g., relation extraction and sequence learning), to name a few. With the emergence of Transformers in natural language processing and computer vision, graph Transformers embed a graph structure into the Transformer architecture to overcome the limitations of local neighborhood aggregation while avoiding strict structural inductive biases. In this paper, we present a comprehensive review of GNNs and graph Transformers in computer vision from a task-oriented perspective. Specifically, we divide their applications in computer vision into five categories according to the modality of input data, i.e., 2D natural images, videos, 3D data, vision + language, and medical images. In each category, we further divide the applications according to a set of vision tasks. Such a task-oriented taxonomy allows us to examine how each task is tackled by different GNN-based approaches and how well these approaches perform. Based on the necessary preliminaries, we provide the definitions and challenges of the tasks, in-depth coverage of the representative approaches, as well as discussions regarding insights, limitations, and future directions.
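As a one-glance illustration of the local neighborhood aggregation that graph Transformers relax with global attention, here is a minimal mean-aggregation GNN layer; this is a generic sketch, not tied to any surveyed method.

```python
import numpy as np

def gnn_layer(H, A, W):
    """H: (n, d) node features; A: (n, n) adjacency; W: (d, d_out) weights."""
    A_hat = A + np.eye(len(A))                   # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)
    return np.maximum((A_hat / deg) @ H @ W, 0)  # mean-aggregate, project, ReLU
```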
{"title":"A Survey on Graph Neural Networks and Graph Transformers in Computer Vision: A Task-Oriented Perspective","authors":"Chaoqi Chen;Yushuang Wu;Qiyuan Dai;Hong-Yu Zhou;Mutian Xu;Sibei Yang;Xiaoguang Han;Yizhou Yu","doi":"10.1109/TPAMI.2024.3445463","DOIUrl":"10.1109/TPAMI.2024.3445463","url":null,"abstract":"Graph Neural Networks (GNNs) have gained momentum in graph representation learning and boosted the state of the art in a variety of areas, such as data mining (e.g., social network analysis and recommender systems), computer vision (e.g., object detection and point cloud learning), and natural language processing (e.g., relation extraction and sequence learning), to name a few. With the emergence of Transformers in natural language processing and computer vision, graph Transformers embed a graph structure into the Transformer architecture to overcome the limitations of local neighborhood aggregation while avoiding strict structural inductive biases. In this paper, we present a comprehensive review of GNNs and graph Transformers in computer vision from a task-oriented perspective. Specifically, we divide their applications in computer vision into five categories according to the modality of input data, i.e., 2D natural images, videos, 3D data, vision + language, and medical images. In each category, we further divide the applications according to a set of vision tasks. Such a task-oriented taxonomy allows us to examine how each task is tackled by different GNN-based approaches and how well these approaches perform. Based on the necessary preliminaries, we provide the definitions and challenges of the tasks, in-depth coverage of the representative approaches, as well as discussions regarding insights, limitations, and future directions.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10297-10318"},"PeriodicalIF":0.0,"publicationDate":"2024-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142006162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Approaching the Global Nash Equilibrium of Non-Convex Multi-Player Games
Pub Date : 2024-08-19 DOI: 10.1109/TPAMI.2024.3445666
Guanpu Chen;Gehui Xu;Fengxiang He;Yiguang Hong;Leszek Rutkowski;Dacheng Tao
Many machine learning problems can be formulated as non-convex multi-player games. Due to non-convexity, it is challenging to obtain the existence condition of the global Nash equilibrium (NE) and design theoretically guaranteed algorithms. This paper studies a class of non-convex multi-player games, where players' payoff functions consist of canonical functions and quadratic operators. We leverage conjugate properties to transform the complementary problem into a variational inequality (VI) problem using a continuous pseudo-gradient mapping. We prove the existence condition of the global NE: the solution to the VI problem satisfies a duality relation. We then design an ordinary differential equation to approach the global NE with an exponential convergence rate. For practical implementation, we derive a discretized algorithm and apply it to two scenarios: multi-player games with generalized monotonicity and multi-player potential games. In the two settings, step sizes are required to be $\mathcal{O}(1/k)$ and $\mathcal{O}(1/\sqrt{k})$ to yield the convergence rates of $\mathcal{O}(1/k)$ and $\mathcal{O}(1/\sqrt{k})$, respectively. Extensive experiments on robust neural network training and sensor network localization validate our theory. Our code is available at https://github.com/GuanpuChen/Global-NE.
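To illustrate the flavor of the discretized scheme on a toy problem (our own example, not the paper's canonical-plus-quadratic class), the sketch below runs per-player pseudo-gradient descent with the $\mathcal{O}(1/k)$ step size quoted for the generalized monotone setting; this two-player quadratic game has its global NE at the origin.

```python
import numpy as np

def pseudo_grad(x):
    """Stacked partial gradients for a toy two-player quadratic game."""
    x1, x2 = x
    return np.array([2 * x1 + x2,   # d/dx1 of player 1's cost x1^2 + x1*x2
                     2 * x2 + x1])  # d/dx2 of player 2's cost x2^2 + x1*x2

x = np.array([1.0, -2.0])
for k in range(1, 2001):
    x = x - (1.0 / k) * pseudo_grad(x)  # O(1/k) step size
print(x)  # converges to the global NE at (0, 0)
```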
{"title":"Approaching the Global Nash Equilibrium of Non-Convex Multi-Player Games","authors":"Guanpu Chen;Gehui Xu;Fengxiang He;Yiguang Hong;Leszek Rutkowski;Dacheng Tao","doi":"10.1109/TPAMI.2024.3445666","DOIUrl":"10.1109/TPAMI.2024.3445666","url":null,"abstract":"Many machine learning problems can be formulated as non-convex multi-player games. Due to non-convexity, it is challenging to obtain the existence condition of the global Nash equilibrium (NE) and design theoretically guaranteed algorithms. This paper studies a class of non-convex multi-player games, where players’ payoff functions consist of canonical functions and quadratic operators. We leverage conjugate properties to transform the complementary problem into a variational inequality (VI) problem using a continuous pseudo-gradient mapping. We prove the existence condition of the global NE as the solution to the VI problem satisfies a duality relation. We then design an ordinary differential equation to approach the global NE with an exponential convergence rate. For practical implementation, we derive a discretized algorithm and apply it to two scenarios: multi-player games with generalized monotonicity and multi-player potential games. In the two settings, step sizes are required to be \u0000<inline-formula><tex-math>$mathcal {O}(1/k)$</tex-math></inline-formula>\u0000 and \u0000<inline-formula><tex-math>$mathcal {O}(1/sqrt{k})$</tex-math></inline-formula>\u0000 to yield the convergence rates of \u0000<inline-formula><tex-math>$mathcal {O}(1/ k)$</tex-math></inline-formula>\u0000 and \u0000<inline-formula><tex-math>$mathcal {O}(1/sqrt{k})$</tex-math></inline-formula>\u0000, respectively. Extensive experiments on robust neural network training and sensor network localization validate our theory. Our code is available at \u0000<uri>https://github.com/GuanpuChen/Global-NE</uri>\u0000.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10797-10813"},"PeriodicalIF":0.0,"publicationDate":"2024-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142006163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Unsupervised Part Discovery via Dual Representation Alignment
Pub Date : 2024-08-19 DOI: 10.1109/TPAMI.2024.3445582
Jiahao Xia;Wenjian Huang;Min Xu;Jianguo Zhang;Haimin Zhang;Ziyu Sheng;Dong Xu
Object parts serve as crucial intermediate representations in various downstream tasks, but part-level representation learning still has not received as much attention as other vision tasks. Previous research has established that Vision Transformers can learn instance-level attention without labels, extracting high-quality instance-level representations that boost downstream tasks. In this paper, we achieve unsupervised part-specific attention learning using a novel paradigm and further employ the part representations to improve part discovery performance. Specifically, paired images are generated from the same image with different geometric transformations, and multiple part representations are extracted from these paired images using a novel module, named PartFormer. These part representations from the paired images are then exchanged to improve geometric transformation invariance. Subsequently, the part representations are aligned with the feature map extracted by a feature map encoder, achieving high similarity with the pixel representations of the corresponding part regions and low similarity in irrelevant regions. Finally, geometric and semantic constraints are applied to the part representations through the intermediate alignment results for part-specific attention learning, encouraging the PartFormer to focus locally and the part representations to explicitly encode the information of the corresponding parts. Moreover, the aligned part representations can further serve as a series of reliable detectors in the testing phase, predicting pixel masks for part discovery. Extensive experiments are carried out on four widely used datasets, and our results demonstrate that the proposed method achieves competitive performance and robustness due to its part-specific attention. The code will be released after the paper is accepted.
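A minimal sketch of the test-time detector usage described above, under assumed shapes: cosine similarity between each aligned part representation and every pixel feature, followed by an argmax over parts, yields pixel masks with no post-processing. The temperature is an illustrative assumption.

```python
import numpy as np

def part_masks(parts, feat_map, tau=0.1):
    """parts: (K, D) aligned part representations; feat_map: (H, W, D) pixels."""
    P = parts / np.linalg.norm(parts, axis=1, keepdims=True)
    F = feat_map / np.linalg.norm(feat_map, axis=2, keepdims=True)
    sim = np.einsum('kd,hwd->khw', P, F) / tau  # part-to-pixel similarity
    return sim.argmax(axis=0)                   # (H, W) hard part assignment
```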
{"title":"Unsupervised Part Discovery via Dual Representation Alignment","authors":"Jiahao Xia;Wenjian Huang;Min Xu;Jianguo Zhang;Haimin Zhang;Ziyu Sheng;Dong Xu","doi":"10.1109/TPAMI.2024.3445582","DOIUrl":"10.1109/TPAMI.2024.3445582","url":null,"abstract":"Object parts serve as crucial intermediate representations in various downstream tasks, but part-level representation learning still has not received as much attention as other vision tasks. Previous research has established that Vision Transformer can learn instance-level attention without labels, extracting high-quality instance-level representations for boosting downstream tasks. In this paper, we achieve unsupervised part-specific attention learning using a novel paradigm and further employ the part representations to improve part discovery performance. Specifically, paired images are generated from the same image with different geometric transformations, and multiple part representations are extracted from these paired images using a novel module, named PartFormer. These part representations from the paired images are then exchanged to improve geometric transformation invariance. Subsequently, the part representations are aligned with the feature map extracted by a feature map encoder, achieving high similarity with the pixel representations of the corresponding part regions and low similarity in irrelevant regions. Finally, the geometric and semantic constraints are applied to the part representations through the intermediate results in alignment for part-specific attention learning, encouraging the PartFormer to focus locally and the part representations to explicitly include the information of the corresponding parts. Moreover, the aligned part representations can further serve as a series of reliable detectors in the testing phase, predicting pixel masks for part discovery. Extensive experiments are carried out on four widely used datasets, and our results demonstrate that the proposed method achieves competitive performance and robustness due to its part-specific attention.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10597-10613"},"PeriodicalIF":0.0,"publicationDate":"2024-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142006185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
SEA++: Multi-Graph-Based Higher-Order Sensor Alignment for Multivariate Time-Series Unsupervised Domain Adaptation
Pub Date : 2024-08-16 DOI: 10.1109/TPAMI.2024.3444904
Yucheng Wang;Yuecong Xu;Jianfei Yang;Min Wu;Xiaoli Li;Lihua Xie;Zhenghua Chen
Unsupervised Domain Adaptation (UDA) methods have been successful in reducing label dependency by minimizing the domain discrepancy between labeled source domains and unlabeled target domains. However, these methods face challenges when dealing with Multivariate Time-Series (MTS) data. MTS data typically originates from multiple sensors, each with its unique distribution. This property poses difficulties in adapting existing UDA techniques, which mainly focus on aligning global features while overlooking the distribution discrepancies at the sensor level, thus limiting their effectiveness for MTS data. To address this issue, a practical domain adaptation scenario is formulated as Multivariate Time-Series Unsupervised Domain Adaptation (MTS-UDA). In this paper, we propose SEnsor Alignment (SEA) for MTS-UDA, aiming to address domain discrepancy at both local and global sensor levels. At the local sensor level, we design endo-feature alignment, which aligns sensor features and their correlations across domains. To reduce domain discrepancy at the global sensor level, we design exo-feature alignment that enforces restrictions on global sensor features. We further extend SEA to SEA++ by enhancing the endo-feature alignment. Particularly, we incorporate multi-graph-based higher-order alignment for both sensor features and their correlations. Extensive empirical results have demonstrated the state-of-the-art performance of our SEA and SEA++ on six public MTS datasets for MTS-UDA.
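A hedged sketch of the two alignment levels: a discrepancy penalty applied per sensor (the local, endo-feature level) and on pooled features (the global, exo-feature level). The plain mean-matching loss below is a simplification of the paper's multi-graph-based alignment, not its actual objective.

```python
import numpy as np

def sensor_alignment_loss(src, tgt):
    """src, tgt: (batch, n_sensors, d) features from source/target domains."""
    endo = np.mean((src.mean(axis=0) - tgt.mean(axis=0)) ** 2)           # per-sensor (local)
    exo = np.mean((src.mean(axis=(0, 1)) - tgt.mean(axis=(0, 1))) ** 2)  # pooled (global)
    return endo + exo
```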
{"title":"SEA++: Multi-Graph-Based Higher-Order Sensor Alignment for Multivariate Time-Series Unsupervised Domain Adaptation","authors":"Yucheng Wang;Yuecong Xu;Jianfei Yang;Min Wu;Xiaoli Li;Lihua Xie;Zhenghua Chen","doi":"10.1109/TPAMI.2024.3444904","DOIUrl":"10.1109/TPAMI.2024.3444904","url":null,"abstract":"Unsupervised Domain Adaptation (UDA) methods have been successful in reducing label dependency by minimizing the domain discrepancy between labeled source domains and unlabeled target domains. However, these methods face challenges when dealing with Multivariate Time-Series (MTS) data. MTS data typically originates from multiple sensors, each with its unique distribution. This property poses difficulties in adapting existing UDA techniques, which mainly focus on aligning global features while overlooking the distribution discrepancies at the sensor level, thus limiting their effectiveness for MTS data. To address this issue, a practical domain adaptation scenario is formulated as Multivariate Time-Series Unsupervised Domain Adaptation (MTS-UDA). In this paper, we propose SEnsor Alignment (SEA) for MTS-UDA, aiming to address domain discrepancy at both local and global sensor levels. At the local sensor level, we design endo-feature alignment, which aligns sensor features and their correlations across domains. To reduce domain discrepancy at the global sensor level, we design exo-feature alignment that enforces restrictions on global sensor features. We further extend SEA to SEA++ by enhancing the endo-feature alignment. Particularly, we incorporate multi-graph-based higher-order alignment for both sensor features and their correlations. Extensive empirical results have demonstrated the state-of-the-art performance of our SEA and SEA++ on six public MTS datasets for MTS-UDA.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10781-10796"},"PeriodicalIF":0.0,"publicationDate":"2024-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141992479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
T-Net++: Effective Permutation-Equivariance Network for Two-View Correspondence Pruning
Pub Date : 2024-08-16 DOI: 10.1109/TPAMI.2024.3444457
Guobao Xiao;Xin Liu;Zhen Zhong;Xiaoqin Zhang;Jiayi Ma;Haibin Ling
We propose a conceptually novel, flexible, and effective framework (named T-Net++) for the task of two-view correspondence pruning. T-Net++ comprises two unique structures: the "−" structure and the "|" structure. The "−" structure utilizes an iterative learning strategy to process correspondences, while the "|" structure integrates all feature information of the "−" structure and produces inlier weights. Moreover, within the "|" structure, we design a new Local-Global Attention Fusion module to fully exploit valuable information obtained from concatenating features through channel-wise and spatial-wise relationships. Furthermore, we develop a Channel-Spatial Squeeze-and-Excitation module, a modified network backbone that enhances the representation ability of important channels and correspondences through the squeeze-and-excitation operation. T-Net++ not only preserves permutation-equivariance for correspondence pruning, but also gathers rich contextual information, thereby enhancing the effectiveness of the network. Experimental results demonstrate that T-Net++ outperforms other state-of-the-art correspondence pruning methods on various benchmarks and excels in two extended tasks. Our code will be released at https://github.com/guobaoxiao/T-Net.
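Permutation-equivariance is the key structural property here. The sketch below shows the context-normalization block common to correspondence-pruning networks (a generic stand-in under our assumptions, not T-Net++'s exact layers): because the statistics are computed across the whole set of correspondences, reordering the input matches simply reorders the output rows.

```python
import numpy as np

def context_norm(feats, eps=1e-5):
    """feats: (n_correspondences, d). Permuting the rows permutes the output
    rows identically, since statistics are shared across the whole set."""
    mu = feats.mean(axis=0, keepdims=True)
    sigma = feats.std(axis=0, keepdims=True)
    return (feats - mu) / (sigma + eps)
```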
{"title":"T-Net++: Effective Permutation-Equivariance Network for Two-View Correspondence Pruning","authors":"Guobao Xiao;Xin Liu;Zhen Zhong;Xiaoqin Zhang;Jiayi Ma;Haibin Ling","doi":"10.1109/TPAMI.2024.3444457","DOIUrl":"10.1109/TPAMI.2024.3444457","url":null,"abstract":"We propose a conceptually novel, flexible, and effective framework (named T-Net++) for the task of two-view correspondence pruning. T-Net++ comprises two unique structures: the \u0000<inline-formula><tex-math>$hbox{``}-$</tex-math></inline-formula>\u0000'' structure and the \u0000<inline-formula><tex-math>$hbox{``}|$</tex-math></inline-formula>\u0000'' structure. The \u0000<inline-formula><tex-math>$hbox{``}-$</tex-math></inline-formula>\u0000'' structure utilizes an iterative learning strategy to process correspondences, while the \u0000<inline-formula><tex-math>$hbox{``}|$</tex-math></inline-formula>\u0000'' structure integrates all feature information of the \u0000<inline-formula><tex-math>$hbox{``}-$</tex-math></inline-formula>\u0000'' structure and produces inlier weights. Moreover, within the \u0000<inline-formula><tex-math>$hbox{``}|$</tex-math></inline-formula>\u0000'' structure, we design a new Local-Global Attention Fusion module to fully exploit valuable information obtained from concatenating features through channel-wise and spatial-wise relationships. Furthermore, we develop a Channel-Spatial Squeeze-and-Excitation module, a modified network backbone that enhances the representation ability of important channels and correspondences through the squeeze-and-excitation operation. T-Net++ not only preserves the permutation-equivariance manner for correspondence pruning, but also gathers rich contextual information, thereby enhancing the effectiveness of the network. Experimental results demonstrate that T-Net++ outperforms other state-of-the-art correspondence pruning methods on various benchmarks and excels in two extended tasks.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10629-10644"},"PeriodicalIF":0.0,"publicationDate":"2024-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141992480","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Metric3D v2: A Versatile Monocular Geometric Foundation Model for Zero-Shot Metric Depth and Surface Normal Estimation
Pub Date : 2024-08-16 DOI: 10.1109/TPAMI.2024.3444912
Mu Hu;Wei Yin;Chi Zhang;Zhipeng Cai;Xiaoxiao Long;Hao Chen;Kaixuan Wang;Gang Yu;Chunhua Shen;Shaojie Shen
We introduce Metric3D v2, a geometric foundation model designed for zero-shot metric depth and surface normal estimation from single images, critical for accurate 3D recovery. Depth and normal estimation, though complementary, present distinct challenges. State-of-the-art monocular depth methods achieve zero-shot generalization through affine-invariant depths, but fail to recover real-world metric scale. Conversely, current normal estimation techniques struggle with zero-shot performance due to insufficient labeled data. We propose targeted solutions for both metric depth and normal estimation. For metric depth, we present a canonical camera space transformation module that resolves metric ambiguity across various camera models and large-scale datasets, and which can be easily integrated into existing monocular models. For surface normal estimation, we introduce a joint depth-normal optimization module that leverages diverse data from metric depth, allowing normal estimators to improve beyond traditional labels. Our model, trained on over 16 million images from thousands of camera models with varied annotations, excels in zero-shot generalization to new camera settings. As shown in Fig. 1, it ranks first in multiple zero-shot and standard benchmarks for metric depth and surface normal prediction; notably, it surpasses the recent MarigoldDepth and DepthAnything on various depth benchmarks, including NYUv2 and KITTI. Our method enables the accurate recovery of metric 3D structures on randomly collected internet images, paving the way for plausible single-image metrology. Our model also relieves the scale drift issues of monocular SLAM (Fig. 3), leading to high-quality metric-scale dense mapping. Such applications highlight the versatility of Metric3D v2 models as geometric foundation models. Our project page is at https://JUGGHM.github.io/Metric3Dv2.
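A hedged sketch of the canonical camera space transformation idea: every image is rescaled as if captured at a fixed canonical focal length, the network predicts depth in that canonical space, and the focal ratio maps the prediction back to metric depth. The canonical focal value and the exact transform are illustrative assumptions; OpenCV is used only for resizing.

```python
import cv2  # OpenCV, used only for image resizing
import numpy as np

CANONICAL_FOCAL = 1000.0  # assumed canonical focal length, in pixels

def to_canonical_space(image, focal):
    """Rescale the image as if it were captured at the canonical focal length."""
    s = CANONICAL_FOCAL / focal
    h, w = image.shape[:2]
    return cv2.resize(image, (max(1, round(w * s)), max(1, round(h * s))))

def to_metric_depth(canonical_depth, focal):
    """Map the network's canonical-space depth back to metric depth."""
    return canonical_depth * focal / CANONICAL_FOCAL
```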
{"title":"Metric3D v2: A Versatile Monocular Geometric Foundation Model for Zero-Shot Metric Depth and Surface Normal Estimation","authors":"Mu Hu;Wei Yin;Chi Zhang;Zhipeng Cai;Xiaoxiao Long;Hao Chen;Kaixuan Wang;Gang Yu;Chunhua Shen;Shaojie Shen","doi":"10.1109/TPAMI.2024.3444912","DOIUrl":"10.1109/TPAMI.2024.3444912","url":null,"abstract":"We introduce Metric3D v2, a geometric foundation model designed for zero-shot metric depth and surface normal estimation from single images, critical for accurate 3D recovery. Depth and normal estimation, though complementary, present distinct challenges. State-of-the-art monocular depth methods achieve zero-shot generalization through affine-invariant depths, but fail to recover real-world metric scale. Conversely, current normal estimation techniques struggle with zero-shot performance due to insufficient labeled data. We propose targeted solutions for both metric depth and normal estimation. For metric depth, we present a canonical camera space transformation module that resolves metric ambiguity across various camera models and large-scale datasets, which can be easily integrated into existing monocular models. For surface normal estimation, we introduce a joint depth-normal optimization module that leverages diverse data from metric depth, allowing normal estimators to improve beyond traditional labels. Our model, trained on over 16 million images from thousands of camera models with varied annotations, excels in zero-shot generalization to new camera settings. As shown in Fig. 1, It ranks the 1st in multiple zero-shot and standard benchmarks for metric depth and surface normal prediction. Our method enables the accurate recovery of metric 3D structures on randomly collected internet images, paving the way for plausible single-image metrology. Our model also relieves the scale drift issues of monocular-SLAM (Fig. 3), leading to high-quality metric scale dense mapping. Such applications highlight the versatility of Metric3D v2 models as geometric foundation models.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10579-10596"},"PeriodicalIF":0.0,"publicationDate":"2024-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141992477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0