Pub Date: 2024-08-16 | DOI: 10.1109/TPAMI.2024.3444912
Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, Shaojie Shen
We introduce Metric3D v2, a geometric foundation model for zero-shot metric depth and surface normal estimation from a single image, which is crucial for metric 3D recovery. While depth and normal are geometrically related and highly complementary, they present distinct challenges. State-of-the-art (SoTA) monocular depth methods achieve zero-shot generalization by learning affine-invariant depths, which cannot recover real-world metrics. Meanwhile, SoTA normal estimation methods have limited zero-shot performance due to the lack of large-scale labeled data. To tackle these issues, we propose solutions for both metric depth estimation and surface normal estimation. For metric depth estimation, we show that the key to a zero-shot single-view model lies in resolving the metric ambiguity from various camera models and large-scale data training. We propose a canonical camera space transformation module, which explicitly addresses the ambiguity problem and can be effortlessly plugged into existing monocular models. For surface normal estimation, we propose a joint depth-normal optimization module to distill diverse data knowledge from metric depth, enabling normal estimators to learn beyond normal labels. Equipped with these modules, our depth-normal models can be stably trained with over 16 million images from thousands of camera models with annotations of different types, resulting in zero-shot generalization to in-the-wild images with unseen camera settings. Our method currently ranks first on various zero-shot and non-zero-shot benchmarks for metric depth, affine-invariant depth, and surface-normal prediction, as shown in Fig. 1. Notably, we surpass the recent MarigoldDepth and DepthAnything on various depth benchmarks including NYUv2 and KITTI. Our method enables the accurate recovery of metric 3D structures on randomly collected internet images, paving the way for plausible single-image metrology. 
The potential benefits extend to downstream tasks, which can be significantly improved by simply plugging in our model. For example, our model relieves the scale-drift issue of monocular SLAM (Fig. 3), leading to high-quality metric-scale dense mapping. These applications highlight the versatility of Metric3D v2 models as geometric foundation models. Our project page is at https://JUGGHM.github.io/Metric3Dv2.
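The canonical camera transformation can be illustrated with a small sketch (an assumption-laden simplification of the idea, not the paper's actual module): rescaling depth by the ratio of a chosen canonical focal length to the image's focal length removes the metric ambiguity across cameras, and the inverse map restores metric depth at inference. The canonical focal value of 1000 px and the label-scaling variant below are illustrative choices.

```python
def to_canonical_depth(depth_m, focal_px, canonical_focal_px=1000.0):
    """Training side: map metric depth into the canonical camera space,
    so images from cameras with different focal lengths become comparable."""
    return depth_m * canonical_focal_px / focal_px

def to_metric_depth(canonical_depth, focal_px, canonical_focal_px=1000.0):
    """Inference side: undo the transform to recover real-world metric depth."""
    return canonical_depth * focal_px / canonical_focal_px

# Round trip: a 5 m point seen through a 500 px focal camera
d_c = to_canonical_depth(5.0, 500.0)   # 10.0 in canonical space
d_m = to_metric_depth(d_c, 500.0)      # back to 5.0 m
```

An equivalent variant rescales the input image instead of the depth labels; either way the network only ever sees one virtual camera.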
Title: Metric3D v2: A Versatile Monocular Geometric Foundation Model for Zero-shot Metric Depth and Surface Normal Estimation
Directly regressing the non-rigid shape and camera pose from individual 2D frames is ill-suited to the Non-Rigid Structure-from-Motion (NRSfM) problem. This frame-by-frame 3D reconstruction pipeline overlooks the inherent spatial-temporal nature of NRSfM, i.e., reconstructing the 3D sequence from the input 2D sequence. In this paper, we propose to solve deep sparse NRSfM from a sequence-to-sequence translation perspective, where the input 2D keypoint sequence is taken as a whole to reconstruct the corresponding 3D keypoint sequence in a self-supervised manner. First, we apply a shape-motion predictor to the input sequence to obtain an initial sequence of shapes and corresponding motions. Then, we propose the Context Layer, which enables the deep learning framework to impose overall constraints on sequences based on the structural characteristics of non-rigid sequences. The Context Layer builds modules that impose self-expressiveness regularity on non-rigid sequences, with multi-head attention (MHA) as the core and temporal encoding alongside it; the two act together to constrain non-rigid sequences within the deep framework. Experimental results across datasets such as Human3.6M, CMU Mocap, and InterHand prove the superiority of our framework. The code will be made publicly available.
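The Context Layer's two ingredients, attention across the whole sequence and temporal encoding, can be sketched in a minimal single-head, weight-free form (a stand-in for the paper's MHA-based module; the feature sizes and function names are illustrative, not the authors' code):

```python
import numpy as np

def temporal_encoding(T, d):
    """Sinusoidal temporal encoding (Transformer-style), shape (T, d)."""
    pos = np.arange(T)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def self_attention(x):
    """Plain dot-product self-attention across the time axis; x is (T, d).
    Each frame's output is a weighted mix of all frames, which is how the
    sequence is processed as a whole rather than frame by frame."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ x

T, d = 8, 16
seq = np.random.randn(T, d)                      # stand-in for per-frame keypoint features
ctx = self_attention(seq + temporal_encoding(T, d))
```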
Title: Deep Non-rigid Structure-from-Motion: A Sequence-to-Sequence Translation Perspective
Authors: Hui Deng, Tong Zhang, Yuchao Dai, Jiawei Shi, Yiran Zhong, Hongdong Li
Pub Date: 2024-08-16 | DOI: 10.1109/TPAMI.2024.3443922
Pub Date: 2024-08-16 | DOI: 10.1109/TPAMI.2024.3444910
Tao Xiong, Yachao Li, Mengdao Xing
When the locations of non-zero samples are known, the Moore-Penrose inverse (MPI) can be used for data recovery in compressive sensing (CS). First, the prior from the locations is used to shrink the measurement matrix in CS. The data can then be recovered by applying the MPI of this shrunken matrix. We also prove that the recovery results of the original CS formulation and our MPI-based method are mathematically identical. Based on this finding, a novel sidelobe-reduction method for synthetic aperture radar (SAR) and polarimetric SAR (POLSAR) images is studied. The aim of sidelobe reduction is to recover the samples within the mainlobes and suppress those within the sidelobes. In our study, the prior from spatial variant apodization (SVA) is used to determine the locations of the mainlobes and the sidelobes. With CS, the mainlobe area can be well recovered; samples within the sidelobe areas are recovered using background fusion. Our method is suitable for acquired data of large sizes. The performance of the proposed algorithm is evaluated with acquired spaceborne SAR and airborne POLSAR data. In our experiments, we use 1 m spaceborne SAR data of size 10000 (samples) × 10000 (samples) and 0.3 m POLSAR data of size 10000 (samples) × 26000 (samples) for sidelobe suppression. We also verified that our method does not affect the polarization signatures. The effectiveness of the sidelobe suppression is qualitatively examined, with satisfactory results.
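The core recovery step can be sketched directly in NumPy (a toy illustration of the idea, not the authors' implementation): with the support known, the measurement matrix is shrunk to the supported columns, and the Moore-Penrose inverse of the shrunken matrix recovers the samples exactly whenever the number of measurements exceeds the sparsity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 64, 24, 5                  # signal length, measurements, sparsity
support = rng.choice(n, size=k, replace=False)
x = np.zeros(n)
x[support] = rng.standard_normal(k)  # k-sparse signal with known support
Phi = rng.standard_normal((m, n))    # CS measurement matrix
y = Phi @ x                          # compressed measurements

# Shrink the measurement matrix to the known support, then apply the
# Moore-Penrose inverse of the shrunken matrix.
Phi_s = Phi[:, support]
x_hat = np.zeros(n)
x_hat[support] = np.linalg.pinv(Phi_s) @ y

print(np.allclose(x_hat, x))  # → True
```

Because the shrunken matrix generically has full column rank when m ≥ k, the pseudoinverse solves the resulting least-squares problem exactly, which is the mathematical equivalence the abstract refers to.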
Title: Quality improvement synthetic aperture radar (SAR) images using compressive sensing (CS) with Moore-Penrose inverse (MPI) and prior from spatial variant apodization (SVA)
Pub Date: 2024-08-15 | DOI: 10.1109/TPAMI.2024.3443916
Zeyang Li, Chuxiong Hu, Yunan Wang, Yujie Yang, Shengbo Eben Li
Reinforcement learning (RL) agents are vulnerable to adversarial disturbances, which can deteriorate task performance or violate safety specifications. Existing methods either address safety requirements under the assumption of no adversary (e.g., safe RL) or only focus on robustness against performance adversaries (e.g., robust RL). Learning one policy that is both safe and robust under any adversary remains a challenging open problem. The difficulty lies in tackling two intertwined aspects in the worst case: feasibility and optimality. Optimality is only valid inside the feasible region (i.e., the robust invariant set), while identifying the maximal feasible region relies on learning the optimal policy. To address this issue, we propose a systematic framework to unify safe RL and robust RL, including the problem formulation, iteration scheme, convergence analysis, and practical algorithm design. The unification is built upon constrained two-player zero-sum Markov games, in which the objective for the protagonist is twofold. For states inside the maximal robust invariant set, the goal is to pursue rewards under the condition of guaranteed safety; for states outside it, the goal is to reduce the extent of constraint violation. A dual policy iteration scheme is proposed, which simultaneously optimizes a task policy and a safety policy. We prove that the iteration scheme converges to the optimal task policy, which maximizes the twofold objective in the worst case, and the optimal safety policy, which stays as far from the safety boundary as possible. The convergence of the safety policy is established by exploiting the monotone contraction property of safety self-consistency operators, and that of the task policy depends on the transformation of safety constraints into state-dependent action spaces. 
By adding two adversarial networks (one for the safety guarantee and the other for task performance), we propose a practical deep RL algorithm for constrained zero-sum Markov games, called dually robust actor-critic (DRAC). Evaluations on safety-critical benchmarks demonstrate that DRAC achieves high performance and persistent safety under all scenarios (no adversary, safety adversary, performance adversary), outperforming all baselines by a large margin.
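The worst-case (minimax) flavor of the zero-sum formulation can be illustrated with a tiny robust value iteration on a toy chain (a generic sketch of minimax dynamic programming, not the paper's DRAC algorithm or its constrained objective; the dynamics, rewards, and discount are invented for the example):

```python
import numpy as np

# Toy zero-sum chain: the protagonist picks a in {0,1,2} (push right),
# the adversary picks w in {0,1} (push left); next state = clip(s+a-w, 0, 2).
# State 2 is the goal (reward 1, absorbing); other states give reward 0.
gamma = 0.5
states, actions, adversary = range(3), range(3), range(2)

def step(s, a, w):
    return 2 if s == 2 else max(0, min(2, s + a - w))

def reward(s):
    return 1.0 if s == 2 else 0.0

# Minimax value iteration: max over the protagonist, min over the adversary.
V = np.zeros(3)
for _ in range(100):
    V = np.array([
        max(min(reward(s) + gamma * V[step(s, a, w)] for w in adversary)
            for a in actions)
        for s in states
    ])
# Fixed point: V = [0.5, 1.0, 2.0] -- the value each state retains
# even against the worst-case disturbance.
```

The paper's scheme layers a safety policy and constraint handling on top of this kind of worst-case backup; the sketch only shows the minimax core.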
Title: Safe Reinforcement Learning with Dual Robustness
Pub Date: 2024-08-15 | DOI: 10.1109/TPAMI.2024.3444002
Chengli Tan, Jiangshe Zhang, Junmin Liu, Yihong Gong
Lookahead is a popular stochastic optimizer that can accelerate the training process of deep neural networks. However, the solutions found by Lookahead often generalize worse than those found by its base optimizers, such as SGD and Adam. To address this issue, we propose Sharpness-Aware Lookahead (SALA), a novel optimizer that aims to identify flat minima that generalize well. SALA divides the training process into two stages. In the first stage, the direction towards flat regions is determined by leveraging a quadratic approximation of the optimization trajectory, without incurring any extra computational overhead. In the second stage, it is instead determined by Sharpness-Aware Minimization (SAM), which is particularly effective in improving generalization at the terminal phase of training. In contrast to Lookahead, SALA retains the benefits of accelerated convergence while also enjoying superior generalization performance compared to the base optimizer. Theoretical analysis of the expected excess risk, as well as empirical results on canonical neural network architectures and datasets, demonstrates the advantages of SALA over Lookahead. It is noteworthy that with approximately 25% more computational overhead than the base optimizer, SALA can achieve the same generalization performance as SAM, which requires twice the training budget of the base optimizer.
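The SAM update used in the second stage can be sketched in a few lines (a generic SAM step applied to a toy quadratic; the learning rate, radius rho, and loss are illustrative, and this is not SALA itself):

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One Sharpness-Aware Minimization step: perturb the weights to the
    approximate worst case within an L2 ball of radius rho, then descend
    using the gradient taken at that perturbed point."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent direction
    g_sharp = grad_fn(w + eps)                   # gradient at the worst case
    return w - lr * g_sharp

# Toy quadratic loss L(w) = 0.5 * ||w||^2, whose gradient is w.
grad_fn = lambda w: w
w = np.array([3.0, -4.0])
for _ in range(200):
    w = sam_step(w, grad_fn)
# The iterate settles into a small neighborhood of the minimum at 0.
```

SAM costs two gradient evaluations per step, which is why it needs roughly twice the training budget of the base optimizer; SALA's two-stage design is what brings that overhead down to about 25%.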
Title: Sharpness-aware Lookahead for Accelerating Convergence and Improving Generalization
Sound source localization aims to localize the objects emitting sound in visual scenes. Recent works that obtain impressive results typically rely on contrastive learning. However, the common practice of randomly sampling negatives in prior art can lead to the false-negative issue, where sounds semantically similar to the visual instance are sampled as negatives and incorrectly pushed away from the visual anchor/query. The resulting misalignment of audio and visual features can yield inferior performance. To address this issue, we propose a novel audio-visual learning framework instantiated with two individual learning schemes: self-supervised predictive learning (SSPL) and semantic-aware contrastive learning (SACL). SSPL explores image-audio positive pairs alone to discover semantically coherent similarities between audio and visual features, and a predictive coding module for feature alignment is introduced to facilitate the positive-only learning. In this regard, SSPL acts as a negative-free method that eliminates false negatives. By contrast, SACL is designed to compact visual features and remove false negatives, providing a reliable visual anchor and audio negatives for contrast. Different from SSPL, SACL releases the potential of audio-visual contrastive learning, offering an effective alternative to achieve the same goal. Comprehensive experiments demonstrate the superiority of our approach over the state of the art. Furthermore, we highlight the versatility of the learned representation by extending the approach to audio-visual event classification and object detection tasks. Code and models are available at: https://github.com/zjsong/SACL.
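The false-negative elimination idea behind SACL can be sketched as an InfoNCE-style loss that filters suspect negatives before contrast (the threshold, temperature, and the source of the semantic similarities are assumptions for illustration; the paper's actual scheme differs in detail):

```python
import numpy as np

def contrastive_loss(anchor, positive, negatives, neg_sims,
                     sim_thresh=0.8, tau=0.07):
    """InfoNCE-style loss that drops likely false negatives: any candidate
    whose semantic similarity to the anchor exceeds sim_thresh is excluded
    from the denominator instead of being pushed away.

    anchor, positive: feature vectors; negatives: (N, d) candidate features;
    neg_sims: (N,) semantic similarities from some auxiliary embedding."""
    keep = neg_sims < sim_thresh
    negs = negatives[keep]
    logits = np.concatenate([[anchor @ positive], negs @ anchor]) / tau
    logits -= logits.max()                      # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())
```

With every candidate filtered out, only the positive remains in the denominator and the loss collapses to zero, i.e., the anchor is no longer pushed away from semantically matching audio.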
Title: Enhancing Sound Source Localization via False Negative Elimination
Authors: Zengjie Song, Jiangshe Zhang, Yuxi Wang, Junsong Fan, Zhaoxiang Zhang
Pub Date: 2024-08-15 | DOI: 10.1109/TPAMI.2024.3444029
Pub Date: 2024-08-14 | DOI: 10.1109/TPAMI.2024.3443335
Yifan Zhang, Zhiyu Zhu, Junhui Hou, Dapeng Wu
The Detection Transformer (DETR) has revolutionized the design of CNN-based object detection systems, showcasing impressive performance. However, its potential in the domain of multi-frame 3D object detection remains largely unexplored. In this paper, we present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection by addressing three key aspects specifically tailored to this task. First, to model inter-object spatial interactions and complex temporal dependencies, we introduce a spatial-temporal graph attention network, which represents queries as nodes in a graph and enables effective modeling of object interactions within a social context. Second, to compensate for hard cases missing from the encoder's proposals in the current frame, we incorporate the output of the previous frame to initialize the query input of the decoder. Finally, the network struggles to distinguish the positive query from other highly similar queries that are not the best match; such queries are insufficiently suppressed and turn into redundant prediction boxes. To address this issue, our proposed IoU regularization term encourages similar queries to become distinct during refinement. Through extensive experiments, we demonstrate the effectiveness of our approach in handling challenging scenarios, while incurring only a minor additional computational overhead. The code is publicly available at https://github.com/Eaphan/STEMD.
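The IoU regularization idea can be sketched as an off-diagonal overlap penalty among query boxes (an illustrative 2D axis-aligned form; the paper's exact term and box parameterization may differ):

```python
import numpy as np

def pairwise_iou(boxes):
    """Axis-aligned IoU for boxes given as (x1, y1, x2, y2), shape (N, 4)."""
    x1 = np.maximum(boxes[:, None, 0], boxes[None, :, 0])
    y1 = np.maximum(boxes[:, None, 1], boxes[None, :, 1])
    x2 = np.minimum(boxes[:, None, 2], boxes[None, :, 2])
    y2 = np.minimum(boxes[:, None, 3], boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area[:, None] + area[None, :] - inter)

def iou_regularizer(boxes):
    """Mean IoU between *different* query boxes (off-diagonal entries).
    Minimizing this pushes near-duplicate queries apart, so they do not
    collapse into redundant predictions for the same object."""
    iou = pairwise_iou(boxes)
    n = len(boxes)
    return (iou.sum() - np.trace(iou)) / max(n * (n - 1), 1)
```

Two identical boxes give a penalty of 1, fully disjoint boxes give 0, so the term only acts on queries that actually overlap.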
Title: Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection
Pub Date: 2024-08-14 | DOI: 10.1109/TPAMI.2024.3443141
Ming Jin, Huan Yee Koh, Qingsong Wen, Daniele Zambon, Cesare Alippi, Geoffrey I Webb, Irwin King, Shirui Pan
Time series are the primary data type used to record dynamic system measurements and are generated in great volume by both physical sensors and online processes (virtual sensors). Time series analytics is therefore crucial to unlocking the wealth of information implicit in available data. With recent advancements in graph neural networks (GNNs), there has been a surge in GNN-based approaches for time series analysis. These approaches can explicitly model inter-temporal and inter-variable relationships, which traditional and other deep neural network-based methods struggle to do. In this survey, we provide a comprehensive review of graph neural networks for time series analysis (GNN4TS), encompassing four fundamental dimensions: forecasting, classification, anomaly detection, and imputation. Our aim is to guide designers and practitioners in understanding, building applications with, and advancing research on GNN4TS. We first provide a comprehensive task-oriented taxonomy of GNN4TS. Then, we present and discuss representative research works and introduce mainstream applications of GNN4TS. A comprehensive discussion of potential future research directions completes the survey. This survey, for the first time, brings together a vast array of knowledge on GNN-based time series research, highlighting the foundations, practical applications, and opportunities of graph neural networks for time series analysis.
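The inter-variable modeling that distinguishes GNN4TS methods can be sketched as one message-passing step over the variable graph (a minimal generic graph convolution for intuition, not any specific surveyed method; shapes and the tanh nonlinearity are illustrative):

```python
import numpy as np

def graph_conv_step(X, A, W):
    """One message-passing step over the variable graph.

    X: (N, T) array of N series with T time steps each.
    A: (N, N) adjacency encoding inter-variable relationships.
    W: (T, T) weight matrix mixing temporal features.
    Row-normalizing A (with self-loops) averages each node's neighborhood
    before the temporal mix -- the basic block GNN4TS methods stack and gate."""
    A_hat = A + np.eye(len(A))                    # add self-loops
    D_inv = 1.0 / A_hat.sum(axis=1, keepdims=True)
    return np.tanh(D_inv * (A_hat @ X) @ W)

# Three correlated series on a chain graph 0 -- 1 -- 2
X = np.arange(12.0).reshape(3, 4)
A = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
H = graph_conv_step(X, A, np.eye(4))
```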
A Survey on Graph Neural Networks for Time Series: Forecasting, Classification, Imputation, and Anomaly Detection.
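The core idea the survey highlights, letting each series explicitly exchange information with related series over a variable graph, can be illustrated with a toy sketch. This is not code from any surveyed work; `variable_graph_conv` and its row-normalization choice are assumptions made purely for illustration:

```python
import numpy as np

def variable_graph_conv(x, adj):
    """One graph-convolution step over the variable graph: at each
    time step, every series aggregates the degree-normalized values
    of its neighboring series (inter-variable modeling)."""
    deg = adj.sum(axis=1)
    a_norm = adj / np.maximum(deg, 1e-8)[:, None]  # row-normalized adjacency
    return x @ a_norm.T                            # (T, n) -> (T, n)

# Two fully connected series: each time step swaps in the neighbor's value.
x = np.array([[1.0, 3.0],
              [2.0, 4.0]])
adj = np.array([[0.0, 1.0],
                [1.0, 0.0]])
h = variable_graph_conv(x, adj)  # [[3., 1.], [4., 2.]]
```

In a real GNN4TS model this spatial step would be interleaved with a temporal module (e.g., recurrence or temporal convolution) and learnable weights; the sketch shows only the inter-variable aggregation.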
Pub Date : 2024-08-14DOI: 10.1109/TPAMI.2023.3347082
Feihu Huang, Shangqian Gao, Jian Pei, Heng Huang
Zeroth-order (a.k.a. derivative-free) methods are a class of effective optimization methods for solving complex machine learning problems where gradients of the objective functions are unavailable or computationally prohibitive. Although many zeroth-order methods have been developed recently, these approaches still have two main drawbacks: 1) high function query complexity; 2) poor suitability for problems with complex penalties and constraints. To address these drawbacks, in this paper we propose a class of faster zeroth-order stochastic alternating direction method of multipliers (ADMM) methods (ZO-SPIDER-ADMM) to solve nonconvex finite-sum problems with multiple nonsmooth penalties. Moreover, we prove that the ZO-SPIDER-ADMM methods can achieve a lower function query complexity of [Formula: see text] for finding an ϵ-stationary point, which improves on the existing best nonconvex zeroth-order ADMM methods by a factor of [Formula: see text], where n and d denote the sample size and data dimension, respectively. At the same time, we propose a class of faster zeroth-order online ADMM methods (ZOO-ADMM+) to solve nonconvex online problems with multiple nonsmooth penalties. We also prove that the proposed ZOO-ADMM+ methods achieve a lower function query complexity of [Formula: see text], which improves the existing best result by a factor of [Formula: see text]. Extensive experimental results on structured adversarial attacks against black-box deep neural networks demonstrate the efficiency of our new algorithms.
Nonconvex Zeroth-Order Stochastic ADMM Methods with Lower Function Query Complexity.
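The general zeroth-order principle behind such methods, estimating gradients from function queries alone, can be sketched with a generic two-point random-direction estimator. This is a hedged illustration only; `zo_gradient`, its Gaussian smoothing directions, and the parameter choices are assumptions, not the paper's ZO-SPIDER-ADMM algorithm:

```python
import numpy as np

def zo_gradient(f, x, mu=1e-4, num_dirs=10, seed=None):
    """Two-point zeroth-order gradient estimate of f at x.

    Averages directional finite differences along random Gaussian
    directions; each direction costs exactly two function queries,
    which is why query complexity is the key metric for such methods."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    g = np.zeros(d)
    for _ in range(num_dirs):
        u = rng.standard_normal(d)
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / num_dirs

# Sanity check on f(x) = ||x||^2, whose true gradient is 2x.
f = lambda x: np.dot(x, x)
x = np.array([1.0, -2.0, 3.0])
g = zo_gradient(f, x, num_dirs=2000, seed=0)  # approximately [2., -4., 6.]
```

Variance-reduced schemes such as SPIDER reuse such estimates across iterations to drive the query count down, which is the lever the paper's complexity bounds improve.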
Pub Date : 2024-08-14DOI: 10.1109/TPAMI.2024.3443110
Chao Chen, Haoyu Geng, Nianzu Yang, Xiaokang Yang, Junchi Yan
Dynamic graphs arise in various real-world applications, and modeling their dynamics in the continuous-time domain is often preferred for its flexibility. This paper aims to design an easy-to-use pipeline (EasyDGL, named in part for its implementation on the DGL toolkit) composed of three modules with both strong fitting ability and interpretability, namely encoding, training, and interpreting: i) a temporal point process (TPP) modulated attention architecture to endow continuous-time resolution with the coupled spatiotemporal dynamics of the graph with edge-addition events; ii) a principled loss composed of a task-agnostic TPP posterior maximization based on observed events, and a task-aware loss with a masking strategy over the dynamic graph, where the tasks include dynamic link prediction, dynamic node classification, and node traffic forecasting; iii) interpretation of the outputs (e.g., representations and predictions) with scalable perturbation-based quantitative analysis in the graph Fourier domain, which comprehensively reflects the behavior of the learned model. Empirical results on public benchmarks show superior performance on time-conditioned predictive tasks; in particular, EasyDGL can effectively quantify the predictive power of the frequency content that a model learns from evolving graph data.
EasyDGL: Encode, Train and Interpret for Continuous-Time Dynamic Graph Learning.
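The graph Fourier domain in which the interpretation module operates can be illustrated with a minimal sketch: project a node signal onto the eigenbasis of the symmetric normalized Laplacian, whose eigenvalues act as graph frequencies. `graph_fourier_transform` is a hypothetical helper for illustration, not EasyDGL's API:

```python
import numpy as np

def graph_fourier_transform(adj, signal):
    """Spectral coefficients of a node signal w.r.t. the symmetric
    normalized graph Laplacian L = I - D^{-1/2} A D^{-1/2}.

    adj: (n, n) symmetric adjacency matrix; signal: (n,) node values.
    Returns (eigenvalues, coefficients); eigenvalues are the graph
    frequencies, coefficients the signal's energy at each frequency."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    mask = deg > 0
    d_inv_sqrt[mask] = deg[mask] ** -0.5
    lap = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    evals, evecs = np.linalg.eigh(lap)  # frequencies and Fourier basis
    coeffs = evecs.T @ signal           # spectral coefficients
    return evals, coeffs

# 3-node path graph; the eigenbasis is orthonormal, so energy is preserved.
adj = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])
signal = np.array([1.0, 2.0, 3.0])
evals, coeffs = graph_fourier_transform(adj, signal)
```

A perturbation-based analysis in this domain would perturb inputs, recompute such coefficients for the model's outputs, and compare how much predictive power sits in each frequency band.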