Learning Signed Hyper Surfaces for Oriented Point Cloud Normal Estimation
Pub Date: 2024-07-19 | DOI: 10.1109/TPAMI.2024.3431221
Qing Li, Huifang Feng, Kanle Shi, Yue Gao, Yi Fang, Yu-Shen Liu, Zhizhong Han
We propose a novel method called SHS-Net for point cloud normal estimation by learning signed hyper surfaces, which can accurately predict normals with globally consistent orientation from various point clouds. Almost all existing methods estimate oriented normals through a two-stage pipeline, i.e., unoriented normal estimation followed by normal orientation, with each step implemented by a separate algorithm. However, such methods are sensitive to parameter settings, resulting in poor results on point clouds with noise, density variations and complex geometries. In this work, we introduce signed hyper surfaces (SHS), parameterized by multi-layer perceptron (MLP) layers, to learn to estimate oriented normals from point clouds in an end-to-end manner. The signed hyper surfaces are implicitly learned in a high-dimensional feature space where local and global information is aggregated. Specifically, we introduce a patch encoding module and a shape encoding module to encode a 3D point cloud into a local latent code and a global latent code, respectively. Then, an attention-weighted normal prediction module is proposed as a decoder, which takes the local and global latent codes as input to predict oriented normals. Experimental results show that our algorithm outperforms state-of-the-art methods in both unoriented and oriented normal estimation.
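To make the described pipeline concrete, the following is a minimal PyTorch sketch of the encoder-decoder layout the abstract outlines: a patch encoder producing a local code, a shape encoder producing a global code, and an attention-weighted decoder predicting an oriented normal. All module names, sizes, and pooling choices are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SHSNetSketch(nn.Module):
    """Illustrative sketch: local patch code + global shape code -> oriented normal."""
    def __init__(self, dim=128):
        super().__init__()
        # Patch encoder: per-point MLP over a local neighborhood.
        self.patch_encoder = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        # Shape encoder: per-point MLP over the whole cloud, max-pooled to a global code.
        self.shape_encoder = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        # Attention weights over patch points, conditioned on both codes.
        self.attn = nn.Linear(2 * dim, 1)
        self.decoder = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 3))

    def forward(self, patch, cloud):
        # patch: (B, K, 3) local neighborhood; cloud: (B, N, 3) full point cloud.
        local_feat = self.patch_encoder(patch)                      # (B, K, dim)
        global_code = self.shape_encoder(cloud).max(dim=1).values   # (B, dim)
        g = global_code.unsqueeze(1).expand(-1, patch.shape[1], -1)
        fused = torch.cat([local_feat, g], dim=-1)                  # (B, K, 2*dim)
        w = torch.softmax(self.attn(fused), dim=1)                  # attention over patch points
        pooled = (w * fused).sum(dim=1)                             # attention-weighted code
        n = self.decoder(pooled)                                    # unnormalized oriented normal
        return torch.nn.functional.normalize(n, dim=-1)

normals = SHSNetSketch()(torch.randn(2, 64, 3), torch.randn(2, 1024, 3))  # (2, 3)
```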
Gradient Inversion Attacks: Impact Factors Analyses and Privacy Enhancement
Pub Date: 2024-07-18 | DOI: 10.1109/TPAMI.2024.3430533
Zipeng Ye, Wenjian Luo, Qi Zhou, Zhenqian Zhu, Yuhui Shi, Yan Jia
Gradient inversion attacks (GIAs), which aim to reconstruct the private training data of clients (the parties participating in distributed training) from shared parameters, have posed significant challenges to the emerging paradigm of distributed learning. To counteract GIAs, a large number of privacy-preserving methods for distributed learning scenarios have emerged. However, these methods have significant limitations: they either compromise the usability of the global model or consume substantial additional computational resources. Furthermore, despite the extensive efforts dedicated to defense methods, the underlying causes of data leakage in distributed learning have still not been thoroughly investigated. Therefore, this paper tries to reveal the potential reasons behind the success of existing GIAs, explore variations in the robustness of models against GIAs during the training process, and investigate the impact of different model structures on attack performance. Based on these explorations and analyses, this paper proposes a plug-and-play GIA defense method that augments the training data with a designed vicinal distribution. Extensive empirical experiments demonstrate that this easy-to-implement method can ensure a basic level of privacy without compromising the usability of the global model.
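The abstract does not specify the designed vicinal distribution; as an illustration of the general idea only, the sketch below draws training samples from a generic vicinal distribution (mixup-style interpolation plus Gaussian neighborhood noise, in the spirit of vicinal risk minimization). The function name and hyperparameters are hypothetical.

```python
import torch

def vicinal_augment(x, y, sigma=0.1, alpha=0.2):
    """Draw training samples from a vicinal distribution around (x, y).

    Combines mixup-style interpolation with an additive Gaussian vicinity;
    both the form and the hyperparameters are illustrative, not the paper's.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    idx = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[idx]          # interpolate sample pairs
    y_mix = lam * y + (1 - lam) * y[idx]          # interpolate (one-hot) labels
    return x_mix + sigma * torch.randn_like(x_mix), y_mix

x = torch.randn(32, 3, 32, 32)                    # a batch of images
y = torch.nn.functional.one_hot(torch.randint(0, 10, (32,)), 10).float()
x_aug, y_aug = vicinal_augment(x, y)
```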
Physical Adversarial Attack Meets Computer Vision: A Decade Survey
Pub Date: 2024-07-18 | DOI: 10.1109/TPAMI.2024.3430860
Hui Wei, Hao Tang, Xuemei Jia, Zhixiang Wang, Hanxun Yu, Zhubo Li, Shinichi Satoh, Luc Van Gool, Zheng Wang
Despite the impressive achievements of Deep Neural Networks (DNNs) in computer vision, their vulnerability to adversarial attacks remains a critical concern. Extensive research has demonstrated that incorporating sophisticated perturbations into input images can lead to a catastrophic degradation in DNNs' performance. This perplexing phenomenon exists not only in the digital space but also in the physical world. Consequently, it becomes imperative to evaluate the security of DNN-based systems to ensure their safe deployment in real-world scenarios, particularly in security-sensitive applications. To facilitate a profound understanding of this topic, this paper presents a comprehensive overview of physical adversarial attacks. First, we distill four general steps for launching physical adversarial attacks. Building upon this foundation, we uncover the pervasive role of artifacts carrying adversarial perturbations in the physical world; these artifacts influence each step. To denote them, we introduce a new term: adversarial medium. Then, we take a first step toward systematically evaluating the performance of physical adversarial attacks, organized around the adversarial medium. Our proposed evaluation metric, hiPAA, comprises six perspectives: Effectiveness, Stealthiness, Robustness, Practicability, Aesthetics, and Economics. We also provide comparative results across task categories, together with insightful observations and suggestions for future research directions.
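The abstract names hiPAA's six perspectives but not its aggregation rule; a simple weighted-sum aggregation, shown below purely as an assumption, illustrates how such a multi-perspective score could be combined. The weights and the [0, 1] normalization of per-axis scores are hypothetical.

```python
# Hypothetical aggregation of the six hiPAA perspectives. The abstract lists the
# axes but not the scoring rule, so the weights and the weighted sum are assumptions.
HIPAA_AXES = ("effectiveness", "stealthiness", "robustness",
              "practicability", "aesthetics", "economics")

def hipaa_score(scores, weights=None):
    """Each per-axis score is assumed normalized to [0, 1]."""
    weights = weights or {axis: 1 / len(HIPAA_AXES) for axis in HIPAA_AXES}
    return sum(weights[a] * scores[a] for a in HIPAA_AXES)

print(hipaa_score({a: 0.8 for a in HIPAA_AXES}))  # 0.8 with uniform weights
```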
STQD-Det: Spatio-Temporal Quantum Diffusion Model for Real-time Coronary Stenosis Detection in X-ray Angiography
Pub Date: 2024-07-18 | DOI: 10.1109/TPAMI.2024.3430839
Xinyu Li, Danni Ai, Hong Song, Jingfan Fan, Tianyu Fu, Deqiang Xiao, Yining Wang, Jian Yang
Detecting coronary stenosis accurately in X-ray angiography (XRA) is important for diagnosing and treating coronary artery disease (CAD). However, challenges arise from factors such as breathing and heart motion, poor imaging quality, and complex vascular structures, making it difficult to identify stenoses quickly and precisely. In this study, we propose a Quantum Diffusion Model with Spatio-Temporal Feature Sharing for real-time stenosis detection (STQD-Det). Our framework consists of two modules: a Sequential Quantum Noise Boxes module and a spatio-temporal feature module. To evaluate the effectiveness of the method, we conducted 4-fold cross-validation on a dataset of 233 XRA sequences. Our approach achieved an F1 score of 92.39% with a real-time processing speed of 25.08 frames per second, outperforming 17 state-of-the-art methods. The experimental results show that the proposed method can perform stenosis detection quickly and accurately.
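As a sketch of the evaluation protocol only (not of the model), the snippet below mirrors the reported 4-fold cross-validation over 233 sequences using scikit-learn; placeholder labels and predictions stand in for the actual detector, and sequence-level F1 is a simplification of detection-level scoring.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score

# 4-fold cross-validation over 233 XRA sequences, as in the reported protocol.
sequences = np.arange(233)
labels = np.random.randint(0, 2, size=233)    # placeholder per-sequence labels

f1s = []
for train_idx, test_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(sequences):
    # model = train_detector(sequences[train_idx])  # hypothetical training call
    preds = np.random.randint(0, 2, size=len(test_idx))  # placeholder predictions
    f1s.append(f1_score(labels[test_idx], preds))
print(f"mean F1 over folds: {np.mean(f1s):.4f}")
```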
Attention-Guided Low-Rank Tensor Completion
Pub Date: 2024-07-17 | DOI: 10.1109/TPAMI.2024.3429498
Truong Thanh Nhat Mai, Edmund Y Lam, Chul Lee
Low-rank tensor completion (LRTC) aims to recover missing data of high-dimensional structures from a limited set of observed entries. Despite recent significant successes, the original structures of data tensors are still not effectively preserved in LRTC algorithms, yielding less accurate restoration results. Moreover, LRTC algorithms often incur high computational costs, which hinder their applicability. In this work, we propose an attention-guided low-rank tensor completion (AGTC) algorithm, which can faithfully restore the original structures of data tensors using deep unfolding attention-guided tensor factorization. First, we formulate the LRTC task as a robust factorization problem based on low-rank and sparse error assumptions. Low-rank tensor recovery is guided by an attention mechanism to better preserve the structures of the original data. We also develop implicit regularizers to compensate for modeling inaccuracies. Then, we solve the optimization problem by employing an iterative technique. Finally, we design a multistage deep network by unfolding the iterative algorithm, where each stage corresponds to an iteration of the algorithm; at each stage, the optimization variables and regularizers are updated by closed-form solutions and learned deep networks, respectively. Experimental results for high dynamic range imaging and hyperspectral image restoration show that the proposed algorithm outperforms state-of-the-art algorithms.
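The following sketch illustrates the deep-unfolding structure the abstract describes, reduced to the matrix case for brevity: each stage pairs a closed-form update (singular value thresholding, the classical proximal step for the nuclear norm) with a small learned network acting as an implicit regularizer. The stage design, learnable threshold, and CNN regularizer are illustrative assumptions, not AGTC's actual components.

```python
import torch
import torch.nn as nn

def svt(x, tau):
    """Singular value thresholding: the closed-form proximal step for the nuclear norm."""
    u, s, vh = torch.linalg.svd(x, full_matrices=False)
    return u @ torch.diag_embed(torch.clamp(s - tau, min=0.0)) @ vh

class UnfoldedStage(nn.Module):
    """One stage of an unfolded completion network (matrix case for brevity):
    a closed-form low-rank update followed by a learned implicit regularizer."""
    def __init__(self, tau=0.1):
        super().__init__()
        self.tau = nn.Parameter(torch.tensor(tau))   # learnable threshold
        self.reg = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(8, 1, 3, padding=1))

    def forward(self, x, obs, mask):
        x = svt(x, self.tau)                         # closed-form solution
        x = x + self.reg(x[None, None])[0, 0]        # learned residual refinement
        return mask * obs + (1 - mask) * x           # keep observed entries fixed

mask = (torch.rand(64, 64) < 0.3).float()           # 30% observed entries
obs = mask * torch.randn(64, 64)
x = obs.clone()
for stage in [UnfoldedStage() for _ in range(3)]:   # multistage unfolding
    x = stage(x, obs, mask)
```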
Surface Reconstruction from Point Clouds: A Survey and a Benchmark
Pub Date: 2024-07-16 | DOI: 10.1109/TPAMI.2024.3429209
ZhangJin Huang, Yuxin Wen, ZiHao Wang, Jinjuan Ren, Kui Jia
Reconstruction of a continuous surface of a two-dimensional manifold from its raw, discrete point cloud observation is a long-standing problem in computer vision and graphics research. The problem is technically ill-posed, and becomes more difficult considering the various sensing imperfections that appear in point clouds obtained by practical depth scanning. In the literature, a rich set of methods has been proposed, and reviews of existing methods are also available. However, existing reviews fall short of thorough investigations on a common benchmark. The present paper aims to review and benchmark existing methods in the new era of deep learning surface reconstruction. To this end, we contribute a large-scale benchmarking dataset consisting of both synthetic and real-scanned data; the benchmark includes object- and scene-level surfaces and takes into account various sensing imperfections that are commonly encountered in practical depth scanning. We conduct thorough empirical studies by comparing existing methods on the constructed benchmark, paying special attention to the robustness of existing methods against various scanning imperfections; we also study how different methods generalize in terms of reconstructing complex surface shapes. Our studies help identify the conditions under which different methods work best, and suggest some empirical findings. For example, while deep learning methods are increasingly popular in the research community, our systematic studies suggest that, surprisingly, a few classical methods perform even better in terms of both robustness and generalization; our studies also suggest that the practical challenges of misalignment of point sets from multi-view scanning, missing surface points, and point outliers remain unsolved by all existing surface reconstruction methods. We expect the benchmark and our studies to be valuable both for practitioners and as guidance for new innovations in future research. We make the benchmark publicly accessible at https://Gorilla-Lab-SCUT.github.io/SurfaceReconstructionBenchmark.
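The abstract does not name the benchmark's evaluation metrics; as one standard example used when benchmarking surface reconstruction, the snippet below computes the symmetric Chamfer distance between point samples of a reconstructed surface and a ground-truth surface.

```python
import torch

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between two point sets (N, 3) and (M, 3):
    a common way to score a reconstructed surface against ground truth
    by sampling points from both."""
    d = torch.cdist(p, q)                               # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

recon = torch.randn(2048, 3)   # points sampled from the reconstructed surface
gt = torch.randn(2048, 3)      # points sampled from the ground-truth surface
print(chamfer_distance(recon, gt).item())
```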
Human-Centric Transformer for Domain Adaptive Action Recognition
Pub Date: 2024-07-16 | DOI: 10.1109/TPAMI.2024.3429387
Kun-Yu Lin, Jiaming Zhou, Wei-Shi Zheng
We study the domain adaptation task for action recognition, namely domain adaptive action recognition, which aims to effectively transfer action recognition ability from a label-sufficient source domain to a label-free target domain. Since actions are performed by humans, it is crucial to exploit human cues in videos when recognizing actions across domains. However, existing methods are prone to losing human cues and instead exploit the correlation between non-human contexts and associated actions for recognition, and contexts that are agnostic to actions reduce recognition performance in the target domain. To overcome this problem, we focus on uncovering human-centric action cues for domain adaptive action recognition, investigating two aspects of them: human cues and human-context interaction cues. Accordingly, our proposed Human-Centric Transformer (HCTransformer) develops a decoupled human-centric learning paradigm that explicitly concentrates on human-centric action cues in domain-invariant video feature learning. Our HCTransformer first conducts human-aware temporal modeling with a human encoder, aiming to avoid a loss of human cues during domain-invariant video feature learning. Then, with a Transformer-like architecture, HCTransformer exploits domain-invariant and action-correlated contexts through a context encoder, and further models domain-invariant interaction between humans and action-correlated contexts. We conduct extensive experiments on three benchmarks, namely UCF-HMDB, Kinetics-NecDrone and EPIC-Kitchens-UDA, and the state-of-the-art performance demonstrates the effectiveness of our proposed HCTransformer.
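As a high-level illustration of the human-context interaction the abstract describes, the sketch below lets human tokens attend to context tokens with standard cross-attention. The module name, dimensions, and residual-plus-norm layout are assumptions; HCTransformer's actual blocks are not specified in the abstract.

```python
import torch
import torch.nn as nn

class HumanContextInteraction(nn.Module):
    """Illustrative cross-attention block: human tokens query action-correlated
    context tokens, mimicking the decoupled human/context design at a high level."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, human_tokens, context_tokens):
        # human_tokens: (B, T, dim) from a human encoder;
        # context_tokens: (B, L, dim) from a context encoder.
        out, _ = self.attn(human_tokens, context_tokens, context_tokens)
        return self.norm(human_tokens + out)   # residual + norm, Transformer style

h = torch.randn(2, 16, 256)
c = torch.randn(2, 49, 256)
fused = HumanContextInteraction()(h, c)        # (2, 16, 256)
```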
Class-Incremental Learning: A Survey
Pub Date: 2024-07-16 | DOI: 10.1109/TPAMI.2024.3429383
Da-Wei Zhou, Qi-Wei Wang, Zhi-Hong Qi, Han-Jia Ye, De-Chuan Zhan, Ziwei Liu
Deep models, e.g., CNNs and Vision Transformers, have achieved impressive results in many closed-world vision tasks. However, novel classes emerge from time to time in our ever-changing world, requiring a learning system to acquire new knowledge continually. Class-Incremental Learning (CIL) enables the learner to incorporate the knowledge of new classes incrementally and build a universal classifier among all seen classes. Correspondingly, when directly training the model with new class instances, a fatal problem occurs: the model tends to catastrophically forget the characteristics of former classes, and its performance drastically degrades. There have been numerous efforts to tackle catastrophic forgetting in the machine learning community. In this paper, we comprehensively survey recent advances in class-incremental learning and summarize these methods from several aspects. We also provide a rigorous and unified evaluation of 17 methods on benchmark image classification tasks to empirically identify the characteristics of different algorithms. Furthermore, we notice that the current comparison protocol ignores the influence of the memory budget of model storage, which may result in unfair comparisons and biased results. Hence, we advocate fair comparison by aligning the memory budget in evaluation, along with several memory-agnostic performance measures. The source code is available at https://github.com/zhoudw-zdw/CIL_Survey/.
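The memory-budget alignment the survey advocates can be made concrete with simple accounting: a method's footprint counts both model parameters and stored exemplars. The sketch below compares two hypothetical methods whose total budgets are roughly aligned even though one stores no exemplars; all sizes are illustrative.

```python
# Sketch of memory-budget accounting: total footprint = model parameters plus
# stored exemplars, so fair comparison aligns this total rather than the
# exemplar count alone. Sizes and methods below are illustrative.

def budget_mb(num_params, num_exemplars, image_bytes=3 * 224 * 224, param_bytes=4):
    """Total storage in MB: float32 parameters plus raw uint8 exemplar images."""
    return (num_params * param_bytes + num_exemplars * image_bytes) / 2**20

# An exemplar-free method with a large backbone vs. a replay method with a small one:
print(budget_mb(num_params=86_000_000, num_exemplars=0))     # ~328 MB
print(budget_mb(num_params=11_000_000, num_exemplars=2000))  # ~329 MB
```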
Learning Many-to-Many Mapping for Unpaired Real-World Image Super-resolution and Downscaling
Pub Date: 2024-07-16 | DOI: 10.1109/TPAMI.2024.3428546
Wanjie Sun, Zhenzhong Chen
Learning-based single image super-resolution (SISR) for real-world images has been an active research topic yet a challenging task, due to the lack of paired low-resolution (LR) and high-resolution (HR) training images. Most existing unsupervised real-world SISR methods adopt a two-stage training strategy: first synthesizing realistic LR images from their HR counterparts, then training the super-resolution (SR) models in a supervised manner. However, in this strategy the image degradation and SR models are trained separately, ignoring the inherent mutual dependency between downscaling and its inverse upscaling process. Additionally, the ill-posed nature of image degradation is not fully considered. In this paper, we propose an image downscaling and SR model dubbed SDFlow, which simultaneously learns a bidirectional many-to-many mapping between real-world LR and HR images in an unsupervised manner. The main idea of SDFlow is to decouple image content and degradation information in the latent space, where the content information distributions of LR and HR images are matched in a common latent space. Degradation information of the LR images and high-frequency information of the HR images are fitted to an easy-to-sample conditional distribution. Experimental results on real-world image SR datasets indicate that SDFlow can generate diverse, realistic LR and SR images both quantitatively and qualitatively.
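The abstract does not detail SDFlow's architecture, but its name and the easy-to-sample conditional distribution suggest a normalizing-flow design. As background only, the sketch below implements the generic invertible building block of such models, a RealNVP-style affine coupling layer; this is a textbook component, not SDFlow itself.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Minimal RealNVP-style affine coupling layer: a generic invertible
    building block of flow models. A textbook sketch, not SDFlow's architecture."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim // 2, 64), nn.ReLU(),
                                 nn.Linear(64, dim))   # predicts scale and shift

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        s, t = self.net(x1).chunk(2, dim=-1)
        y2 = x2 * torch.exp(s) + t                     # invertible affine transform
        log_det = s.sum(dim=-1)                        # log-determinant of the Jacobian
        return torch.cat([x1, y2], dim=-1), log_det

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=-1)
        s, t = self.net(y1).chunk(2, dim=-1)
        return torch.cat([y1, (y2 - t) * torch.exp(-s)], dim=-1)

layer = AffineCoupling(dim=8)
x = torch.randn(4, 8)
y, log_det = layer(x)
assert torch.allclose(layer.inverse(y), x, atol=1e-5)  # bijectivity check
```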
Markov Progressive Framework, a Universal Paradigm for Modeling Long Videos
Pub Date: 2024-07-12 | DOI: 10.1109/TPAMI.2024.3426998
Bo Pang, Gao Peng, Yizhuo Li, Cewu Lu
Compared to images, video, as an increasingly mainstream visual medium, contains more semantic information. For this reason, the computational complexity of video models is an order of magnitude larger than that of their image-level counterparts, and it grows with the square of the number of frames. Constrained by computational resources, training video models to learn long-term temporal semantics end-to-end is quite a challenge. Currently, the mainstream method is to split a raw video into clips, leading to incomplete, fragmentary temporal information flow and failure to model long-term semantics. To solve this problem, in this paper we design the Markov Progressive framework (MaPro), a theoretical framework consisting of a progressive modeling method and a paradigm model tailored to it. Inspired by natural language processing techniques for long sentences, the core idea of MaPro is to find a paradigm model consisting of the proposed Markov operators which can be trained in multiple sequential steps, while ensuring that the multi-step progressive modeling is equivalent to conventional end-to-end modeling. By training the paradigm model under the progressive method, we are able to model long videos end-to-end with limited resources and ensure the effective transmission of long-term temporal information. We provide detailed implementations of this theoretical system on mainstream CNN- and Transformer-based models, which are modified to conform to the Markov paradigm. The theoretical paradigm, as a basic model, is the lower bound of model efficiency; with it, we further explore more sophisticated designs for CNN- and Transformer-based methods specifically. As a general and robust training method, it experimentally yields significant performance improvements on different backbones and datasets. As an illustrative example, the proposed method improves the SlowOnly network by 4.1 mAP on Charades and 2.5 points of top-1 accuracy on Kinetics, and for TimeSformer, MaPro improves performance on Kinetics by 2.0 points of top-1 accuracy. Importantly, all these improvements come with little parameter and computation overhead. We hope the MaPro method can provide the community with new insight into modeling long videos.
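To illustrate the progressive, Markov-style training idea at a high level, the sketch below trains on a long video clip by clip, carrying a recurrent state across steps and detaching it between them so each step's graph stays small while temporal information still flows forward. The GRU stand-in and all sizes are assumptions; the paper's Markov operators are instead built into CNN and Transformer backbones.

```python
import torch
import torch.nn as nn

# Progressive clip-by-clip training: a carried state plays the role of the Markov
# operator's memory, and detaching it bounds per-step compute while still
# propagating long-term temporal information forward.
encoder = nn.Linear(512, 256)                   # per-frame feature projection
temporal = nn.GRU(256, 256, batch_first=True)   # carries state across clips
head = nn.Linear(256, 10)
opt = torch.optim.SGD([*encoder.parameters(), *temporal.parameters(),
                       *head.parameters()], lr=0.01)

video = torch.randn(2, 64, 512)                 # (batch, frames, feat): one long video
label = torch.tensor([3, 7])
state = None
for clip in video.split(16, dim=1):             # sequential 16-frame steps
    feats, state = temporal(encoder(clip), state)
    loss = nn.functional.cross_entropy(head(feats[:, -1]), label)
    opt.zero_grad()
    loss.backward()                             # gradients stay within this step
    opt.step()
    state = state.detach()                      # Markov link: pass state, cut the graph
```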