
Latest publications: IEEE Transactions on Pattern Analysis and Machine Intelligence

Unsupervised Degradation Representation Learning for Unpaired Restoration of Images and Point Clouds.
Pub Date : 2024-10-30 DOI: 10.1109/TPAMI.2024.3471571
Longguang Wang, Yulan Guo, Yingqian Wang, Xiaoyu Dong, Qingyu Xu, Jungang Yang, Wei An

Restoration tasks in low-level vision aim to restore high-quality (HQ) data from their low-quality (LQ) observations. To circumvent the difficulty of acquiring paired data in real scenarios, unpaired approaches that aim to restore HQ data solely from unpaired data are drawing increasing interest. Since restoration tasks are tightly coupled with the degradation model, unknown and highly diverse degradations in real scenarios make learning from unpaired data quite challenging. In this paper, we propose a degradation representation learning scheme to address this challenge. By learning to distinguish various degradations in the representation space, our degradation representations can extract implicit degradation information in an unsupervised manner. Moreover, to handle diverse degradations, we develop degradation-aware (DA) convolutions with flexible adaptation to various degradations to fully exploit the degradation information in the learned representations. Based on our degradation representations and DA convolutions, we introduce a generic framework for unpaired restoration tasks, from which we derive UnIRnet and UnPRnet for unpaired image and point cloud restoration, respectively. It is demonstrated that our degradation representation learning scheme can extract discriminative representations to obtain accurate degradation information. Experiments on unpaired image and point cloud restoration tasks show that our UnIRnet and UnPRnet achieve state-of-the-art performance.
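A minimal sketch of one plausible form of degradation-aware convolution, not the authors' implementation: a degradation embedding (learned elsewhere, unsupervised) is mapped to per-channel modulation weights that adapt a standard convolution to the degradation at hand. Module and argument names are illustrative assumptions.

```python
import torch
import torch.nn as nn


class DAConv(nn.Module):
    """Hypothetical degradation-aware convolution: conv output modulated per channel."""

    def __init__(self, in_ch: int, out_ch: int, deg_dim: int = 64):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        # Map the degradation representation to channel-wise scales in (0, 1).
        self.modulator = nn.Sequential(nn.Linear(deg_dim, out_ch), nn.Sigmoid())

    def forward(self, x: torch.Tensor, deg_repr: torch.Tensor) -> torch.Tensor:
        feat = self.conv(x)                        # (B, out_ch, H, W)
        scale = self.modulator(deg_repr)           # (B, out_ch)
        return feat * scale.unsqueeze(-1).unsqueeze(-1)


if __name__ == "__main__":
    lq = torch.randn(2, 3, 64, 64)       # low-quality input images
    deg = torch.randn(2, 64)             # degradation representations
    print(DAConv(3, 32)(lq, deg).shape)  # torch.Size([2, 32, 64, 64])
```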

Citations: 0
Noise Self-Regression: A New Learning Paradigm to Enhance Low-Light Images Without Task-Related Data.
Pub Date : 2024-10-28 DOI: 10.1109/TPAMI.2024.3487361
Zhao Zhang, Suiyi Zhao, Xiaojie Jin, Mingliang Xu, Yi Yang, Shuicheng Yan, Meng Wang

Deep learning-based low-light image enhancement (LLIE) is a task of leveraging deep neural networks to enhance the image illumination while keeping the image content unchanged. From the perspective of training data, existing methods complete the LLIE task driven by one of the following three data types: paired data, unpaired data and zero-reference data. Each type of these data-driven methods has its own advantages, e.g., zero-reference data-based methods have very low requirements on training data and can meet the human needs in many scenarios. In this paper, we leverage pure Gaussian noise to complete the LLIE task, which further reduces the requirements for training data in LLIE tasks and can be used as another alternative in practical use. Specifically, we propose Noise SElf-Regression (NoiSER), which, without access to any task-related data, simply learns a convolutional neural network equipped with an instance-normalization layer by taking a random noise image, N(0, σ²) for each pixel, as both input and output for each training pair; the low-light image is then fed to the trained network to predict the normal-light image. Technically, an intuitive explanation for its effectiveness is as follows: 1) the self-regression reconstructs the contrast between adjacent pixels of the input image, 2) the instance-normalization layer may naturally remediate the overall magnitude/lighting of the input image, and 3) the N(0, σ²) assumption for each pixel enforces the output image to follow the well-known gray-world hypothesis [1] when the image size is big enough. Compared to current state-of-the-art LLIE methods with access to different task-related data, NoiSER is highly competitive in enhancement quality, yet with a much smaller model size, and much lower training and inference cost. In addition, the experiments also demonstrate that NoiSER has great potential in overexposure suppression and joint processing with other restoration tasks.
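A minimal sketch of the noise self-regression idea as described in the abstract: a small CNN with instance normalization is trained to map Gaussian-noise images onto themselves, and at inference a low-light image is passed through the trained network. Network depth, channel width, σ and the optimizer settings are illustrative guesses, not the paper's configuration.

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.InstanceNorm2d(32), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.InstanceNorm2d(32), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
sigma = 0.1

for step in range(200):                         # self-regression on pure noise
    noise = sigma * torch.randn(4, 3, 64, 64)   # N(0, sigma^2) per pixel, input == target
    loss = nn.functional.mse_loss(net(noise), noise)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Inference: the instance-norm layers re-normalize the low-light input's statistics.
low_light = torch.rand(1, 3, 64, 64) * 0.2      # stand-in for a dark image
with torch.no_grad():
    enhanced = net(low_light)
```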

Citations: 0
Disentangling Before Composing: Learning Invariant Disentangled Features for Compositional Zero-Shot Learning.
Pub Date : 2024-10-28 DOI: 10.1109/TPAMI.2024.3487222
Tian Zhang, Kongming Liang, Ruoyi Du, Wei Chen, Zhanyu Ma

Compositional Zero-Shot Learning (CZSL) aims to recognize novel compositions using knowledge learned from seen attribute-object compositions in the training set. Previous works mainly project an image and its corresponding composition into a common embedding space to measure their compatibility score. However, both attributes and objects share the visual representations learned above, leading the model to exploit spurious correlations and to be biased towards seen compositions. Instead, we reconsider CZSL as an out-of-distribution generalization problem. If an object is treated as a domain, we can learn object-invariant features to recognize attributes attached to any object reliably, and vice versa. Specifically, we propose an invariant feature learning framework to align different domains at the representation and gradient levels to capture the intrinsic characteristics associated with the tasks. To further facilitate and encourage the disentanglement of attributes and objects, we propose an "encoding-reshuffling-decoding" process to help the model avoid spurious correlations by randomly regrouping the disentangled features into synthetic features. Ultimately, our method improves generalization by learning to disentangle features that represent two independent factors of attributes and objects. Experiments demonstrate that the proposed method achieves state-of-the-art or competitive performance in both closed-world and open-world scenarios. Codes are available at https://github.com/PRIS-CV/Disentangling-before-Composing.
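A minimal sketch (my own reading of the mechanics, not the released code) of the "encoding-reshuffling-decoding" step: attribute and object features are disentangled by separate encoders, the attribute features are permuted across the batch, and the recombined pairs are decoded into synthetic composition features. All layer sizes are illustrative.

```python
import torch
import torch.nn as nn

feat_dim, attr_dim, obj_dim = 512, 128, 128
attr_enc = nn.Linear(feat_dim, attr_dim)      # encodes the attribute factor
obj_enc = nn.Linear(feat_dim, obj_dim)        # encodes the object factor
decoder = nn.Linear(attr_dim + obj_dim, feat_dim)

x = torch.randn(8, feat_dim)                  # backbone image features
a, o = attr_enc(x), obj_enc(x)                # disentangled factors

perm = torch.randperm(a.size(0))              # reshuffle attributes across the batch
synthetic = decoder(torch.cat([a[perm], o], dim=1))
# `synthetic` pairs each object with a randomly drawn attribute, the kind of
# recombination that discourages spurious attribute-object correlations.
```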

Citations: 0
PSRR-MaxpoolNMS++: Fast Non-Maximum Suppression with Discretization and Pooling.
Pub Date : 2024-10-28 DOI: 10.1109/TPAMI.2024.3485898
Tianyi Zhang, Chunyun Chen, Yun Liu, Xue Geng, Mohamed M Sabry Aly, Jie Lin

Non-maximum suppression (NMS) is an essential post-processing step for object detection. The de facto standard for NMS, namely GreedyNMS, is not parallelizable and could thus be the performance bottleneck in object detection pipelines. MaxpoolNMS is introduced as a fast and parallelizable alternative to GreedyNMS. However, MaxpoolNMS is only capable of replacing the GreedyNMS at the first stage of two-stage detectors like Faster R-CNN. To address this issue, we observe that MaxpoolNMS employs box coordinate discretization followed by local score argmax calculation, discarding the nested-loop pipeline of GreedyNMS and enabling parallelizable implementations. In this paper, we introduce a simple Relationship Recovery module and a Pyramid Shifted MaxpoolNMS module to improve the above two stages, respectively. With these two modules, our PSRR-MaxpoolNMS is a generic and parallelizable approach, which can completely replace GreedyNMS at all stages in all detectors. Furthermore, we extend PSRR-MaxpoolNMS to the more powerful PSRR-MaxpoolNMS++. As for box coordinate discretization, we propose Density-based Discretization for better adherence to the target density of the suppression. As for local score argmax calculation, we propose an Adjacent Scale Pooling scheme for mining out the duplicated box pairs more accurately and efficiently. Extensive experiments demonstrate that both our PSRR-MaxpoolNMS and PSRR-MaxpoolNMS++ outperform MaxpoolNMS by a large margin. Additionally, PSRR-MaxpoolNMS++ not only surpasses PSRR-MaxpoolNMS but also attains competitive accuracy and much better efficiency when compared with GreedyNMS. Therefore, PSRR-MaxpoolNMS++ is a parallelizable NMS solution that can effectively replace GreedyNMS at all stages in all detectors.
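A heavily simplified sketch of maxpool-style NMS (not the paper's full PSRR pipeline): box scores are scattered onto a grid indexed by discretized box centers, a max pooling pass finds local score maxima, and only boxes that are the argmax of their neighborhood are kept. Grid and kernel sizes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def maxpool_nms(boxes: torch.Tensor, scores: torch.Tensor,
                grid: int = 64, kernel: int = 3) -> torch.Tensor:
    """boxes: (N, 4) as (x1, y1, x2, y2) in [0, 1]; returns indices of kept boxes."""
    cx = ((boxes[:, 0] + boxes[:, 2]) / 2 * (grid - 1)).long().clamp(0, grid - 1)
    cy = ((boxes[:, 1] + boxes[:, 3]) / 2 * (grid - 1)).long().clamp(0, grid - 1)
    score_map = torch.zeros(grid, grid)
    score_map[cy, cx] = scores            # discretization; cell collisions keep the last write
    pooled = F.max_pool2d(score_map[None, None], kernel, stride=1,
                          padding=kernel // 2)[0, 0]
    keep = scores >= pooled[cy, cx]       # a box survives only if it is its neighborhood's argmax
    return keep.nonzero(as_tuple=True)[0]


xy = torch.rand(100, 2) * 0.8             # toy boxes of side 0.1
boxes = torch.cat([xy, xy + 0.1], dim=1)
print(maxpool_nms(boxes, torch.rand(100)))
```

Unlike GreedyNMS, every box is tested against the pooled score map in one vectorized pass, which is what makes the scheme parallelizable.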

Citations: 0
FLAC: Fairness-Aware Representation Learning by Suppressing Attribute-Class Associations.
Pub Date : 2024-10-28 DOI: 10.1109/TPAMI.2024.3487254
Ioannis Sarridis, Christos Koutlis, Symeon Papadopoulos, Christos Diou

Bias in computer vision systems can perpetuate or even amplify discrimination against certain populations. Considering that bias is often introduced by biased visual datasets, many recent research efforts focus on training fair models using such data. However, most of them heavily rely on the availability of protected attribute labels in the dataset, which limits their applicability, while label-unaware approaches, i.e., approaches operating without such labels, exhibit considerably lower performance. To overcome these limitations, this work introduces FLAC, a methodology that minimizes mutual information between the features extracted by the model and a protected attribute, without the use of attribute labels. To do that, FLAC proposes a sampling strategy that highlights underrepresented samples in the dataset, and casts the problem of learning fair representations as a probability matching problem that leverages representations extracted by a bias-capturing classifier. It is theoretically shown that FLAC can indeed lead to fair representations that are independent of the protected attributes. FLAC surpasses the current state-of-the-art on Biased-MNIST, CelebA, and UTKFace, by 29.1%, 18.1%, and 21.9%, respectively. Additionally, FLAC exhibits 2.2% increased accuracy on ImageNet-A and up to 4.2% increased accuracy on Corrupted-Cifar10. Finally, in most experiments, FLAC even outperforms the bias label-aware state-of-the-art methods.
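A rough, heavily simplified sketch (my own rendering, not the paper's exact loss) of the probability-matching idea: pairwise-similarity distributions of the trained model's features are matched against the complement of the similarities produced by a frozen bias-capturing classifier, discouraging the model from encoding the protected attribute. Temperatures and the KL direction are assumptions.

```python
import torch
import torch.nn.functional as F


def fairness_matching_loss(feats: torch.Tensor, bias_feats: torch.Tensor,
                           tau: float = 0.1) -> torch.Tensor:
    sim = F.normalize(feats) @ F.normalize(feats).T            # model similarities
    bias_sim = F.normalize(bias_feats) @ F.normalize(bias_feats).T
    p = F.softmax(sim / tau, dim=1)
    # Target: high probability exactly where the bias-capturing classifier is dissimilar.
    q = F.softmax(-bias_sim / tau, dim=1)
    return F.kl_div(p.log(), q, reduction="batchmean")


feats = torch.randn(16, 128, requires_grad=True)   # main-model features
bias_feats = torch.randn(16, 128)                  # frozen bias-capturing features
print(fairness_matching_loss(feats, bias_feats))
```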

Citations: 0
Fast Window-Based Event Denoising with Spatiotemporal Correlation Enhancement.
Pub Date : 2024-10-10 DOI: 10.1109/TPAMI.2024.3467709
Huachen Fang, Jinjian Wu, Qibin Hou, Weisheng Dong, Guangming Shi

Previous deep learning-based event denoising methods mostly suffer from poor interpretability and difficulty in real-time processing due to their complex architecture designs. In this paper, we propose window-based event denoising, which simultaneously deals with a stack of events, whereas existing element-based denoising focuses on one event at a time. Besides, we give a theoretical analysis based on probability distributions in both the temporal and spatial domains to improve interpretability. In the temporal domain, we use timestamp deviations between processing events and the central event to judge the temporal correlation and filter out temporal-irrelevant events. In the spatial domain, we choose maximum a posteriori (MAP) to discriminate real-world events and noise and use the learned convolutional sparse coding to optimize the objective function. Based on the theoretical analysis, we build a Temporal Window (TW) module and a Soft Spatial Feature Embedding (SSFE) module to process temporal and spatial information separately, and construct a novel multi-scale window-based event denoising network, named WedNet. The high denoising accuracy and fast running speed of our WedNet enable us to achieve real-time denoising in complex scenes. Extensive experimental results verify the effectiveness and robustness of our WedNet. Our algorithm can remove event noise effectively and efficiently and improve the performance of downstream tasks.
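A minimal sketch of the temporal side of window-based denoising (thresholds and the event field layout are assumptions, not the paper's TW module): within a window of events, an event is kept only if its timestamp deviation from the central event is small, i.e., it is temporally correlated with the surrounding activity.

```python
import numpy as np


def temporal_window_filter(events: np.ndarray, max_dev_us: float = 5000.0) -> np.ndarray:
    """events: (N, 4) array of (x, y, t_microseconds, polarity), sorted by t."""
    t = events[:, 2]
    t_center = t[len(t) // 2]                  # timestamp of the central event
    keep = np.abs(t - t_center) <= max_dev_us  # drop temporal-irrelevant events
    return events[keep]


rng = np.random.default_rng(0)
window = np.stack([rng.integers(0, 640, 256), rng.integers(0, 480, 256),
                   np.sort(rng.uniform(0, 20000, 256)), rng.integers(0, 2, 256)], axis=1)
print(temporal_window_filter(window).shape)
```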

Citations: 0
Robust Multimodal Learning with Missing Modalities via Parameter-Efficient Adaptation.
Pub Date : 2024-10-10 DOI: 10.1109/TPAMI.2024.3476487
Md Kaykobad Reza, Ashley Prater-Bennette, M Salman Asif

Multimodal learning seeks to utilize data from multiple sources to improve the overall performance of downstream tasks. It is desirable for redundancies in the data to make multimodal systems robust to missing or corrupted observations in some correlated modalities. However, we observe that the performance of several existing multimodal networks significantly deteriorates if one or multiple modalities are absent at test time. To enable robustness to missing modalities, we propose a simple and parameter-efficient adaptation procedure for pretrained multimodal networks. In particular, we exploit modulation of intermediate features to compensate for the missing modalities. We demonstrate that such adaptation can partially bridge the performance drop due to missing modalities and, in some cases, outperform independent, dedicated networks trained for the available modality combinations. The proposed adaptation requires an extremely small number of parameters (e.g., fewer than 1% of the total parameters) and is applicable to a wide range of modality combinations and tasks. We conduct a series of experiments to highlight the missing-modality robustness of our proposed method on five different multimodal tasks across seven datasets. Our proposed method demonstrates versatility across various tasks and datasets, and outperforms existing methods for robust multimodal learning with missing modalities.
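A minimal sketch (module placement and sizes are assumptions) of adapting a frozen multimodal network by modulating intermediate features: a tiny adapter produces a per-channel scale and shift conditioned on the modality-availability mask, and only this adapter is trained.

```python
import torch
import torch.nn as nn


class MissingModalityAdapter(nn.Module):
    """Hypothetical feature-modulation adapter conditioned on modality availability."""

    def __init__(self, channels: int, num_modalities: int):
        super().__init__()
        self.to_scale_shift = nn.Linear(num_modalities, 2 * channels)

    def forward(self, feat: torch.Tensor, avail_mask: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W); avail_mask: (B, M) with 1 = modality present, 0 = missing
        scale, shift = self.to_scale_shift(avail_mask).chunk(2, dim=1)
        return feat * (1 + scale[..., None, None]) + shift[..., None, None]


adapter = MissingModalityAdapter(channels=256, num_modalities=3)
feat = torch.randn(2, 256, 32, 32)             # intermediate feature of a frozen backbone
mask = torch.tensor([[1., 1., 0.], [1., 0., 0.]])
print(adapter(feat, mask).shape)               # torch.Size([2, 256, 32, 32])
print(sum(p.numel() for p in adapter.parameters()))  # a tiny fraction of a backbone
```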

Citations: 0
Continuous-time Object Segmentation using High Temporal Resolution Event Camera.
Pub Date : 2024-10-10 DOI: 10.1109/TPAMI.2024.3477591
Lin Zhu, Xianzhang Chen, Lizhi Wang, Xiao Wang, Yonghong Tian, Hua Huang

Event cameras are novel bio-inspired sensors, where individual pixels operate independently and asynchronously, generating intensity changes as events. Leveraging the microsecond resolution (no motion blur) and high dynamic range (compatible with extreme light conditions) of events, there is considerable promise in directly segmenting objects from sparse and asynchronous event streams in various applications. However, different from the rich cues in video object segmentation, it is challenging to segment complete objects from the sparse event stream. In this paper, we present the first framework for continuous-time object segmentation from event stream. Given the object mask at the initial time, our task aims to segment the complete object at any subsequent time in event streams. Specifically, our framework consists of a Recurrent Temporal Embedding Extraction (RTEE) module based on a novel ResLSTM, a Cross-time Spatiotemporal Feature Modeling (CSFM) module which is a transformer architecture with long-term and short-term matching modules, and a segmentation head. The historical events and masks (reference sets) are recurrently fed into our framework along with current-time events. The temporal embedding is updated as new events are input, enabling our framework to continuously process the event stream. To train and test our model, we construct both real-world and simulated event-based object segmentation datasets, each comprising event streams, APS images, and object annotations. Extensive experiments on our datasets demonstrate the effectiveness of the proposed recurrent architecture. Our code and dataset are available at https://sites.google.com/view/ecos-net/.
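A generic recurrent-update skeleton (not the paper's ResLSTM or CSFM modules; sizes, the event voxelization, and the toy head are assumptions) showing how a temporal embedding can be updated as new event slices arrive, so that the segmentation state can be queried at any time.

```python
import torch
import torch.nn as nn

encoder = nn.Conv2d(5, 32, 3, padding=1)   # 5 temporal bins of an event voxel grid
rnn = nn.LSTMCell(32, 32)                  # recurrent temporal embedding
head = nn.Linear(32, 2)                    # object vs. background logits (toy head)

h = c = torch.zeros(1, 32)
for step in range(10):                     # event stream processed slice by slice
    voxels = torch.randn(1, 5, 64, 64)     # stand-in for the events in this slice
    emb = encoder(voxels).mean(dim=(2, 3)) # (1, 32) global embedding of the slice
    h, c = rnn(emb, (h, c))                # temporal embedding updated continuously
logits = head(h)                           # query the segmentation state at this time
```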

Citations: 0
Dual-Grained Lightweight Strategy
Pub Date : 2024-10-10 DOI: 10.1109/TPAMI.2024.3437421
Debin Liu;Xiang Bai;Ruonan Zhao;Xianjun Deng;Laurence T. Yang
Removing redundant parameters and computations before model training has attracted great interest, as it can effectively reduce the storage space of the model, speed up the training and inference of the model, and save energy consumption during the running of the model. In addition, the simplification of deep neural network models can enable high-performance network models to be deployed to resource-constrained edge devices, thus promoting the development of the intelligent world. However, current pruning-at-initialization methods exhibit poor performance at extreme sparsity. In order to improve the performance of the model under extreme sparsity, this paper proposes a dual-grained lightweight strategy, TEDEPR. TEDEPR is the first pruning-at-initialization method to use tensor theory to optimize the structure of a sparse sub-network model and improve its performance. Specifically, first, at the coarse-grained level, we represent the weight matrix or weight tensor of the model as a low-rank tensor decomposition and use multi-step chain operations to enhance the feature extraction capability of the base module, constructing a low-rank compact network model. Second, unimportant weights are pruned at a fine-grained level based on the trainability of the weights in the low-rank model before training, resulting in the final compressed model. To evaluate the superiority of TEDEPR, we conducted extensive experiments on the MNIST, UCF11, CIFAR-10, CIFAR-100, Tiny-ImageNet and ImageNet datasets with LeNet, LSTM, VGGNet, ResNet and Transformer architectures, and compared with state-of-the-art methods. The experimental results show that TEDEPR has higher accuracy, faster training and inference, and less storage space than other pruning-at-initialization methods under extreme sparsity.
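A minimal sketch of the two granularities (the rank choice and pruning criterion here are assumptions, not the paper's exact procedure): a weight matrix is first replaced by a low-rank factorization (coarse-grained), then the smallest-magnitude factor entries are zeroed out before training begins (fine-grained).

```python
import torch
import torch.nn as nn

layer = nn.Linear(512, 512)
W = layer.weight.data                           # (512, 512) dense weight

# Coarse-grained: rank-r factorization W ≈ U @ V via truncated SVD.
r = 32
U_, S_, Vh_ = torch.linalg.svd(W, full_matrices=False)
U = U_[:, :r] * S_[:r]                          # (512, r)
V = Vh_[:r, :].clone()                          # (r, 512)

# Fine-grained: prune the least-important factor entries at initialization.
sparsity = 0.9
for factor in (U, V):
    threshold = factor.abs().flatten().kthvalue(int(sparsity * factor.numel())).values
    factor[factor.abs() < threshold] = 0.0

params_kept = (U != 0).sum() + (V != 0).sum()
print(f"kept {params_kept.item()} of {W.numel()} original parameters")
```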
Citations: 0
Changen2: Multi-Temporal Remote Sensing Generative Change Foundation Model.
Pub Date : 2024-10-10 DOI: 10.1109/TPAMI.2024.3475824
Zhuo Zheng, Stefano Ermon, Dongjun Kim, Liangpei Zhang, Yanfei Zhong

Our understanding of the temporal dynamics of the Earth's surface has been significantly advanced by deep vision models, which often require a massive amount of labeled multi-temporal images for training. However, collecting, preprocessing, and annotating multi-temporal remote sensing images at scale is non-trivial since it is expensive and knowledge-intensive. In this paper, we present scalable multi-temporal change data generators based on generative models, which are cheap and automatic, alleviating these data problems. Our main idea is to simulate a stochastic change process over time. We describe the stochastic change process as a probabilistic graphical model, namely the generative probabilistic change model (GPCM), which factorizes the complex simulation problem into two more tractable sub-problems, i.e., condition-level change event simulation and image-level semantic change synthesis. To solve these two problems, we present Changen2, a GPCM implemented with a resolution-scalable diffusion transformer which can generate time series of remote sensing images and corresponding semantic and change labels from labeled and even unlabeled single-temporal images. Changen2 is a "generative change foundation model" that can be trained at scale via self-supervision, and is capable of producing change supervisory signals from unlabeled single-temporal images. Unlike existing "foundation models", our generative change foundation model synthesizes change data to train task-specific foundation models for change detection. The resulting model possesses inherent zero-shot change detection capabilities and excellent transferability. Comprehensive experiments suggest Changen2 has superior spatiotemporal scalability in data generation, e.g., a Changen2 model trained on 256² pixel single-temporal images can yield time series of any length at resolutions of 1,024² pixels. Changen2 pre-trained models exhibit superior zero-shot performance (narrowing the performance gap to 3% on LEVIR-CD and approximately 10% on both S2Looking and SECOND, compared to the fully supervised counterpart) and transferability across multiple types of change tasks, including ordinary and off-nadir building change, land-use/land-cover change, and disaster assessment. The model and datasets are available at https://github.com/Z-Zheng/pytorch-change-models.
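A schematic sketch (entirely my own simplification, not the released model) of the two-level factorization described above: a condition-level step stochastically edits the semantic map of the current time step, and an image-level step, here a placeholder for a trained diffusion generator, synthesizes the post-change image from the edited map; the change label then comes for free from the pair of maps.

```python
import numpy as np

rng = np.random.default_rng(0)


def simulate_change_event(semantic: np.ndarray, p_flip: float = 0.02) -> np.ndarray:
    """Condition-level: randomly toggle a small fraction of 'building' pixels."""
    flip = rng.random(semantic.shape) < p_flip
    return np.where(flip, 1 - semantic, semantic)


def synthesize_image(semantic: np.ndarray) -> np.ndarray:
    """Image-level: placeholder for a generative model conditioned on the map."""
    return semantic[..., None] * 0.7 + rng.normal(0, 0.05, (*semantic.shape, 3))


sem_t0 = (rng.random((256, 256)) < 0.1).astype(np.int64)   # toy single-temporal label
sem_t1 = simulate_change_event(sem_t0)
image_t1 = synthesize_image(sem_t1)
change_label = (sem_t0 != sem_t1).astype(np.int64)          # free change supervision
```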

Citations: 0