Pub Date : 2024-11-06DOI: 10.1109/TPAMI.2024.3492328
Yifan Zhao, Jia Li, Zeyin Song, Yonghong Tian
Depicting novel classes with language descriptions by observing few-shot samples is inherent in human-learning systems. This lifelong learning capability helps to distinguish new knowledge from old ones through the increase of open-world learning, namely Few-Shot Class-Incremental Learning (FSCIL). Existing works to solve this problem mainly rely on the careful tuning of visual encoders, which shows an evident trade-off between the base knowledge and incremental ones. Motivated by human learning systems, we propose a new Language-inspired Relation Transfer (LRT) paradigm to understand objects by joint visual clues and text depictions, composed of two major steps. We first transfer the pretrained text knowledge to the visual domains by proposing a graph relation transformation module and then fuse the visual and language embedding by a text-vision prototypical fusion module. Second, to mitigate the domain gap caused by visual finetuning, we propose context prompt learning for fast domain alignment and imagined contrastive learning to alleviate the insufficient text data during alignment. With collaborative learning of domain alignments and text-image transfer, our proposed LRT outperforms the state-of-the-art models by over 13% and 7% on the final session of miniImageNet and CIFAR-100 FSCIL benchmarks.
{"title":"Language-Inspired Relation Transfer for Few-Shot Class-Incremental Learning.","authors":"Yifan Zhao, Jia Li, Zeyin Song, Yonghong Tian","doi":"10.1109/TPAMI.2024.3492328","DOIUrl":"https://doi.org/10.1109/TPAMI.2024.3492328","url":null,"abstract":"<p><p>Depicting novel classes with language descriptions by observing few-shot samples is inherent in human-learning systems. This lifelong learning capability helps to distinguish new knowledge from old ones through the increase of open-world learning, namely Few-Shot Class-Incremental Learning (FSCIL). Existing works to solve this problem mainly rely on the careful tuning of visual encoders, which shows an evident trade-off between the base knowledge and incremental ones. Motivated by human learning systems, we propose a new Language-inspired Relation Transfer (LRT) paradigm to understand objects by joint visual clues and text depictions, composed of two major steps. We first transfer the pretrained text knowledge to the visual domains by proposing a graph relation transformation module and then fuse the visual and language embedding by a text-vision prototypical fusion module. Second, to mitigate the domain gap caused by visual finetuning, we propose context prompt learning for fast domain alignment and imagined contrastive learning to alleviate the insufficient text data during alignment. With collaborative learning of domain alignments and text-image transfer, our proposed LRT outperforms the state-of-the-art models by over 13% and 7% on the final session of miniImageNet and CIFAR-100 FSCIL benchmarks.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142591827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-11-06DOI: 10.1109/TPAMI.2024.3492259
Yipo Huang, Leida Li, Pengfei Chen, Haoning Wu, Weisi Lin, Guangming Shi
In the Image Aesthetics Computing (IAC) field, most prior methods leveraged the off-the-shelf backbones pre-trained on the large-scale ImageNet database. While these pre-trained backbones have achieved notable success, they often overemphasize object-level semantics and fail to capture the high-level concepts of image aesthetics, which may only achieve suboptimal performances. To tackle this long-neglected problem, we propose a multi-modality multi-attribute contrastive pre-training framework, targeting at constructing an alternative to ImageNet-based pre-training for IAC. Specifically, the proposed framework consists of two main aspects. (1) We build a multi-attribute image description database with human feedback, leveraging the competent image understanding capability of the multi-modality large language model to generate rich aesthetic descriptions. (2) To better adapt models to aesthetic computing tasks, we integrate the image-based visual features with the attribute-based text features, and map the integrated features into different embedding spaces, based on which the multi-attribute contrastive learning is proposed for obtaining more comprehensive aesthetic representation. To alleviate the distribution shift encountered when transitioning from the general visual domain to the aesthetic domain, we further propose a semantic affinity loss to restrain the content information and enhance model generalization. Extensive experiments demonstrate that the proposed framework sets new state-of-the-arts for IAC tasks. The code, database and pre-trained weights will be available at https://github.com/yipoh/AesNet.
{"title":"Multi-Modality Multi-Attribute Contrastive Pre-Training for Image Aesthetics Computing.","authors":"Yipo Huang, Leida Li, Pengfei Chen, Haoning Wu, Weisi Lin, Guangming Shi","doi":"10.1109/TPAMI.2024.3492259","DOIUrl":"https://doi.org/10.1109/TPAMI.2024.3492259","url":null,"abstract":"<p><p>In the Image Aesthetics Computing (IAC) field, most prior methods leveraged the off-the-shelf backbones pre-trained on the large-scale ImageNet database. While these pre-trained backbones have achieved notable success, they often overemphasize object-level semantics and fail to capture the high-level concepts of image aesthetics, which may only achieve suboptimal performances. To tackle this long-neglected problem, we propose a multi-modality multi-attribute contrastive pre-training framework, targeting at constructing an alternative to ImageNet-based pre-training for IAC. Specifically, the proposed framework consists of two main aspects. (1) We build a multi-attribute image description database with human feedback, leveraging the competent image understanding capability of the multi-modality large language model to generate rich aesthetic descriptions. (2) To better adapt models to aesthetic computing tasks, we integrate the image-based visual features with the attribute-based text features, and map the integrated features into different embedding spaces, based on which the multi-attribute contrastive learning is proposed for obtaining more comprehensive aesthetic representation. To alleviate the distribution shift encountered when transitioning from the general visual domain to the aesthetic domain, we further propose a semantic affinity loss to restrain the content information and enhance model generalization. Extensive experiments demonstrate that the proposed framework sets new state-of-the-arts for IAC tasks. The code, database and pre-trained weights will be available at https://github.com/yipoh/AesNet.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142591892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-11-04DOI: 10.1109/TPAMI.2024.3490777
Hao Chen, Francois Bremond, Nicu Sebe, Shiliang Zhang
Regular unsupervised domain adaptive person re-identification (ReID) focuses on adapting a model from a source domain to a fixed target domain. However, an adapted ReID model can hardly retain previously-acquired knowledge and generalize to unseen data. In this paper, we propose a Dual-level Joint Adaptation and Anti-forgetting (DJAA) framework, which incrementally adapts a model to new domains without forgetting source domain and each adapted target domain. We explore the possibility of using prototype and instance-level consistency to mitigate the forgetting during the adaptation. Specifically, we store a small number of representative image samples and corresponding cluster prototypes in a memory buffer, which is updated at each adaptation step. With the buffered images and prototypes, we regularize the image-to-image similarity and image-to-prototype similarity to rehearse old knowledge. After the multi-step adaptation, the model is tested on all seen domains and several unseen domains to validate the generalization ability of our method. Extensive experiments demonstrate that our proposed method significantly improves the anti-forgetting, generalization and backward-compatible ability of an unsupervised person ReID model.
{"title":"Anti-Forgetting Adaptation for Unsupervised Person Re-Identification.","authors":"Hao Chen, Francois Bremond, Nicu Sebe, Shiliang Zhang","doi":"10.1109/TPAMI.2024.3490777","DOIUrl":"https://doi.org/10.1109/TPAMI.2024.3490777","url":null,"abstract":"<p><p>Regular unsupervised domain adaptive person re-identification (ReID) focuses on adapting a model from a source domain to a fixed target domain. However, an adapted ReID model can hardly retain previously-acquired knowledge and generalize to unseen data. In this paper, we propose a Dual-level Joint Adaptation and Anti-forgetting (DJAA) framework, which incrementally adapts a model to new domains without forgetting source domain and each adapted target domain. We explore the possibility of using prototype and instance-level consistency to mitigate the forgetting during the adaptation. Specifically, we store a small number of representative image samples and corresponding cluster prototypes in a memory buffer, which is updated at each adaptation step. With the buffered images and prototypes, we regularize the image-to-image similarity and image-to-prototype similarity to rehearse old knowledge. After the multi-step adaptation, the model is tested on all seen domains and several unseen domains to validate the generalization ability of our method. Extensive experiments demonstrate that our proposed method significantly improves the anti-forgetting, generalization and backward-compatible ability of an unsupervised person ReID model.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142577347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Offset-based representation has emerged as a promising approach for modeling semantic relations between pixels and object motion, demonstrating efficacy across various computer vision tasks. In this paper, we introduce a novel one-stage multi-tasking network tailored to extend the offset-based approach to MOTS. Our proposed framework, named OffsetNet, is designed to concurrently address amodal bounding box detection, instance segmentation, and tracking. It achieves this by formulating these three tasks within a unified pixel-offset-based representation, thereby achieving excellent efficiency and encouraging mutual collaborations. OffsetNet achieves several remarkable properties: first, the encoder is empowered by a novel Memory Enhanced Linear Self-Attention (MELSA) block to efficiently aggregate spatial-temporal features; second, all tasks are decoupled fairly using three lightweight decoders that operate in a one-shot manner; third, a novel cross-frame offsets prediction module is proposed to enhance the robustness of tracking against occlusions. With these merits, OffsetNet achieves 76.83% HOTA on KITTI MOTS benchmark, which is the best result without relying on 3D detection. Furthermore, OffsetNet achieves 74.83% HOTA at 50 FPS on the KITTI MOT benchmark, which is nearly 3.3 times faster than CenterTrack with better performance. We hope our approach will serve as a solid baseline and encourage future research in this field.
基于偏移量的表示法已成为一种很有前途的方法,可用于模拟像素与物体运动之间的语义关系,在各种计算机视觉任务中都显示出功效。在本文中,我们介绍了一种新颖的单级多任务网络,专门用于将基于偏移的方法扩展到 MOTS。我们提出的框架名为 OffsetNet,旨在同时处理模态边界框检测、实例分割和跟踪。为了实现这一目标,我们将这三项任务置于统一的基于像素偏移的表示法中,从而实现了出色的效率,并鼓励了相互协作。OffsetNet 实现了几个显著的特性:首先,编码器由一个新颖的内存增强线性自保持(MELSA)模块授权,以有效地聚合空间-时间特征;其次,所有任务都通过三个轻量级解码器进行解耦,以单次方式运行;第三,提出了一个新颖的跨帧偏移预测模块,以增强跟踪对遮挡的鲁棒性。凭借这些优点,OffsetNet 在 KITTI MOTS 基准测试中取得了 76.83% 的 HOTA,这是在不依赖 3D 检测的情况下取得的最佳成绩。此外,OffsetNet 在 KITTI MOT 基准测试中以 50 FPS 的速度实现了 74.83% 的 HOTA,比性能更好的 CenterTrack 快了近 3.3 倍。我们希望我们的方法能成为一个坚实的基线,并鼓励未来在这一领域的研究。
{"title":"OffsetNet: Towards Efficient Multiple Object Tracking, Detection, and Segmentation.","authors":"Wei Zhang, Jiaming Li, Meng Xia, Xu Gao, Xiao Tan, Yifeng Shi, Zhenhua Huang, Guanbin Li","doi":"10.1109/TPAMI.2024.3485644","DOIUrl":"https://doi.org/10.1109/TPAMI.2024.3485644","url":null,"abstract":"<p><p>Offset-based representation has emerged as a promising approach for modeling semantic relations between pixels and object motion, demonstrating efficacy across various computer vision tasks. In this paper, we introduce a novel one-stage multi-tasking network tailored to extend the offset-based approach to MOTS. Our proposed framework, named OffsetNet, is designed to concurrently address amodal bounding box detection, instance segmentation, and tracking. It achieves this by formulating these three tasks within a unified pixel-offset-based representation, thereby achieving excellent efficiency and encouraging mutual collaborations. OffsetNet achieves several remarkable properties: first, the encoder is empowered by a novel Memory Enhanced Linear Self-Attention (MELSA) block to efficiently aggregate spatial-temporal features; second, all tasks are decoupled fairly using three lightweight decoders that operate in a one-shot manner; third, a novel cross-frame offsets prediction module is proposed to enhance the robustness of tracking against occlusions. With these merits, OffsetNet achieves 76.83% HOTA on KITTI MOTS benchmark, which is the best result without relying on 3D detection. Furthermore, OffsetNet achieves 74.83% HOTA at 50 FPS on the KITTI MOT benchmark, which is nearly 3.3 times faster than CenterTrack with better performance. We hope our approach will serve as a solid baseline and encourage future research in this field.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142577354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-11-04DOI: 10.1109/TPAMI.2024.3490776
Zhanzhou Feng, Shiliang Zhang
Existing Masked Image Modeling methods apply fixed mask patterns to guide the self-supervised training. As those mask patterns resort to different criteria to depict image contents, sticking to a fixed pattern leads to a limited vision cues modeling capability. This paper introduces an evolved hierarchical masking method to pursue general visual cues modeling in self-supervised learning. The proposed method leverages the vision model being trained to parse the input visual cues into a hierarchy structure, which is hence adopted to generate masks accordingly. The accuracy of hierarchy is on par with the capability of the model being trained, leading to evolved mask patterns at different training stages. Initially, generated masks focus on low-level visual cues to grasp basic textures, then gradually evolve to depict higher-level cues to reinforce the learning of more complicated object semantics and contexts. Our method does not require extra pre-trained models or annotations and ensures training efficiency by evolving the training difficulty. We conduct extensive experiments on seven downstream tasks including partial-duplicate image retrieval relying on low-level details, as well as image classification and semantic segmentation that require semantic parsing capability. Experimental results demonstrate that it substantially boosts performance across these tasks. For instance, it surpasses the recent MAE by 1.1% in imageNet-1K classification and 1.4% in ADE20K segmentation with the same training epochs. We also align the proposed method with the current research focus on LLMs. The proposed approach bridges the gap with large-scale pre-training on semantic demanding tasks and enhances intricate detail perception in tasks requiring low-level feature recognition.
现有的遮罩图像建模方法采用固定的遮罩模式来指导自我监督训练。由于这些遮罩模式采用不同的标准来描述图像内容,拘泥于固定模式导致视觉线索建模能力有限。本文介绍了一种进化的分层遮罩方法,以追求自我监督学习中的通用视觉线索建模。所提出的方法利用正在训练的视觉模型,将输入的视觉线索解析为层次结构,并据此生成遮罩。层次结构的准确性与所训练模型的能力相当,从而在不同的训练阶段产生不同的遮罩模式。最初,生成的遮罩侧重于低层次的视觉线索,以掌握基本的纹理,然后逐渐演变为描绘更高层次的线索,以加强对更复杂的物体语义和语境的学习。我们的方法不需要额外的预训练模型或注释,并通过不断提高训练难度来确保训练效率。我们在七个下游任务上进行了广泛的实验,包括依赖低级细节的部分重复图像检索,以及需要语义解析能力的图像分类和语义分割。实验结果表明,它大大提高了这些任务的性能。例如,在相同的训练历时下,它在 imageNet-1K 分类中的 MAE 高出 1.1%,在 ADE20K 分割中的 MAE 高出 1.4%。我们还将提出的方法与当前对 LLM 的研究重点相结合。所提出的方法弥补了在语义要求较高的任务中进行大规模预训练的不足,并增强了在需要低级特征识别的任务中对复杂细节的感知。
{"title":"Evolved Hierarchical Masking for Self-Supervised Learning.","authors":"Zhanzhou Feng, Shiliang Zhang","doi":"10.1109/TPAMI.2024.3490776","DOIUrl":"https://doi.org/10.1109/TPAMI.2024.3490776","url":null,"abstract":"<p><p>Existing Masked Image Modeling methods apply fixed mask patterns to guide the self-supervised training. As those mask patterns resort to different criteria to depict image contents, sticking to a fixed pattern leads to a limited vision cues modeling capability. This paper introduces an evolved hierarchical masking method to pursue general visual cues modeling in self-supervised learning. The proposed method leverages the vision model being trained to parse the input visual cues into a hierarchy structure, which is hence adopted to generate masks accordingly. The accuracy of hierarchy is on par with the capability of the model being trained, leading to evolved mask patterns at different training stages. Initially, generated masks focus on low-level visual cues to grasp basic textures, then gradually evolve to depict higher-level cues to reinforce the learning of more complicated object semantics and contexts. Our method does not require extra pre-trained models or annotations and ensures training efficiency by evolving the training difficulty. We conduct extensive experiments on seven downstream tasks including partial-duplicate image retrieval relying on low-level details, as well as image classification and semantic segmentation that require semantic parsing capability. Experimental results demonstrate that it substantially boosts performance across these tasks. For instance, it surpasses the recent MAE by 1.1% in imageNet-1K classification and 1.4% in ADE20K segmentation with the same training epochs. We also align the proposed method with the current research focus on LLMs. The proposed approach bridges the gap with large-scale pre-training on semantic demanding tasks and enhances intricate detail perception in tasks requiring low-level feature recognition.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142577352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-11-04DOI: 10.1109/TPAMI.2024.3490619
Xu Zheng, Peng Yuan Zhou, Athanasios V Vasilakos, Lin Wang
In this paper, we address the challenging source-free unsupervised domain adaptation (SFUDA) for pinhole-to-panoramic semantic segmentation, given only a pinhole image pre-trained model (i.e., source) and unlabeled panoramic images (i.e., target). Tackling this problem is non-trivial due to three critical challenges: 1) semantic mismatches from the distinct Field-of-View (FoV) between domains, 2) style discrepancies inherent in the UDA problem, and 3) inevitable distortion of the panoramic images. To tackle these problems, we propose 360SFUDA++ that effectively extracts knowledge from the source pinhole model with only unlabeled panoramic images and transfers the reliable knowledge to the target panoramic domain. Specifically, we first utilize Tangent Projection (TP) as it has less distortion and meanwhile slits the equirectangular projection (ERP) to patches with fixed FoV projection (FFP) to mimic the pinhole images. Both projections are shown effective in extracting knowledge from the source model. However, as the distinct projections make it less possible to directly transfer knowledge between domains, we then propose Reliable Panoramic Prototype Adaptation Module (RP 2 AM) to transfer knowledge at both prediction and prototype levels. RP 2 AM selects the confident knowledge and integrates panoramic prototypes for reliable knowledge adaptation. Moreover, we introduce Cross-projection Dual Attention Module (CDAM), which better aligns the spatial and channel characteristics across projections at the feature level between domains. Both knowledge extraction and transfer processes are synchronously updated to reach the best performance. Extensive experiments on the synthetic and real-world benchmarks, including outdoor and indoor scenarios, demonstrate that our 360SFUDA++ achieves significantly better performance than prior SFUDA methods. Project Page.
{"title":"360SFUDA++: Towards Source-Free UDA for Panoramic Segmentation by Learning Reliable Category Prototypes.","authors":"Xu Zheng, Peng Yuan Zhou, Athanasios V Vasilakos, Lin Wang","doi":"10.1109/TPAMI.2024.3490619","DOIUrl":"https://doi.org/10.1109/TPAMI.2024.3490619","url":null,"abstract":"<p><p>In this paper, we address the challenging source-free unsupervised domain adaptation (SFUDA) for pinhole-to-panoramic semantic segmentation, given only a pinhole image pre-trained model (i.e., source) and unlabeled panoramic images (i.e., target). Tackling this problem is non-trivial due to three critical challenges: 1) semantic mismatches from the distinct Field-of-View (FoV) between domains, 2) style discrepancies inherent in the UDA problem, and 3) inevitable distortion of the panoramic images. To tackle these problems, we propose 360SFUDA++ that effectively extracts knowledge from the source pinhole model with only unlabeled panoramic images and transfers the reliable knowledge to the target panoramic domain. Specifically, we first utilize Tangent Projection (TP) as it has less distortion and meanwhile slits the equirectangular projection (ERP) to patches with fixed FoV projection (FFP) to mimic the pinhole images. Both projections are shown effective in extracting knowledge from the source model. However, as the distinct projections make it less possible to directly transfer knowledge between domains, we then propose Reliable Panoramic Prototype Adaptation Module (RP <sup>2</sup> AM) to transfer knowledge at both prediction and prototype levels. RP <sup>2</sup> AM selects the confident knowledge and integrates panoramic prototypes for reliable knowledge adaptation. Moreover, we introduce Cross-projection Dual Attention Module (CDAM), which better aligns the spatial and channel characteristics across projections at the feature level between domains. Both knowledge extraction and transfer processes are synchronously updated to reach the best performance. Extensive experiments on the synthetic and real-world benchmarks, including outdoor and indoor scenarios, demonstrate that our 360SFUDA++ achieves significantly better performance than prior SFUDA methods. Project Page.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142577336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-11-01DOI: 10.1109/TPAMI.2024.3489597
Rui Zhang, Xingbo Du, Junchi Yan, Shihua Zhang
The Concept Bottleneck Model (CBM) is an interpretable neural network that leverages high-level concepts to explain model decisions and conduct human-machine interaction. However, in real-world scenarios, the deficiency of informative concepts can impede the model's interpretability and subsequent interventions. This paper proves that insufficient concept information can lead to an inherent dilemma of concept and label distortions in CBM. To address this challenge, we propose the Decoupling Concept Bottleneck Model (DCBM), which comprises two phases: 1) DCBM for prediction and interpretation, which decouples heterogeneous information into explicit and implicit concepts while maintaining high label and concept accuracy, and 2) DCBM for human-machine interaction, which automatically corrects labels and traces wrong concepts via mutual information estimation. The construction of the interaction system can be formulated as a light min-max optimization problem. Extensive experiments expose the success of alleviating concept/label distortions, especially when concepts are insufficient. In particular, we propose the Concept Contribution Score (CCS) to quantify the interpretability of DCBM. Numerical results demonstrate that CCS can be guaranteed by the Jensen-Shannon divergence constraint in DCBM. Moreover, DCBM expresses two effective human-machine interactions, including forward intervention and backward rectification, to further promote concept/label accuracy via interaction with human experts.
{"title":"The Decoupling Concept Bottleneck Model.","authors":"Rui Zhang, Xingbo Du, Junchi Yan, Shihua Zhang","doi":"10.1109/TPAMI.2024.3489597","DOIUrl":"10.1109/TPAMI.2024.3489597","url":null,"abstract":"<p><p>The Concept Bottleneck Model (CBM) is an interpretable neural network that leverages high-level concepts to explain model decisions and conduct human-machine interaction. However, in real-world scenarios, the deficiency of informative concepts can impede the model's interpretability and subsequent interventions. This paper proves that insufficient concept information can lead to an inherent dilemma of concept and label distortions in CBM. To address this challenge, we propose the Decoupling Concept Bottleneck Model (DCBM), which comprises two phases: 1) DCBM for prediction and interpretation, which decouples heterogeneous information into explicit and implicit concepts while maintaining high label and concept accuracy, and 2) DCBM for human-machine interaction, which automatically corrects labels and traces wrong concepts via mutual information estimation. The construction of the interaction system can be formulated as a light min-max optimization problem. Extensive experiments expose the success of alleviating concept/label distortions, especially when concepts are insufficient. In particular, we propose the Concept Contribution Score (CCS) to quantify the interpretability of DCBM. Numerical results demonstrate that CCS can be guaranteed by the Jensen-Shannon divergence constraint in DCBM. Moreover, DCBM expresses two effective human-machine interactions, including forward intervention and backward rectification, to further promote concept/label accuracy via interaction with human experts.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142562962","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-10-31DOI: 10.1109/TPAMI.2024.3489217
Shilin Gu, Chao Xu, Dewen Hu, Chenping Hou
Applying current machine learning algorithms in complex and open environments remains challenging, especially when different changing elements are coupled and the training data is scarce. For example, in the activity recognition task, the motion sensors may change position or fall off due to the intensity of the activity, leading to changes in feature space and finally resulting in label noise. Learning from such a problem where the dynamic features are coupled with noisy labels is crucial but rarely studied, particularly when the noisy samples in new feature space are limited. In this paper, we tackle the above problem by proposing a novel two-stage algorithm, called Adaptive Learning for Dynamic features and Noisy labels (ALDN). Specifically, optimal transport is firstly modified to map the previously learned heterogeneous model to the prior model of the current stage. Then, to fully reuse the mapped prior model, we add a simple yet efficient regularizer as the consistency constraint to assist both the estimation of the noise transition matrix and the model training in the current stage. Finally, two implementations with direct (ALDN-D) and indirect (ALDN-ID) constraints are illustrated for better investigation. More importantly, we provide theoretical guarantees for risk minimization of ALDN-D and ALDN-ID. Extensive experiments validate the effectiveness of the proposed algorithms.
{"title":"Adaptive Learning for Dynamic Features and Noisy Labels.","authors":"Shilin Gu, Chao Xu, Dewen Hu, Chenping Hou","doi":"10.1109/TPAMI.2024.3489217","DOIUrl":"10.1109/TPAMI.2024.3489217","url":null,"abstract":"<p><p>Applying current machine learning algorithms in complex and open environments remains challenging, especially when different changing elements are coupled and the training data is scarce. For example, in the activity recognition task, the motion sensors may change position or fall off due to the intensity of the activity, leading to changes in feature space and finally resulting in label noise. Learning from such a problem where the dynamic features are coupled with noisy labels is crucial but rarely studied, particularly when the noisy samples in new feature space are limited. In this paper, we tackle the above problem by proposing a novel two-stage algorithm, called Adaptive Learning for Dynamic features and Noisy labels (ALDN). Specifically, optimal transport is firstly modified to map the previously learned heterogeneous model to the prior model of the current stage. Then, to fully reuse the mapped prior model, we add a simple yet efficient regularizer as the consistency constraint to assist both the estimation of the noise transition matrix and the model training in the current stage. Finally, two implementations with direct (ALDN-D) and indirect (ALDN-ID) constraints are illustrated for better investigation. More importantly, we provide theoretical guarantees for risk minimization of ALDN-D and ALDN-ID. Extensive experiments validate the effectiveness of the proposed algorithms.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142559851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-10-31DOI: 10.1109/TPAMI.2024.3489030
Eduardo Fernandes Montesuma, Fred Maurice Ngole Mboula, Antoine Souloumiac
Recently, Optimal Transport has been proposed as a probabilistic framework in Machine Learning for comparing and manipulating probability distributions. This is rooted in its rich history and theory, and has offered new solutions to different problems in machine learning, such as generative modeling and transfer learning. In this survey we explore contributions of Optimal Transport for Machine Learning over the period 2012 - 2023, focusing on four sub-fields of Machine Learning: supervised, unsupervised, transfer and reinforcement learning. We further highlight the recent development in computational Optimal Transport and its extensions, such as partial, unbalanced, Gromov and Neural Optimal Transport, and its interplay with Machine Learning practice.
{"title":"Recent Advances in Optimal Transport for Machine Learning.","authors":"Eduardo Fernandes Montesuma, Fred Maurice Ngole Mboula, Antoine Souloumiac","doi":"10.1109/TPAMI.2024.3489030","DOIUrl":"10.1109/TPAMI.2024.3489030","url":null,"abstract":"<p><p>Recently, Optimal Transport has been proposed as a probabilistic framework in Machine Learning for comparing and manipulating probability distributions. This is rooted in its rich history and theory, and has offered new solutions to different problems in machine learning, such as generative modeling and transfer learning. In this survey we explore contributions of Optimal Transport for Machine Learning over the period 2012 - 2023, focusing on four sub-fields of Machine Learning: supervised, unsupervised, transfer and reinforcement learning. We further highlight the recent development in computational Optimal Transport and its extensions, such as partial, unbalanced, Gromov and Neural Optimal Transport, and its interplay with Machine Learning practice.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142559853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Modeling count data using suitable statistical distributions has been instrumental for analyzing the patterns it conveys. However, failing to address critical aspects, like overdispersion, jeopardizes the effectiveness of such an analysis. In this paper, overdispersed count data is modeled using the Dirichlet Multinomial (DM) distribution by maximizing its likelihood using a fixed-point iteration algorithm. This is achieved by estimating the DM distribution parameters while comparing the recent Languasco-Migliardi (LM), and the Yu-Shaw (YS) procedures, which address the well-known computational difficulties of evaluating its log-likelihood. Experiments were conducted using multiple datasets from different domains spanning polls, images, and IoT network traffic. They all showed the superiority of the LM procedure as it succeeded at estimating the DM parameters at the designated level of accuracy in all experiments, while the YS procedure failed to produce sufficiently accurate results (or any results at all) in several experiments. Moreover, the LM procedure achieved a speedup that ranged from 2-fold to 20-fold over YS.
{"title":"Efficient Analysis of Overdispersed Data Using an Accurate Computation of the Dirichlet Multinomial Distribution.","authors":"Sherenaz Al-Haj Baddar, Alessandro Languasco, Mauro Migliardi","doi":"10.1109/TPAMI.2024.3489645","DOIUrl":"10.1109/TPAMI.2024.3489645","url":null,"abstract":"<p><p>Modeling count data using suitable statistical distributions has been instrumental for analyzing the patterns it conveys. However, failing to address critical aspects, like overdispersion, jeopardizes the effectiveness of such an analysis. In this paper, overdispersed count data is modeled using the Dirichlet Multinomial (DM) distribution by maximizing its likelihood using a fixed-point iteration algorithm. This is achieved by estimating the DM distribution parameters while comparing the recent Languasco-Migliardi (LM), and the Yu-Shaw (YS) procedures, which address the well-known computational difficulties of evaluating its log-likelihood. Experiments were conducted using multiple datasets from different domains spanning polls, images, and IoT network traffic. They all showed the superiority of the LM procedure as it succeeded at estimating the DM parameters at the designated level of accuracy in all experiments, while the YS procedure failed to produce sufficiently accurate results (or any results at all) in several experiments. Moreover, the LM procedure achieved a speedup that ranged from 2-fold to 20-fold over YS.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142559852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}