Pub Date: 2026-01-12 | DOI: 10.1109/TIP.2025.3650664
{"title":"Reviewer Summary for Transactions on Image Processing","authors":"","doi":"10.1109/TIP.2025.3650664","DOIUrl":"10.1109/TIP.2025.3650664","url":null,"abstract":"","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8684-8708"},"PeriodicalIF":13.7,"publicationDate":"2026-01-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11346802","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145955219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-02 | DOI: 10.1109/TIP.2025.3648203
Hao Jing, Anhong Wang, Yifan Zhang, Donghan Bu, Junhui Hou
In intelligent transportation systems, low-bitrate transmission via lossy point cloud compression is vital for real-time collaborative perception among connected agents, such as vehicles and infrastructure, under restricted bandwidth. In existing compression-transmission systems, the sender lossily compresses point coordinates and reflectance into a transmission code stream; this incurs a transmission burden from reflectance encoding and limits detection robustness due to information loss. To address these issues, this paper proposes a 3D object detection framework with reflectance prediction-based knowledge distillation (RPKD). We compress point coordinates while discarding reflectance during low-bitrate transmission, and feed the decoded, reflectance-free compressed point clouds into a student detector. The discarded reflectance is then reconstructed by a geometry-based reflectance prediction (RP) module within the student detector for precise detection. A teacher detector with the same structure as the student detector performs reflectance knowledge distillation (RKD) and detection knowledge distillation (DKD) from raw to compressed point clouds. Our cross-source distillation training strategy (CDTS) equips the student detector with robustness to low-quality compressed data while preserving the accuracy benefits of raw data through the transferred distillation knowledge. Experimental results on the KITTI and DAIR-V2X-V datasets demonstrate that our method boosts detection accuracy for compressed point clouds across multiple code rates. We will release the code publicly at https://github.com/HaoJing-SX/RPKD.
{"title":"Reflectance Prediction-based Knowledge Distillation for Robust 3D Object Detection in Compressed Point Clouds.","authors":"Hao Jing, Anhong Wang, Yifan Zhang, Donghan Bu, Junhui Hou","doi":"10.1109/TIP.2025.3648203","DOIUrl":"https://doi.org/10.1109/TIP.2025.3648203","url":null,"abstract":"<p><p>Regarding intelligent transportation systems, low-bitrate transmission via lossy point cloud compression is vital for facilitating real-time collaborative perception among connected agents, such as vehicles and infrastructures, under restricted bandwidth. In existing compression transmission systems, the sender lossily compresses point coordinates and reflectance to generate a transmission code stream, which faces transmission burdens from reflectance encoding and limited detection robustness due to information loss. To address these issues, this paper proposes a 3D object detection framework with reflectance prediction-based knowledge distillation (RPKD). We compress point coordinates while discarding reflectance during low-bitrate transmission, and feed the decoded non-reflectance compressed point clouds into a student detector. The discarded reflectance is then reconstructed by a geometry-based reflectance prediction (RP) module within the student detector for precise detection. A teacher detector with the same structure as the student detector is designed for performing reflectance knowledge distillation (RKD) and detection knowledge distillation (DKD) from raw to compressed point clouds. Our cross-source distillation training strategy (CDTS) equips the student detector with robustness to low-quality compressed data while preserving the accuracy benefits of raw data through transferred distillation knowledge. Experimental results on the KITTI and DAIR-V2X-V datasets demonstrate that our method can boost detection accuracy for compressed point clouds across multiple code rates. We will release the code publicly at https://github.com/HaoJing-SX/RPKD.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":""},"PeriodicalIF":13.7,"publicationDate":"2026-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145893537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-01 | DOI: 10.1109/TIP.2025.3647367
Xiaowei Fu;Shiyu Ye;Chenxu Zhang;Fuxiang Huang;Xin Xu;Lei Zhang
Unsupervised Domain Adaptation (UDA) aims to transfer knowledge from a labeled source domain to an unlabeled target domain by learning domain-invariant representations. Motivated by the recent success of Vision Transformers (ViTs), several UDA approaches have adopted ViT architectures to exploit fine-grained patch-level representations; we refer to these collectively as Transformer-based Domain Adaptation (TransDA), as distinct from CNN-based approaches. However, we make a key observation in TransDA: due to inherent domain shifts, patches (tokens) from different semantic categories across domains may exhibit abnormally high similarities, which can mislead the self-attention mechanism and degrade adaptation performance. To solve this, we propose a novel Patch-Adaptation Transformer (PATrans), which first identifies similarity-anomalous patches and then adaptively suppresses their negative impact on domain alignment, i.e., token calibration. Specifically, we introduce a Patch-Adaptation Attention (PAA) mechanism to replace standard self-attention, consisting of a weight-shared triple-branch mixed attention mechanism and a patch-level domain discriminator. The mixed attention integrates self-attention and cross-attention to enhance intra-domain feature modeling and inter-domain similarity estimation. Meanwhile, the patch-level domain discriminator quantifies the anomaly probability of each patch, enabling dynamic reweighting to mitigate the impact of unreliable patch correspondences. Furthermore, we introduce a contrastive attention regularization strategy, which leverages category-level information in a contrastive learning framework to promote class-consistent attention distributions. Extensive experiments on four benchmark datasets demonstrate that PATrans attains significant improvements over existing state-of-the-art UDA methods (e.g., 89.2% on VisDA-2017). Code is available at: https://github.com/YSY145/PATrans
Token Calibration for Transformer-Based Domain Adaptation. IEEE Transactions on Image Processing, vol. 35, pp. 57-68.
Pub Date: 2026-01-01 | DOI: 10.1109/TIP.2025.3647207
Yang Xu;Jian Zhu;Danfeng Hong;Zhihui Wei;Zebin Wu
Hyperspectral image (HSI) and multispectral image (MSI) fusion is an active topic in the remote sensing community. A high-resolution HSI (HR-HSI) can be obtained by fusing a low-resolution HSI (LR-HSI) with a high-resolution MSI (HR-MSI) or RGB image. However, most deep learning-based methods require a large number of HR-HSIs for supervised training, which are rarely available in practice. In this paper, we propose a coupled diffusion posterior sampling (CDPS) method for HSI and MSI fusion in which HR-HSIs are no longer required during training. Because the LR-HSI contains the spectral information and the HR-MSI contains the spatial information of the captured scene, we design an unsupervised strategy that learns the required diffusion priors directly and solely from the input test image pair (the LR-HSI and HR-MSI themselves). Then, a coupled diffusion posterior sampling procedure introduces the two priors into diffusion posterior sampling, leveraging the observed LR-HSI and HR-MSI as fidelity terms. Experimental results demonstrate that the proposed method outperforms other state-of-the-art unsupervised HSI and MSI fusion methods. Additionally, the method uses smaller networks that are simpler and easier to train without additional data.
{"title":"Coupled Diffusion Posterior Sampling for Unsupervised Hyperspectral and Multispectral Images Fusion","authors":"Yang Xu;Jian Zhu;Danfeng Hong;Zhihui Wei;Zebin Wu","doi":"10.1109/TIP.2025.3647207","DOIUrl":"10.1109/TIP.2025.3647207","url":null,"abstract":"Hyperspectral images (HSIs) and multispectral images (MSIs) fusion is a hot topic in the remote sensing society. A high-resolution HSI (HR-HSI) can be obtained by fusing a low-resolution HSI (LR-HSI) and a high-resolution MSI (HR-MSI) or RGB image. However, most deep learning-based methods require a large amount of HR-HSIs for supervised training, which is very rare in practice. In this paper, we propose a coupled diffusion posterior sampling (CDPS) method for HSI and MSI fusion in which the HR-HSIs are no longer required in the training process. Because the LR-HSI contains the spectral information and HR-MSI contains the spatial information of the captured scene, we design an unsupervised strategy that learns the required diffusion priors directly and solely from the input test image pair (the LR-HSI and HR-MSI themselves). Then, a coupled diffusion posterior sampling method is proposed to introduce the two priors in the diffusion posterior sampling which leverages the observed LR-HSI and HR-MSI as fidelity terms. Experimental results demonstrate that the proposed method outperforms other state-of-the-art unsupervised HSI and MSI fusion methods. Additionally, this method utilizes smaller networks that are simpler and easier to train without other data.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"69-84"},"PeriodicalIF":13.7,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145890776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-01 | DOI: 10.1109/TIP.2025.3648141
Hongning Ruan;Yulin Shao;Qianqian Yang;Liang Zhao;Zhaoyang Zhang;Dusit Niyato
Point clouds have gained prominence across numerous applications due to their ability to accurately represent 3D objects and scenes. However, efficiently compressing unstructured, high-precision point cloud data remains a significant challenge. In this paper, we propose NeRC3, a novel point cloud compression framework that leverages implicit neural representations (INRs) to encode both the geometry and the attributes of dense point clouds. Our approach employs two coordinate-based neural networks: one maps spatial coordinates to voxel occupancy, while the other maps occupied voxels to their attributes, thereby implicitly representing the geometry and attributes of a voxelized point cloud. The encoder quantizes and compresses the network parameters alongside auxiliary information required for reconstruction, while the decoder reconstructs the original point cloud by feeding voxel coordinates into the neural networks. Furthermore, we extend our method to dynamic point cloud compression through techniques that reduce temporal redundancy, including a 4D spatio-temporal representation termed 4D-NeRC3. Experimental results validate the effectiveness of our approach: for static point clouds, NeRC3 outperforms the octree-based G-PCC standard and existing INR-based methods; for dynamic point clouds, 4D-NeRC3 achieves superior geometry compression performance compared to the latest G-PCC and V-PCC standards while matching state-of-the-art learning-based methods. It also demonstrates competitive performance in joint geometry and attribute compression.
Implicit Neural Compression of Point Clouds. IEEE Transactions on Image Processing, early access.
Pub Date: 2026-01-01 | DOI: 10.1109/TIP.2025.3647323
Meng Yu;Liquan Shen;Yihan Yu;Yu Zhang;Rui Le
Underwater image enhancement (UIE) is crucial for robust marine exploration, yet existing methods prioritize perceptual quality while overlooking irreversible semantic corruption that impairs downstream tasks. Unlike terrestrial images, underwater semantics exhibit layer-specific degradations: shallow features suffer from color shifts and edge erosion, while deep features face semantic ambiguity. These distortions entangle with semantic content across feature hierarchies, where direct enhancement amplifies interference in downstream tasks. Even if distortions are removed, the damaged semantic structures cannot be fully recovered, making it imperative to further enhance corrupted content. To address these challenges, we propose a task-driven UIE framework that redefines enhancement as machine-interpretable semantic recovery rather than mere distortion removal. First, we introduce a multi-scale underwater distortion-aware generator to perceive distortions across feature levels and provide a prior for distortion removal. Second, leveraging this prior and the absence of clean underwater references, we propose a stable self-supervised disentanglement strategy to explicitly separate distortions from corrupted content through CLIP-based semantic constraints and identity consistency. Finally, to compensate for the irreversible semantic loss, we design a task-aware hierarchical enhancement module that refines shallow details via spatial-frequency fusion and strengthens deep semantics through multi-scale context aggregation, aligning results with machine vision requirements. Extensive experiments on segmentation, detection, and saliency tasks demonstrate the superiority of our method in restoring machine-friendly semantics from degraded underwater images. Our code is available at https://github.com/gemyumeng/HSRUIE
{"title":"Task-Driven Underwater Image Enhancement via Hierarchical Semantic Refinement","authors":"Meng Yu;Liquan Shen;Yihan Yu;Yu Zhang;Rui Le","doi":"10.1109/TIP.2025.3647323","DOIUrl":"10.1109/TIP.2025.3647323","url":null,"abstract":"Underwater image enhancement (UIE) is crucial for robust marine exploration, yet existing methods prioritize perceptual quality while overlooking irreversible semantic corruption that impairs downstream tasks. Unlike terrestrial images, underwater semantics exhibit layer-specific degradations: shallow features suffer from color shifts and edge erosion, while deep features face semantic ambiguity. These distortions entangle with semantic content across feature hierarchies, where direct enhancement amplifies interference in downstream tasks. Even if distortions are removed, the damaged semantic structures cannot be fully recovered, making it imperative to further enhance corrupted content. To address these challenges, we propose a task-driven UIE framework that redefines enhancement as machine-interpretable semantic recovery rather than mere distortion removal. First, we introduce a multi-scale underwater distortion-aware generator to perceive distortions across feature levels and provide a prior for distortion removal. Second, leveraging this prior and the absence of clean underwater references, we propose a stable self-supervised disentanglement strategy to explicitly separate distortions from corrupted content through CLIP-based semantic constraints and identity consistency. Finally, to compensate for the irreversible semantic loss, we design a task-aware hierarchical enhancement module that refines shallow details via spatial-frequency fusion and strengthens deep semantics through multi-scale context aggregation, aligning results with machine vision requirements. Extensive experiments on segmentation, detection, and saliency tasks demonstrate the superiority of our method in restoring machine-friendly semantics from degraded underwater images. Our code is available at <uri>https://github.com/gemyumeng/HSRUIE</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"42-56"},"PeriodicalIF":13.7,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145890756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-31 | DOI: 10.1109/TIP.2025.3646889
Tao Yan;Yingying Wang;Yuhua Qian;Jiangfeng Zhang;Feijiang Li;Peng Wu;Lu Chen;Jieru Jia;Xiaoying Guo
Microscopic 3D shape reconstruction using depth from focus (DFF) is crucial in precision manufacturing for 3D modeling and quality control. However, the absence of high-precision microscopic DFF datasets and the significant differences between existing DFF datasets and microscopic DFF data in optical design, imaging principles and scene characteristics hinder the performance of current DFF models in microscopic tasks. To address this, we introduce M3D, a novel microscopic DFF dataset, constructed using a self-developed microscopic device. It includes multi-focus image sequences of 1,952 scenes across five categories, with depth labels obtained through the 3D TFT algorithm applied to dense image sequences for initial depth estimation and calibration. All labels are then compared and analyzed against the design values, and those with large errors are eliminated. We also propose M3DNet, a frequency-aware end-to-end network, to tackle challenges like shallow depth-of-field (DoF) and weak textures. Results show that M3D compensates for the limitations of macroscopic DFF datasets and extends DFF applications to microscopic scenarios. M3DNet effectively captures rapid focus decay and improves performance on public DFF datasets by leveraging superior global feature extraction. Additionally, it exhibits strong robustness even in extreme conditions. Dataset and code are available at https://github.com/jiangfeng-Z/M3D
{"title":"M3D: A Benchmark Dataset and Model for Microscopic 3D Shape Reconstruction","authors":"Tao Yan;Yingying Wang;Yuhua Qian;Jiangfeng Zhang;Feijiang Li;Peng Wu;Lu Chen;Jieru Jia;Xiaoying Guo","doi":"10.1109/TIP.2025.3646889","DOIUrl":"10.1109/TIP.2025.3646889","url":null,"abstract":"Microscopic 3D shape reconstruction using depth from focus (DFF) is crucial in precision manufacturing for 3D modeling and quality control. However, the absence of high-precision microscopic DFF datasets and the significant differences between existing DFF datasets and microscopic DFF data in optical design, imaging principles and scene characteristics hinder the performance of current DFF models in microscopic tasks. To address this, we introduce M3D, a novel microscopic DFF dataset, constructed using a self-developed microscopic device. It includes multi-focus image sequences of 1,952 scenes across five categories, with depth labels obtained through the 3D TFT algorithm applied to dense image sequences for initial depth estimation and calibration. All labels are then compared and analyzed against the design values, and those with large errors are eliminated. We also propose M3DNet, a frequency-aware end-to-end network, to tackle challenges like shallow depth-of-field (DoF) and weak textures. Results show that M3D compensates for the limitations of macroscopic DFF datasets and extends DFF applications to microscopic scenarios. M3DNet effectively captures rapid focus decay and improves performance on public DFF datasets by leveraging superior global feature extraction. Additionally, it exhibits strong robustness even in extreme conditions. Dataset and code are available at <uri>https://github.com/jiangfeng-Z/M3D</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"181-193"},"PeriodicalIF":13.7,"publicationDate":"2025-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145879783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-31 | DOI: 10.1109/TIP.2025.3646890
Guoqing Wang;Chao Ma;Xiaokang Yang
Existing methods for learning 3D point cloud representations often follow a single dataset-specific training and testing protocol, leading to performance drops under the significant domain shifts between training and testing data. While recent cross-domain methods have made promising progress, the lack of inherent semantic information in point clouds makes models prone to overfitting specific datasets. To this end, we introduce 3D-CFA, a simple yet effective cross-modality feature aggregation method for cross-domain 3D point cloud representation learning. 3D-CFA aggregates geometry tokens with semantic tokens derived from multi-view images projected from the point cloud, thus generating more transferable features for cross-domain 3D point cloud representation learning. Specifically, 3D-CFA consists of two main components: a cross-modality feature aggregation module and an elastic domain alignment module. The cross-modality feature aggregation module converts unordered points into multi-view images using a modality transformation module. Then, the geometry tokens and semantic tokens extracted from the geometry encoder and semantic encoder are fed into the cross-modal projector to obtain transferable 3D tokens. A key insight of this design is that the semantic tokens can serve as a bridge between the 3D point cloud model and the 2D foundation model, greatly promoting the generalization of cross-domain models under severe domain shift. Finally, the elastic domain alignment module learns hierarchical domain-invariant features across training domains for either domain adaptation or domain generalization protocols. 3D-CFA offers a better way to transfer the knowledge of a 2D foundation model pre-trained at scale while introducing only a few extra trainable parameters. Comprehensive experiments on several cross-domain point cloud benchmarks demonstrate the effectiveness of the proposed method compared to state-of-the-art methods.
{"title":"Cross-Modality Feature Aggregation for Cross-Domain Point Cloud Representation Learning","authors":"Guoqing Wang;Chao Ma;Xiaokang Yang","doi":"10.1109/TIP.2025.3646890","DOIUrl":"10.1109/TIP.2025.3646890","url":null,"abstract":"Existing methods for learning 3D point cloud representation often use a single dataset-specific training and testing approach, leading to performance drops due to significant domain shifts between training and testing data. While recent cross-domain methods have made promising progress, the lack of inherent semantic information in point clouds makes models prone to overfitting specific datasets. As such, we introduce 3D-CFA, a simple yet effective cross-modality feature aggregation method for cross-domain 3D point cloud representation learning. 3D-CFA aggregates the geometry tokens with semantic tokens derived from multi-view images, which are projected from the point cloud, thus generating more transferable features for cross-domain 3D point cloud representation learning. Specifically, 3D-CFA consists of two main components: a cross-modality feature aggregation module and an elastic domain alignment module. The cross-modality feature aggregation module converts unordered points into multi-view images using the modality transformation module. Then, the geometry tokens and semantic tokens extracted from the geometry encoder and semantic encoder are fed into the cross-modal projector to get the transferable 3D tokens. A key insight of this design is that the semantic tokens can serve as a bridge between the 3D point cloud model and the 2D foundation model, greatly promoting the generalization of cross-domain models facing the severe domain shift. Finally, the elastic domain alignment module learns the hierarchical domain-invariant features of different training domains for either domain adaptation or domain generalization protocols. 3D-CFA finds a better way to transfer the knowledge of the 2D foundation model pre-trained at scale, meanwhile only introducing a few extra trainable parameters. Comprehensive experiments on several cross-domain point cloud benchmarks demonstrate the effectiveness of the proposed method compared to the state-of-the-art methods.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"166-180"},"PeriodicalIF":13.7,"publicationDate":"2025-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145879817","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-30 | DOI: 10.1109/TIP.2025.3646861
Xicheng Ding;Xiaofan Li;Mingang Chen;Jingyu Gong;Yuan Xie
Industrial few-shot anomaly detection (FSAD) requires identifying various abnormal states by leveraging as few normal samples as possible (abnormal samples are unavailable during training). However, current methods often require training a separate model for each category, leading to increased computation and storage overhead. Thus, designing a unified anomaly detection model that supports multiple categories remains a challenging task, as such a model must recognize anomalous patterns across diverse objects and domains. To tackle these challenges, this paper introduces FocusPatch AD, a unified anomaly detection framework based on vision-language models that achieves anomaly detection under few-shot multi-class settings. FocusPatch AD links anomaly-state keywords to highly relevant discrete local regions within the image, guiding the model to focus on cross-category anomalies while filtering out background interference. This approach mitigates the false-detection issues caused by global semantic alignment in vision-language models. We evaluate the proposed method on the MVTec, VisA, and Real-IAD datasets, comparing it against several prevailing anomaly detection methods. In both image-level and pixel-level anomaly detection tasks, FocusPatch AD achieves significant gains in classification and localization performance, demonstrating excellent generalization and adaptability.
FocusPatch AD: Few-Shot Multi-Class Anomaly Detection With Unified Keywords Patch Prompts. IEEE Transactions on Image Processing, vol. 35, pp. 112-123.
Pub Date: 2025-12-30 | DOI: 10.1109/TIP.2025.3646940
Yangfan Li;Wei Li
Feature reconstruction networks have achieved remarkable performance in few-shot fine-grained classification tasks. Nonetheless, traditional feature reconstruction networks rely on linear regression; this linearity may cause the loss of subtle discriminative cues, ultimately resulting in less precise reconstructed features. Moreover, when the background predominantly occupies the image, background reconstruction errors tend to overshadow foreground reconstruction errors, yielding unreliable error estimates. To address these two key issues, a novel approach called the Foreground-Aware Kernelized Feature Reconstruction Network (FKFRN) is proposed. Specifically, to address the problem of imprecise reconstructed features, we introduce kernel methods into linear feature reconstruction, extending it to nonlinear feature reconstruction and thus enabling the reconstruction of richer, finer-grained discriminative features. To tackle the issue of inaccurate reconstruction errors, a foreground-aware reconstruction error is proposed: the model assigns higher weights to features containing more foreground information and lower weights to those dominated by background content, which reduces the impact of background errors on the overall reconstruction. To estimate these weights accurately, we design two complementary strategies: an explicit probabilistic graphical model and an implicit neural network-based approach. Extensive experimental results on eight datasets validate the effectiveness of the proposed approach for few-shot fine-grained classification.
{"title":"Few-Shot Fine-Grained Classification With Foreground-Aware Kernelized Feature Reconstruction Network","authors":"Yangfan Li;Wei Li","doi":"10.1109/TIP.2025.3646940","DOIUrl":"10.1109/TIP.2025.3646940","url":null,"abstract":"Feature reconstruction networks have achieved remarkable performance in few-shot fine-grained classification tasks. Nonetheless, traditional feature reconstruction networks rely on linear regression. This linearity may cause the loss of subtle discriminative cues, ultimately resulting in less precise reconstructed features. Moreover, in situations where the background predominantly occupies the image, the background reconstruction errors tend to overshadow foreground reconstruction errors, resulting in inaccurate reconstruction errors. In order to address the two key issues, a novel approach called the Foreground-Aware Kernelized Feature Reconstruction Network (FKFRN) is proposed. Specifically, to address the problem of imprecise reconstructed features, we introduce kernel methods into linear feature reconstruction, extending it to nonlinear feature reconstruction, thus enabling the reconstruction of richer, finer-grained discriminative features. To tackle the issue of inaccurate reconstruction errors, the foreground-aware reconstruction error is proposed. Specifically, the model assigns higher weights to features containing more foreground information and lower weights to those dominated by background content, which reduces the impact of background errors on the overall reconstruction. To estimate these weights accurately, we design two complementary strategies: an explicit probabilistic graphical model and an implicit neural network–based approach. Extensive experimental results on eight datasets validate the effectiveness of the proposed approach for few-shot fine-grained classification.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"150-165"},"PeriodicalIF":13.7,"publicationDate":"2025-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145866581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}