Pub Date: 2026-02-06 | DOI: 10.1109/TIP.2026.3653206
Nimrod Shabtay, Eli Schwartz, Raja Giryes
In Deep Image Prior (DIP), a Convolutional Neural Network (CNN) is fitted to map a latent space to a degraded (e.g., noisy) image, but in the process it learns to reconstruct the clean image. This phenomenon is attributed to the CNN's internal image prior. We revisit the DIP framework, examining it from the perspective of a neural implicit representation. Motivated by this perspective, we replace the random latent input with Fourier features (positional encoding). We empirically demonstrate that, thanks to the properties of Fourier features, the convolution layers in DIP can be replaced with simple pixel-level MLPs. We also prove that the two are equivalent in the case of linear networks. We name our scheme "Positional Encoding Image Prior" (PIP) and show that it performs very similarly to DIP on various image-reconstruction tasks with far fewer parameters. Furthermore, we demonstrate that PIP can be easily extended to videos, where methods based on image priors and certain INR approaches face stability challenges. Code and additional examples for all tasks, including videos, are available on the project page nimrodshabtay.github.io/PIP.
{"title":"Positional Encoding Image Prior.","authors":"Nimrod Shabtay, Eli Schwartz, Raja Giryes","doi":"10.1109/TIP.2026.3653206","DOIUrl":"https://doi.org/10.1109/TIP.2026.3653206","url":null,"abstract":"<p><p>In Deep Image Prior (DIP), a Convolutional Neural Network (CNN) is fitted to map a latent space to a degraded (e.g. noisy) image but in the process learns to reconstruct the clean image. This phenomenon is attributed to CNN's internal image prior. We revisit the DIP framework, examining it from the perspective of a neural implicit representation. Motivated by this perspective, we replace the random latent with Fourier-Features (Positional Encoding). We empirically demonstrate that the convolution layers in DIP can be replaced with simple pixel-level MLPs thanks to the Fourier features properties. We also prove that they are equivalent in the case of linear networks. We name our scheme \"Positional Encoding Image Prior\" (PIP) and exhibit that it performs very similar to DIP on various image-reconstruction tasks with much fewer parameters. Furthermore, we demonstrate that PIP can be easily extended to videos, an area where methods based on image-priors and certain INR approaches face challenges with stability. Code and additional examples for all tasks, including videos, are available on the project page nimrodshabtay.github.io/PIP.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":""},"PeriodicalIF":13.7,"publicationDate":"2026-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146133857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-04 | DOI: 10.1109/TIP.2026.3659325
Shuping Zhao;Lunke Fei;Tingting Chai;Jie Wen;Bob Zhang;Jinrong Cui
Unrestrained palmprint recognition is a comprehensive identity authentication technology that performs personal authentication based on palmprint images captured in uncontrolled environments, e.g., by smartphone cameras, in surveillance footage, or under near-infrared imaging. It faces significant challenges due to the variability in image quality, lighting conditions, and hand poses present in such settings. We observe that many existing methods utilize the subspace structure as a prior, for which the block-diagonal property of the data has been proven. In this paper, we propose a unified learning model that guarantees a consensus block-diagonal property across all views, named high-confident block diagonal analysis for multi-view palmprint recognition (HCBDA_MPR). In particular, we propose a multi-view block-diagonal regularizer that guides all views to learn a consensus block-diagonal structure. In this manner, the main discriminative features of each view are preserved while a strict block-diagonal structure is learned across all views. Experimental results on a number of real-world unrestrained palmprint databases demonstrate the superiority of the proposed method, which obtains the highest recognition accuracies in comparison with other state-of-the-art methods.
{"title":"High-Confident Block Diagonal Analysis for Multi-View Palmprint Recognition in Unrestrained Environment","authors":"Shuping Zhao;Lunke Fei;Tingting Chai;Jie Wen;Bob Zhang;Jinrong Cui","doi":"10.1109/TIP.2026.3659325","DOIUrl":"10.1109/TIP.2026.3659325","url":null,"abstract":"Unrestrained palmprint recognition refers to a comprehensive identity authentication technology, that performs personal authentication based on the palmprint images captured in uncontrolled environments, i.e., smartphone cameras, surveillance footage, or near-infrared scenarios. However, unrestrained palmprint recognition faces significant challenges due to the variability in image quality, lighting conditions, and hand poses present in such settings. We observed that many existing methods utilize the subspace structure as a prior, where the block diagonal property of the data has been proved. In this paper, we consider a unified learning model to guarantee the consensus block diagonal property for all views, named high-confident block diagonal analysis for multi-view palmprint recognition (HCBDA_MPR). Particularly, this paper proposed a multi-view block diagonal regularizer to guide that all views learn a consensus block diagonal structure. In such a manner, the main discriminant features from each view can be preserved while the learning of the strict block diagonal structure across all views. Experimental results on a number of real-world unrestrained palmprint databases proved the superiority of the proposed method, where the highest recognition accuracies were obtained in comparison with the other state-of-the-art related methods.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"1621-1635"},"PeriodicalIF":13.7,"publicationDate":"2026-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146115782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-04 | DOI: 10.1109/TIP.2026.3659337
Hanxi Li;Jingqi Wu;Deyin Liu;Lin Yuanbo Wu;Hao Chen;Chunhua Shen
Recent advancements in industrial anomaly detection (AD) have demonstrated that incorporating a small number of anomalous samples during training can significantly enhance accuracy. However, this improvement often comes at the cost of extensive annotation efforts, which are impractical for many real-world applications. In this paper, we introduce a novel framework, "Weakly-supervised RESidual Transformer" (WeakREST), designed to achieve high anomaly detection accuracy while minimizing the reliance on manual annotations. First, we reformulate the pixel-wise anomaly localization task into a block-wise classification problem. Second, we introduce a residual-based feature representation called "Positional Fast Anomaly Residuals" (PosFAR), which captures anomalous patterns more effectively. To leverage this feature, we adapt the Swin Transformer for enhanced anomaly detection and localization. Additionally, we propose a weak annotation approach utilizing bounding boxes and image tags to define anomalous regions. This approach establishes a semi-supervised learning context that reduces the dependency on precise pixel-level labels. To further improve the learning process, we develop a novel ResMixMatch algorithm, capable of handling the interplay between weak labels and residual-based representations. On the benchmark dataset MVTec-AD, our method achieves an Average Precision (AP) of 83.0%, surpassing the previous best result of 82.7% in the unsupervised setting. In the supervised AD setting, WeakREST attains an AP of 87.6%, outperforming the previous best of 86.0%. Notably, even when using weaker annotations such as bounding boxes, WeakREST exceeds the performance of leading methods relying on pixel-wise supervision, achieving an AP of 87.1% compared to the prior best of 86.0% on MVTec-AD. This superior performance is consistently replicated across other well-established AD datasets, including MVTec 3D, KSDD2 and Real-IAD. Code is available at: https://github.com/BeJane/Semi_REST
{"title":"Accurate Industrial Anomaly Detection and Localization Using Weakly-Supervised Residual Transformers","authors":"Hanxi Li;Jingqi Wu;Deyin Liu;Lin Yuanbo Wu;Hao Chen;Chunhua Shen","doi":"10.1109/TIP.2026.3659337","DOIUrl":"10.1109/TIP.2026.3659337","url":null,"abstract":"Recent advancements in industrial anomaly detection (AD) have demonstrated that incorporating a small number of anomalous samples during training can significantly enhance accuracy. However, this improvement often comes at the cost of extensive annotation efforts, which are impractical for many real-world applications. In this paper, we introduce a novel framework, “Weakly-supervised RESidual <inline-formula> <tex-math>$T$ </tex-math></inline-formula>ransformer” (WeakREST), designed to achieve high anomaly detection accuracy while minimizing the reliance on manual annotations. First, we reformulate the pixel-wise anomaly localization task into a block-wise classification problem. Second, we introduce a residual-based feature representation called “Positional <inline-formula> <tex-math>$F$ </tex-math></inline-formula>ast <inline-formula> <tex-math>$A$ </tex-math></inline-formula>nomaly <inline-formula> <tex-math>$R$ </tex-math></inline-formula>esiduals” (PosFAR) which captures anomalous patterns more effectively. To leverage this feature, we adapt the Swin Transformer for enhanced anomaly detection and localization. Additionally, we propose a weak annotation approach utilizing bounding boxes and image tags to define anomalous regions. This approach establishes a semi-supervised learning context that reduces the dependency on precise pixel-level labels. To further improve the learning process, we develop a novel ResMixMatch algorithm, capable of handling the interplay between weak labels and residual-based representations. On the benchmark dataset MVTec-AD, our method achieves an Average Precision (AP) of 83.0%, surpassing the previous best result of 82.7% in the unsupervised setting. In the supervised AD setting, WeakREST attains an AP of 87.6%, outperforming the previous best of 86.0%. Notably, even when using weaker annotations such as bounding boxes, WeakREST exceeds the performance of leading methods relying on pixel-wise supervision, achieving an AP of 87.1% compared to the prior best of 86.0% on MVTec-AD. This superior performance is consistently replicated across other well-established AD datasets, including MVTec 3D, KSDD2 and Real-IAD. Code is available at: <uri>https://github.com/BeJane/Semi_REST</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"1551-1566"},"PeriodicalIF":13.7,"publicationDate":"2026-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146115777","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-04 | DOI: 10.1109/TIP.2026.3659292
Wenjing Zhang;Ke Lu;Jinbao Wang;Hao Liang;Can Gao;Jian Xue
Ultrasonic image anomaly detection faces significant challenges due to limited labeled data, strong structural and random noise, and highly diverse defect manifestations. To overcome these obstacles, we introduce UltraChip, a new large-scale C-scan benchmark containing about 8,000 real-world images from various chip packaging types, each meticulously annotated with pixel-level masks for cracks, holes, and layers. Building on this resource, we present FSGM-Net, a fully unsupervised framework tailored for anomaly detection. FSGM-Net leverages an adaptive Frequency-Spatial feature filtering mechanism: a learnable FFT-Spatial patch filter first suppresses noise and dynamically assigns normality weights to Vision Transformer (ViT) patch features. Subsequently, an Adaptive Gaussian Mixture Model (Ada-GMM) captures the distribution of normal features and guides a deep-shallow multi-scale interaction decoder for accurate, pixel-level anomaly inference. In addition, we propose a filter loss that enforces encoder-filter consistency and entropy-based sparse gating, together with a distributional loss that encourages both feature reconstruction and confident Gaussian mixture modeling. Extensive experiments demonstrate that FSGM-Net not only achieves state-of-the-art results on UltraChip but also exhibits superior cross-domain generalization to MVTec-AD and VisA, while supporting real-time inference on a single GPU. Together, the dataset and framework advance robust, annotation-free ultrasonic NDT in practical applications. The UltraChip dataset can be obtained via https://iiplab.net/ultrachip/
{"title":"Improving Unsupervised Ultrasonic Image Anomaly Detection via Frequency-Spatial Feature Filtering and Gaussian Mixture Modeling","authors":"Wenjing Zhang;Ke Lu;Jinbao Wang;Hao Liang;Can Gao;Jian Xue","doi":"10.1109/TIP.2026.3659292","DOIUrl":"10.1109/TIP.2026.3659292","url":null,"abstract":"Ultrasonic image anomaly detection faces significant challenges due to limited labeled data, strong structural and random noise, and highly diverse defect manifestations. To overcome these obstacles, we introduce UltraChip, a new large-scale C-scan benchmark containing about 8,000 real-world images from various chip packaging types, each meticulously annotated with pixel-level masks for cracks, holes, and layers. Building on this resource, we present FSGM-Net, a fully unsupervised framework tailored for anomaly detection. FSGM-Net leverages an adaptive Frequency-Spatial feature filtering mechanism: a learnable FFT-Spatial patch filter first suppresses noise and dynamically assigns normality weights to Vision Transformer (ViT) patch features. Subsequently, an Adaptive Gaussian Mixture Model (Ada-GMM) captures the distribution of normal features and guides a deep–shallow multi-scale interaction decoder for accurate, pixel-level anomaly inference. In addition, we propose a filter loss that enforces encoder–filter consistency and entropy-based sparse gating, together with a distributional loss that encourages both feature reconstruction and confident Gaussian mixture modeling. Extensive experiments demonstrate that FSGM-Net not only achieves state-of-the-art results on UltraChip but also exhibits superior cross-domain generalization to MVTec-AD and VisA, while supporting real-time inference on a single GPU. Together, the dataset and framework advance robust, annotation-free ultrasonic NDT in practical applications. The UltraChip dataset can be obtained via <uri>https://iiplab.net/ultrachip/</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"1567-1581"},"PeriodicalIF":13.7,"publicationDate":"2026-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146115780","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-04 | DOI: 10.1109/TIP.2026.3659331
Binqian Xu;Xiangbo Shu;Jiachao Zhang;Rui Yan;Guo-Sen Xie
Contrastive learning facilitates the acquisition of informative skeleton representations for unsupervised action recognition by leveraging effective positive and negative sample pairs. However, most existing methods construct these pairs through weak or strong data augmentations, which typically rely on random appearance alterations of skeletons. While such augmentations are somewhat effective, they introduce semantic variations only indirectly and face two inherent limitations. First, simply modifying the appearance of skeletons often fails to reflect meaningful semantic variations. Second, random perturbations can unintentionally blur the boundary between positive and negative pairs, weakening the contrastive objective. To address these challenges, we propose an attack-driven augmentation framework that explicitly introduces semantic-level perturbations. This approach facilitates the generation of hard positives while guiding the model to mine more informative hard negatives. Building on this idea, we present Attack-Augmented Mixing-Contrastive Skeletal Representation Learning (A2MC), a novel framework that focuses on contrasting hard positive and hard negative samples for more robust representation learning. Within A2MC, we design an Attack-Augmentation (Att-Aug) module that integrates both targeted (attack-based) and untargeted (augmentation-based) perturbations to generate informative hard positive samples. In parallel, we propose the Positive-Negative Mixer (PNM), which blends hard positive and negative features to synthesize challenging hard negatives. These are then used to update a mixed memory bank for more effective contrastive learning. Comprehensive evaluations across three public benchmarks demonstrate that our approach, termed A2MC, achieves performance on par with or exceeding existing state-of-the-art methods.
{"title":"Attack-Augmented Mixing-Contrastive Skeletal Representation Learning","authors":"Binqian Xu;Xiangbo Shu;Jiachao Zhang;Rui Yan;Guo-Sen Xie","doi":"10.1109/TIP.2026.3659331","DOIUrl":"10.1109/TIP.2026.3659331","url":null,"abstract":"Contrastive learning facilitates the acquisition of informative skeleton representations for unsupervised action recognition by leveraging effective positive and negative sample pairs. However, most existing methods construct these pairs through weak or strong data augmentations, which typically rely on random appearance alterations of skeletons. While such augmentations are somewhat effective, they introduce semantic variations only indirectly and face two inherent limitations. First, simply modifying the appearance of skeletons often fails to reflect meaningful semantic variations. Second, random perturbations can unintentionally blur the boundary between positive and negative pairs, weakening the contrastive objective. To address these challenges, we propose an attack-driven augmentation framework that explicitly introduces semantic-level perturbations. This approach facilitates the generation of hard positives while guiding the model to mine more informative hard negatives. Building on this idea, we present Attack-Augmented Mixing-Contrastive Skeletal Representation Learning (A2MC), a novel framework that focuses on contrasting hard positive and hard negative samples for more robust representation learning. Within A2MC, we design an Attack-Augmentation (Att-Aug) module that integrates both targeted (attack-based) and untargeted (augmentation-based) perturbations to generate informative hard positive samples. In parallel, we propose the Positive-Negative Mixer (PNM), which blends hard positive and negative features to synthesize challenging hard negatives. These are then used to update a mixed memory bank for more effective contrastive learning. Comprehensive evaluations across three public benchmarks demonstrate that our approach, termed A2MC, achieves performance on par with or exceeding existing state-of-the-art methods.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"1521-1534"},"PeriodicalIF":13.7,"publicationDate":"2026-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146115781","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-04 | DOI: 10.1109/TIP.2026.3659334
Jonghyuk Park;Jae-Young Sim
Single Image Reflection Separation (SIRS) aims to reconstruct both the transmitted and reflected images from a single image that contains a superimposition of both, captured through a glass-like reflective surface. Recent learning-based methods of SIRS have significantly improved performance on typical images with mild reflection artifacts; however, they often struggle with diverse images containing challenging reflections captured in the wild. In this paper, we propose a universal SIRS framework based on a flexible dual-stream architecture, capable of handling diverse reflection artifacts. Specifically, we incorporate a Mixture-of-Experts mechanism that dynamically assigns specialized experts to image patches based on spatially heterogeneous reflection characteristics. The assigned experts then cooperate to extract complementary features between the transmission and reflection streams in an adaptive manner. In addition, we leverage the multi-head attention mechanism of Transformers to simultaneously exploit both high and low cross-correlations, which are then complementarily used to facilitate adaptive inter-stream feature interactions. Experimental results evaluated on diverse real-world datasets demonstrate that the proposed method significantly outperforms existing state-of-the-art methods qualitatively and quantitatively.
{"title":"Complementary Mixture-of-Experts and Complementary Cross-Attention for Single Image Reflection Separation in the Wild","authors":"Jonghyuk Park;Jae-Young Sim","doi":"10.1109/TIP.2026.3659334","DOIUrl":"10.1109/TIP.2026.3659334","url":null,"abstract":"Single Image Reflection Separation (SIRS) aims to reconstruct both the transmitted and reflected images from a single image that contains a superimposition of both, captured through a glass-like reflective surface. Recent learning-based methods of SIRS have significantly improved performance on typical images with mild reflection artifacts; however, they often struggle with diverse images containing challenging reflections captured in the wild. In this paper, we propose a universal SIRS framework based on a flexible dual-stream architecture, capable of handling diverse reflection artifacts. Specifically, we incorporate a Mixture-of-Experts mechanism that dynamically assigns specialized experts to image patches based on spatially heterogeneous reflection characteristics. The assigned experts then cooperate to extract complementary features between the transmission and reflection streams in an adaptive manner. In addition, we leverage the multi-head attention mechanism of Transformers to simultaneously exploit both high and low cross-correlations, which are then complementarily used to facilitate adaptive inter-stream feature interactions. Experimental results evaluated on diverse real-world datasets demonstrate that the proposed method significantly outperforms existing state-of-the-art methods qualitatively and quantitatively.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"1607-1620"},"PeriodicalIF":13.7,"publicationDate":"2026-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146115779","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-03 | DOI: 10.1109/TIP.2026.3657896
Yiming Zhong;Xiaolin Zhang;Ligang Liu;Yao Zhao;Yunchao Wei
Similar to facial beautification in real life, 3D virtual avatars require personalized customization to enhance their visual appeal, yet this area remains insufficiently explored. Although current 3D Gaussian editing methods can be adapted for facial makeup purposes, they fail to meet the fundamental requirements for realistic makeup effects: 1) ensuring a consistent appearance during drivable expressions; 2) preserving the identity throughout the makeup process; and 3) enabling precise control over fine details. To address these, we propose a specialized 3D makeup method named AvatarMakeup, leveraging a pretrained diffusion model to transfer makeup patterns from a single reference photo of any individual. We adopt a coarse-to-fine strategy that first maintains a consistent appearance and identity and then refines the details. In particular, the diffusion model is employed to generate makeup images as supervision. Due to the uncertainties in the diffusion process, the generated images are inconsistent across different viewpoints and expressions. Therefore, we propose a Coherent Duplication method to coarsely apply makeup to the target while ensuring consistency across dynamic and multi-view effects. Coherent Duplication optimizes a global UV map by recording the averaged facial attributes across the generated makeup images. By querying the global UV map, it easily synthesizes coherent makeup guidance from arbitrary views and expressions to optimize the target avatar. Given the coarse makeup avatar, we further enhance the makeup by incorporating a Refinement Module into the diffusion model to achieve high makeup quality. Experiments demonstrate that AvatarMakeup achieves state-of-the-art makeup transfer quality and consistency throughout animation.
{"title":"AvatarMakeup: Realistic Makeup Transfer for 3D Animatable Head Avatars","authors":"Yiming Zhong;Xiaolin Zhang;Ligang Liu;Yao Zhao;Yunchao Wei","doi":"10.1109/TIP.2026.3657896","DOIUrl":"10.1109/TIP.2026.3657896","url":null,"abstract":"Similar to facial beautification in real life, 3D virtual avatars require personalized customization to enhance their visual appeal, yet this area remains insufficiently explored. Although current 3D Gaussian editing methods can be adapted for facial makeup purposes, these methods fail to meet the fundamental requirements for achieving realistic makeup effects: 1) ensuring a consistent appearance during drivable expressions; 2) preserving the identity throughout the makeup process; and 3) enabling precise control over fine details. To address these, we propose a specialized 3D makeup method named AvatarMakeup, leveraging a pretrained diffusion model to transfer makeup patterns from a single reference photo of any individual. We adopt a coarse-to-fine idea to first maintain the consistent appearance and identity, and then to refine the details. In particular, the diffusion model is employed to generate makeup images as supervision. Due to the uncertainties in diffusion process, the generated images are inconsistent across different viewpoints and expressions. Therefore, we propose a Coherent Duplication method to coarsely apply makeup to the target while ensuring consistency across dynamic and multi-view effects. Coherent Duplication optimizes a global UV map by recoding the averaged facial attributes among the generated makeup images. By querying the global UV map, it easily synthesizes coherent makeup guidance from arbitrary views and expressions to optimize the target avatar. Given the coarse makeup avatar, we further enhance the makeup by incorporating a Refinement Module into the diffusion model to achieve high makeup quality. Experiments demonstrate that AvatarMakeup achieves state-of-the-art makeup transfer quality and consistency throughout animation.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"1436-1447"},"PeriodicalIF":13.7,"publicationDate":"2026-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146110197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-03 | DOI: 10.1109/TIP.2026.3658212
Jiahui Qu;Jingyu Zhao;Wenqian Dong;Lijian Zhang;Yunsong Li
Hyperspectral image (HSI) change detection is a technique that identifies the changes occurring between bitemporal HSIs covering the same geographic area. The field of change detection has witnessed the proposal and successful implementation of numerous methods. However, a majority of these approaches adhere to the centralized learning paradigm, which requires data transmission to a central server for training. The sensitivity of remote sensing data generally prohibits its sharing across different clients. Furthermore, manual labeling is costly in practice. In this paper, we propose a spatial-spectral-temporal collaborative Mamba-based active federated hyperspectral change detection (MambaFedCD) framework, which utilizes the limited labeled samples from multiple clients to achieve change detection while ensuring the data privacy of each client. Specifically, there are three key characteristics: 1) a spatial-spectral-temporal collaborative Mamba-based change detection (S²TMamba) model is proposed to efficiently synergize the temporal and global spatial-spectral information of the bitemporal HSIs for change detection; 2) a difference feature diversity correction-based model aggregation (DFDCMA) strategy is devised to incorporate the diversity of difference features for rational allocation of weight factors among clients and to facilitate effective aggregation of the global model; 3) we propose a multi-decision federated active learning (MDFAL) strategy that selects both error-prone and valuable samples for model training to alleviate the burden of sample labeling. Comprehensive experiments conducted on commonly utilized datasets demonstrate that the proposed method outperforms other state-of-the-art methods. The code is available at https://github.com/Jiahuiqu/MambaFedCD
{"title":"MambaFedCD: Spatial–Spectral–Temporal Collaborative Mamba-Based Active Federated Hyperspectral Change Detection","authors":"Jiahui Qu;Jingyu Zhao;Wenqian Dong;Lijian Zhang;Yunsong Li","doi":"10.1109/TIP.2026.3658212","DOIUrl":"10.1109/TIP.2026.3658212","url":null,"abstract":"Hyperspectral image (HSI) change detection is a technique that can identify the changes occurring between the bitemporal HSIs covering the same geographic area. The field of change detection has witnessed the proposal and successful implementation of numerous methods. However, a majority of these approaches adhere to the centralized learning paradigm, which requires data transmission to a central server for training. The sensitivity of remote sensing data generally prohibit their sharing across different clients. Furthermore, manual labeling is a costly effort in practically. In this paper, we propose a spatial-spectral-temporal collaborative Mamba-based active federated hyperspectral change detection (MambaFedCD) framework, which utilizes the limited labeled samples from multiple clients to achieve change detection while ensuring the data privacy of each client. Specifically, there are three key characteristics: 1) a spatial-spectral-temporal collaborative Mamba-based change detection (<inline-formula> <tex-math>${{text {S}}^{2}}{text {TMamba}}$ </tex-math></inline-formula>) model is proposed to efficiently synergize the temporal and global spatial-spectral information of the bitemporal HSIs for change detection; 2) a difference feature diversity correction-based model aggregation (DFDCMA) strategy is devised to incorporate the diversity of difference features for rational allocation of weight factors among clients and to facilitate effective aggregation of the global model; 3) we propose a multi-decision federated active learning (MDFAL) strategy that selects both error-prone and valuable samples for model training to alleviate the burden of sample labeling. Comprehensive experiments conducted on commonly utilized datasets demonstrate that the proposed method outperforms other state-of-the-art methods. The code is available at <uri>https://github.com/Jiahuiqu/MambaFedCD</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"1478-1492"},"PeriodicalIF":13.7,"publicationDate":"2026-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146110200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-03 | DOI: 10.1109/TIP.2026.3653547
Weilei Wen;Tianyi Zhang;Qianqian Zhao;Zhaohui Zheng;Chunle Guo;Xiuli Shao;Chongyi Li
Recent advancements in codebook-based real image super-resolution (SR) have shown promising results in real-world applications. The core idea involves matching high-quality image features from a codebook based on low-resolution (LR) image features. However, existing methods face two major challenges: inaccurate feature matching with the codebook and poor texture detail reconstruction. To address these issues, we propose a novel Uncertainty-Guided and Top-k Codebook Matching SR (UGTSR) framework, which incorporates three key components: 1) an uncertainty learning mechanism that guides the model to focus on texture-rich regions, 2) a Top-k feature matching strategy that enhances feature matching accuracy by fusing multiple candidate features, and 3) an Align-Attention module that enhances the alignment of information between LR and HR features. Experimental results demonstrate significant improvements in texture realism and reconstruction fidelity compared to existing methods. The source code can be found at https://github.com/wwlCape/UGTSR-main
{"title":"Incorporating Uncertainty-Guided and Top-k Codebook Matching for Real-World Blind Image Super-Resolution","authors":"Weilei Wen;Tianyi Zhang;Qianqian Zhao;Zhaohui Zheng;Chunle Guo;Xiuli Shao;Chongyi Li","doi":"10.1109/TIP.2026.3653547","DOIUrl":"10.1109/TIP.2026.3653547","url":null,"abstract":"Recent advancements in codebook-based real image super-resolution (SR) have shown promising results in real-world applications. The core idea involves matching high-quality image features from a codebook based on low-resolution (LR) image features. However, existing methods face two major challenges: inaccurate feature matching with the codebook and poor texture detail reconstruction. To address these issues, we propose a novel Uncertainty-Guided and Top-k Codebook Matching SR (UGTSR) framework, which incorporates three key components: 1) an uncertainty learning mechanism that guides the model to focus on texture-rich regions, 2) a Top-k feature matching strategy that enhances feature matching accuracy by fusing multiple candidate features, and 3) an Align-Attention module that enhances the alignment of information between LR and HR features. Experimental results demonstrate significant improvements in texture realism and reconstruction fidelity compared to existing methods. The source code can be found at <uri>https://github.com/wwlCape/UGTSR-main</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"1535-1550"},"PeriodicalIF":13.7,"publicationDate":"2026-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146110199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-03 | DOI: 10.1109/TIP.2026.3657233
Shuang Zeng;Lei Zhu;Xinliang Zhang;Hangzhou He;Yanye Lu
Medical image segmentation is a critical yet challenging task, primarily due to the difficulty of obtaining extensive datasets of high-quality, expert-annotated images. Contrastive learning offers a potential but still problematic solution, because most existing methods focus on extracting instance-level or pixel-to-pixel representations and ignore the characteristics of similar pixel groups within an image. Moreover, for contrastive pair generation, most state-of-the-art methods rely on manually set thresholds, which requires extensive experimentation and lacks efficiency and generalization. To address these issues, we propose a novel contrastive learning approach named SuperCL for medical image segmentation pre-training. Specifically, SuperCL exploits the structural prior and pixel correlation of images by introducing two novel contrastive pair generation strategies: Intra-image Local Contrastive Pairs (ILCP) generation and Inter-image Global Contrastive Pairs (IGCP) generation. Since superpixel clustering aligns well with the concept of contrastive pair generation, we utilize the superpixel map to generate pseudo masks for both ILCP and IGCP to guide supervised contrastive learning. Moreover, we propose two modules, Average SuperPixel Feature Map Generation (ASP) and Connected Components Label Generation (CCL), to better exploit the prior structural information for IGCP. Finally, experiments on 8 medical image datasets show that SuperCL outperforms 12 existing methods, producing visibly more precise predictions and achieving DSC gains of 3.15%, 5.44%, and 7.89% over the previous best results on MMWHS, CHAOS, and Spleen with 10% annotations. Our code is released at https://github.com/stevezs315/SuperCL
{"title":"SuperCL: Superpixel Guided Contrastive Learning for Medical Image Segmentation Pre-Training","authors":"Shuang Zeng;Lei Zhu;Xinliang Zhang;Hangzhou He;Yanye Lu","doi":"10.1109/TIP.2026.3657233","DOIUrl":"10.1109/TIP.2026.3657233","url":null,"abstract":"Medical image segmentation is a critical yet challenging task, primarily due to the difficulty of obtaining extensive datasets of high-quality, expert-annotated images. Contrastive learning presents a potential but still problematic solution to this issue. Because most existing methods focus on extracting instance-level or pixel-to-pixel representation, which ignores the characteristics between intra-image similar pixel groups. Moreover, when considering contrastive pairs generation, most SOTA methods mainly rely on manually setting thresholds, which requires a large number of gradient experiments and lacks efficiency and generalization. To address these issues, we propose a novel contrastive learning approach named SuperCL for medical image segmentation pre-training. Specifically, our SuperCL exploits the structural prior and pixel correlation of images by introducing two novel contrastive pairs generation strategies: Intra-image Local Contrastive Pairs (ILCP) Generation and Inter-image Global Contrastive Pairs (IGCP) Generation. Considering superpixel cluster aligns well with the concept of contrastive pairs generation, we utilize the superpixel map to generate pseudo masks for both ILCP and IGCP to guide supervised contrastive learning. Moreover, we also propose two modules named Average SuperPixel Feature Map Generation (ASP) and Connected Components Label Generation (CCL) to better exploit the prior structural information for IGCP. Finally, experiments on 8 medical image datasets indicate our SuperCL outperforms existing 12 methods. i.e. Our SuperCL achieves a superior performance with more precise predictions from visualization figures and 3.15%, 5.44%, 7.89% DSC higher than the previous best results on MMWHS, CHAOS, Spleen with 10% annotations. Our code is released at <uri>https://github.com/stevezs315/SuperCL</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"1636-1651"},"PeriodicalIF":13.7,"publicationDate":"2026-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146110198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}