Complementary Mixture-of-Experts and Complementary Cross-Attention for Single Image Reflection Separation in the Wild
Pub Date: 2026-02-04; DOI: 10.1109/TIP.2026.3659334
Jonghyuk Park;Jae-Young Sim
Single Image Reflection Separation (SIRS) aims to reconstruct both the transmitted and reflected images from a single image that contains a superimposition of both, captured through a glass-like reflective surface. Recent learning-based methods of SIRS have significantly improved performance on typical images with mild reflection artifacts; however, they often struggle with diverse images containing challenging reflections captured in the wild. In this paper, we propose a universal SIRS framework based on a flexible dual-stream architecture, capable of handling diverse reflection artifacts. Specifically, we incorporate a Mixture-of-Experts mechanism that dynamically assigns specialized experts to image patches based on spatially heterogeneous reflection characteristics. The assigned experts then cooperate to extract complementary features between the transmission and reflection streams in an adaptive manner. In addition, we leverage the multi-head attention mechanism of Transformers to simultaneously exploit both high and low cross-correlations, which are then complementarily used to facilitate adaptive inter-stream feature interactions. Experimental results evaluated on diverse real-world datasets demonstrate that the proposed method significantly outperforms existing state-of-the-art methods qualitatively and quantitatively.
{"title":"Complementary Mixture-of-Experts and Complementary Cross-Attention for Single Image Reflection Separation in the Wild","authors":"Jonghyuk Park;Jae-Young Sim","doi":"10.1109/TIP.2026.3659334","DOIUrl":"10.1109/TIP.2026.3659334","url":null,"abstract":"Single Image Reflection Separation (SIRS) aims to reconstruct both the transmitted and reflected images from a single image that contains a superimposition of both, captured through a glass-like reflective surface. Recent learning-based methods of SIRS have significantly improved performance on typical images with mild reflection artifacts; however, they often struggle with diverse images containing challenging reflections captured in the wild. In this paper, we propose a universal SIRS framework based on a flexible dual-stream architecture, capable of handling diverse reflection artifacts. Specifically, we incorporate a Mixture-of-Experts mechanism that dynamically assigns specialized experts to image patches based on spatially heterogeneous reflection characteristics. The assigned experts then cooperate to extract complementary features between the transmission and reflection streams in an adaptive manner. In addition, we leverage the multi-head attention mechanism of Transformers to simultaneously exploit both high and low cross-correlations, which are then complementarily used to facilitate adaptive inter-stream feature interactions. Experimental results evaluated on diverse real-world datasets demonstrate that the proposed method significantly outperforms existing state-of-the-art methods qualitatively and quantitatively.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"1607-1620"},"PeriodicalIF":13.7,"publicationDate":"2026-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146115779","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
AvatarMakeup: Realistic Makeup Transfer for 3D Animatable Head Avatars
Yiming Zhong;Xiaolin Zhang;Ligang Liu;Yao Zhao;Yunchao Wei
Pub Date: 2026-02-03; DOI: 10.1109/TIP.2026.3657896
Similar to facial beautification in real life, 3D virtual avatars require personalized customization to enhance their visual appeal, yet this area remains insufficiently explored. Although current 3D Gaussian editing methods can be adapted for facial makeup purposes, they fail to meet the fundamental requirements for achieving realistic makeup effects: 1) ensuring a consistent appearance during drivable expressions; 2) preserving identity throughout the makeup process; and 3) enabling precise control over fine details. To address these issues, we propose a specialized 3D makeup method named AvatarMakeup, leveraging a pretrained diffusion model to transfer makeup patterns from a single reference photo of any individual. We adopt a coarse-to-fine approach that first maintains a consistent appearance and identity and then refines the details. In particular, the diffusion model is employed to generate makeup images as supervision. Due to the uncertainties in the diffusion process, the generated images are inconsistent across different viewpoints and expressions. Therefore, we propose a Coherent Duplication method to coarsely apply makeup to the target while ensuring consistency across dynamic and multi-view effects. Coherent Duplication optimizes a global UV map by recording the averaged facial attributes among the generated makeup images. By querying the global UV map, it easily synthesizes coherent makeup guidance from arbitrary views and expressions to optimize the target avatar. Given the coarse makeup avatar, we further enhance the makeup by incorporating a Refinement Module into the diffusion model to achieve high makeup quality. Experiments demonstrate that AvatarMakeup achieves state-of-the-art makeup transfer quality and consistency throughout animation.
{"title":"AvatarMakeup: Realistic Makeup Transfer for 3D Animatable Head Avatars","authors":"Yiming Zhong;Xiaolin Zhang;Ligang Liu;Yao Zhao;Yunchao Wei","doi":"10.1109/TIP.2026.3657896","DOIUrl":"10.1109/TIP.2026.3657896","url":null,"abstract":"Similar to facial beautification in real life, 3D virtual avatars require personalized customization to enhance their visual appeal, yet this area remains insufficiently explored. Although current 3D Gaussian editing methods can be adapted for facial makeup purposes, these methods fail to meet the fundamental requirements for achieving realistic makeup effects: 1) ensuring a consistent appearance during drivable expressions; 2) preserving the identity throughout the makeup process; and 3) enabling precise control over fine details. To address these, we propose a specialized 3D makeup method named AvatarMakeup, leveraging a pretrained diffusion model to transfer makeup patterns from a single reference photo of any individual. We adopt a coarse-to-fine idea to first maintain the consistent appearance and identity, and then to refine the details. In particular, the diffusion model is employed to generate makeup images as supervision. Due to the uncertainties in diffusion process, the generated images are inconsistent across different viewpoints and expressions. Therefore, we propose a Coherent Duplication method to coarsely apply makeup to the target while ensuring consistency across dynamic and multi-view effects. Coherent Duplication optimizes a global UV map by recoding the averaged facial attributes among the generated makeup images. By querying the global UV map, it easily synthesizes coherent makeup guidance from arbitrary views and expressions to optimize the target avatar. Given the coarse makeup avatar, we further enhance the makeup by incorporating a Refinement Module into the diffusion model to achieve high makeup quality. Experiments demonstrate that AvatarMakeup achieves state-of-the-art makeup transfer quality and consistency throughout animation.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"1436-1447"},"PeriodicalIF":13.7,"publicationDate":"2026-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146110197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MambaFedCD: Spatial–Spectral–Temporal Collaborative Mamba-Based Active Federated Hyperspectral Change Detection
Pub Date: 2026-02-03; DOI: 10.1109/TIP.2026.3658212
Jiahui Qu;Jingyu Zhao;Wenqian Dong;Lijian Zhang;Yunsong Li
Hyperspectral image (HSI) change detection is a technique that identifies the changes occurring between bitemporal HSIs covering the same geographic area. The field of change detection has witnessed the proposal and successful implementation of numerous methods. However, the majority of these approaches adhere to the centralized learning paradigm, which requires data transmission to a central server for training. The sensitivity of remote sensing data generally prohibits its sharing across different clients. Furthermore, manual labeling is costly in practice. In this paper, we propose a spatial-spectral-temporal collaborative Mamba-based active federated hyperspectral change detection (MambaFedCD) framework, which utilizes the limited labeled samples from multiple clients to achieve change detection while ensuring the data privacy of each client. Specifically, there are three key characteristics: 1) a spatial-spectral-temporal collaborative Mamba-based change detection (S²TMamba) model is proposed to efficiently synergize the temporal and global spatial-spectral information of the bitemporal HSIs for change detection; 2) a difference feature diversity correction-based model aggregation (DFDCMA) strategy is devised to incorporate the diversity of difference features for rational allocation of weight factors among clients and to facilitate effective aggregation of the global model; 3) we propose a multi-decision federated active learning (MDFAL) strategy that selects both error-prone and valuable samples for model training to alleviate the burden of sample labeling. Comprehensive experiments conducted on commonly utilized datasets demonstrate that the proposed method outperforms other state-of-the-art methods. The code is available at https://github.com/Jiahuiqu/MambaFedCD
{"title":"MambaFedCD: Spatial–Spectral–Temporal Collaborative Mamba-Based Active Federated Hyperspectral Change Detection","authors":"Jiahui Qu;Jingyu Zhao;Wenqian Dong;Lijian Zhang;Yunsong Li","doi":"10.1109/TIP.2026.3658212","DOIUrl":"10.1109/TIP.2026.3658212","url":null,"abstract":"Hyperspectral image (HSI) change detection is a technique that can identify the changes occurring between the bitemporal HSIs covering the same geographic area. The field of change detection has witnessed the proposal and successful implementation of numerous methods. However, a majority of these approaches adhere to the centralized learning paradigm, which requires data transmission to a central server for training. The sensitivity of remote sensing data generally prohibit their sharing across different clients. Furthermore, manual labeling is a costly effort in practically. In this paper, we propose a spatial-spectral-temporal collaborative Mamba-based active federated hyperspectral change detection (MambaFedCD) framework, which utilizes the limited labeled samples from multiple clients to achieve change detection while ensuring the data privacy of each client. Specifically, there are three key characteristics: 1) a spatial-spectral-temporal collaborative Mamba-based change detection (<inline-formula> <tex-math>${{text {S}}^{2}}{text {TMamba}}$ </tex-math></inline-formula>) model is proposed to efficiently synergize the temporal and global spatial-spectral information of the bitemporal HSIs for change detection; 2) a difference feature diversity correction-based model aggregation (DFDCMA) strategy is devised to incorporate the diversity of difference features for rational allocation of weight factors among clients and to facilitate effective aggregation of the global model; 3) we propose a multi-decision federated active learning (MDFAL) strategy that selects both error-prone and valuable samples for model training to alleviate the burden of sample labeling. Comprehensive experiments conducted on commonly utilized datasets demonstrate that the proposed method outperforms other state-of-the-art methods. The code is available at <uri>https://github.com/Jiahuiqu/MambaFedCD</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"1478-1492"},"PeriodicalIF":13.7,"publicationDate":"2026-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146110200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Incorporating Uncertainty-Guided and Top-k Codebook Matching for Real-World Blind Image Super-Resolution
Pub Date: 2026-02-03; DOI: 10.1109/TIP.2026.3653547
Weilei Wen;Tianyi Zhang;Qianqian Zhao;Zhaohui Zheng;Chunle Guo;Xiuli Shao;Chongyi Li
Recent advancements in codebook-based real image super-resolution (SR) have shown promising results in real-world applications. The core idea involves matching high-quality image features from a codebook based on low-resolution (LR) image features. However, existing methods face two major challenges: inaccurate feature matching with the codebook and poor texture detail reconstruction. To address these issues, we propose a novel Uncertainty-Guided and Top-k Codebook Matching SR (UGTSR) framework, which incorporates three key components: 1) an uncertainty learning mechanism that guides the model to focus on texture-rich regions, 2) a Top-k feature matching strategy that enhances feature matching accuracy by fusing multiple candidate features, and 3) an Align-Attention module that enhances the alignment of information between LR and HR features. Experimental results demonstrate significant improvements in texture realism and reconstruction fidelity compared to existing methods. The source code can be found at https://github.com/wwlCape/UGTSR-main
{"title":"Incorporating Uncertainty-Guided and Top-k Codebook Matching for Real-World Blind Image Super-Resolution","authors":"Weilei Wen;Tianyi Zhang;Qianqian Zhao;Zhaohui Zheng;Chunle Guo;Xiuli Shao;Chongyi Li","doi":"10.1109/TIP.2026.3653547","DOIUrl":"10.1109/TIP.2026.3653547","url":null,"abstract":"Recent advancements in codebook-based real image super-resolution (SR) have shown promising results in real-world applications. The core idea involves matching high-quality image features from a codebook based on low-resolution (LR) image features. However, existing methods face two major challenges: inaccurate feature matching with the codebook and poor texture detail reconstruction. To address these issues, we propose a novel Uncertainty-Guided and Top-k Codebook Matching SR (UGTSR) framework, which incorporates three key components: 1) an uncertainty learning mechanism that guides the model to focus on texture-rich regions, 2) a Top-k feature matching strategy that enhances feature matching accuracy by fusing multiple candidate features, and 3) an Align-Attention module that enhances the alignment of information between LR and HR features. Experimental results demonstrate significant improvements in texture realism and reconstruction fidelity compared to existing methods. The source code can be found at <uri>https://github.com/wwlCape/UGTSR-main</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"1535-1550"},"PeriodicalIF":13.7,"publicationDate":"2026-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146110199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SuperCL: Superpixel Guided Contrastive Learning for Medical Image Segmentation Pre-Training
Pub Date: 2026-02-03; DOI: 10.1109/TIP.2026.3657233
Shuang Zeng;Lei Zhu;Xinliang Zhang;Hangzhou He;Yanye Lu
Medical image segmentation is a critical yet challenging task, primarily due to the difficulty of obtaining extensive datasets of high-quality, expert-annotated images. Contrastive learning offers a potential but still imperfect solution, because most existing methods focus on extracting instance-level or pixel-to-pixel representations and ignore the characteristics of groups of similar pixels within an image. Moreover, for contrastive pair generation, most state-of-the-art methods rely on manually set thresholds, which requires extensive experimentation and lacks efficiency and generalization. To address these issues, we propose a novel contrastive learning approach named SuperCL for medical image segmentation pre-training. Specifically, SuperCL exploits the structural prior and pixel correlation of images by introducing two novel contrastive pair generation strategies: Intra-image Local Contrastive Pairs (ILCP) Generation and Inter-image Global Contrastive Pairs (IGCP) Generation. Considering that superpixel clustering aligns well with the concept of contrastive pair generation, we utilize the superpixel map to generate pseudo masks for both ILCP and IGCP to guide supervised contrastive learning. Moreover, we propose two modules, Average SuperPixel Feature Map Generation (ASP) and Connected Components Label Generation (CCL), to better exploit the prior structural information for IGCP. Finally, experiments on 8 medical image datasets indicate that SuperCL outperforms 12 existing methods, yielding more precise predictions in the visual comparisons and DSC gains of 3.15%, 5.44%, and 7.89% over the previous best results on MMWHS, CHAOS, and Spleen with 10% annotations. Our code is released at https://github.com/stevezs315/SuperCL
{"title":"SuperCL: Superpixel Guided Contrastive Learning for Medical Image Segmentation Pre-Training","authors":"Shuang Zeng;Lei Zhu;Xinliang Zhang;Hangzhou He;Yanye Lu","doi":"10.1109/TIP.2026.3657233","DOIUrl":"10.1109/TIP.2026.3657233","url":null,"abstract":"Medical image segmentation is a critical yet challenging task, primarily due to the difficulty of obtaining extensive datasets of high-quality, expert-annotated images. Contrastive learning presents a potential but still problematic solution to this issue. Because most existing methods focus on extracting instance-level or pixel-to-pixel representation, which ignores the characteristics between intra-image similar pixel groups. Moreover, when considering contrastive pairs generation, most SOTA methods mainly rely on manually setting thresholds, which requires a large number of gradient experiments and lacks efficiency and generalization. To address these issues, we propose a novel contrastive learning approach named SuperCL for medical image segmentation pre-training. Specifically, our SuperCL exploits the structural prior and pixel correlation of images by introducing two novel contrastive pairs generation strategies: Intra-image Local Contrastive Pairs (ILCP) Generation and Inter-image Global Contrastive Pairs (IGCP) Generation. Considering superpixel cluster aligns well with the concept of contrastive pairs generation, we utilize the superpixel map to generate pseudo masks for both ILCP and IGCP to guide supervised contrastive learning. Moreover, we also propose two modules named Average SuperPixel Feature Map Generation (ASP) and Connected Components Label Generation (CCL) to better exploit the prior structural information for IGCP. Finally, experiments on 8 medical image datasets indicate our SuperCL outperforms existing 12 methods. i.e. Our SuperCL achieves a superior performance with more precise predictions from visualization figures and 3.15%, 5.44%, 7.89% DSC higher than the previous best results on MMWHS, CHAOS, Spleen with 10% annotations. Our code is released at <uri>https://github.com/stevezs315/SuperCL</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"1636-1651"},"PeriodicalIF":13.7,"publicationDate":"2026-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146110198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Frequency-Decomposed Interaction Network for Stereo Image Restoration
Pub Date: 2026-02-02; DOI: 10.1109/TIP.2026.3658219
Xianmin Tian;Jin Xie;Ronghua Xu;Jing Nie;Jiale Cao;Yanwei Pang;Xuelong Li
Stereo image restoration in adverse environments, such as low-light conditions, rain, and low resolution, requires effective exploitation of cross-view complementary information to recover degraded visual content. In monocular image restoration, frequency decomposition has proven effective, where high-frequency components aid in recovering fine textures and reducing blur, while low-frequency components facilitate noise suppression and illumination correction. However, existing stereo restoration methods have yet to explore cross-view interactions through frequency decomposition, which is a promising direction for enhancing restoration quality. To address this, we propose a frequency-aware framework comprising a Frequency Decomposition Module (FDM), Detail Interaction Module (DIM), Structural Interaction Module (SIM), and Adaptive Fusion Module (AFM). FDM employs learnable filters to decompose the image into high- and low-frequency components. DIM enhances the high-frequency branch by capturing local detail cues through deformable convolution. SIM processes the low-frequency branch by modeling global structural correlations via a cross-view row-wise attention mechanism. Finally, AFM adaptively fuses the complementary frequency-specific information to generate high-quality restored images. Extensive experiments demonstrate the efficacy and generalizability of our framework across three diverse stereo restoration tasks, where it achieves state-of-the-art performance in low-light enhancement and rain removal, alongside highly competitive results in super-resolution. Our code is available at https://github.com/C2022J/FDIN
{"title":"Frequency-Decomposed Interaction Network for Stereo Image Restoration","authors":"Xianmin Tian;Jin Xie;Ronghua Xu;Jing Nie;Jiale Cao;Yanwei Pang;Xuelong Li","doi":"10.1109/TIP.2026.3658219","DOIUrl":"10.1109/TIP.2026.3658219","url":null,"abstract":"Stereo image restoration in adverse environments, such as low-light conditions, rain, and low resolution, requires effective exploitation of cross-view complementary information to recover degraded visual content. In monocular image restoration, frequency decomposition has proven effective, where high-frequency components aid in recovering fine textures and reducing blur, while low-frequency components facilitate noise suppression and illumination correction. However, existing stereo restoration methods have yet to explore cross-view interactions by frequency decomposition, which is a promising direction for enhancing restoration quality. To address this, we propose a frequency-aware framework comprising a Frequency Decomposition Module (FDM), Detail Interaction Module (DIM), Structural Interaction Module (SIM), and Adaptive Fusion Module (AFM). FDM employs learnable filters to decompose the image into high- and low-frequency components. DIM enhances the high-frequency branch by capturing local detail cues through deformable convolution. SIM processes the low-frequency branch by modeling global structural correlations via a cross-view row-wise attention mechanism. Finally, AFM adaptively fuses the complementary frequency-specific information to generate high-quality restored images. Extensive experiments demonstrate the efficacy and generalizability of our framework across three diverse stereo restoration tasks, where it achieves state-of-the-art performance in low-light enhancement, rain removal, alongside highly competitive results in super-resolution. Our code is available at <uri>https://github.com/C2022J/FDIN</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"1462-1477"},"PeriodicalIF":13.7,"publicationDate":"2026-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146101789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MDA-MAA: A Collaborative Augmentation Approach for Generalizing Cross-Domain Retrieval
Pub Date: 2026-02-02; DOI: 10.1109/TIP.2026.3658223
Ming Jin;Richang Hong
In video-text cross-domain retrieval tasks, the generalization ability of the retrieval models is key to improving their performance and is crucial for enhancing their practical applicability. However, existing retrieval models exhibit significant deficiencies in cross-domain generalization. On one hand, models tend to overfit specific training domain data, resulting in poor cross-domain matching and significantly reduced retrieval accuracy when dealing with data from different, new, or mixed domains. On the other hand, although data augmentation is a vital strategy for enhancing model generalization, most existing methods focus on unimodal augmentation and fail to fully exploit the multimodal correlations between video and text. As a result, the augmented data lack semantic diversity, which further limits the model’s ability to understand and perform in complex cross-domain scenarios. To address these challenges, this paper proposes an innovative collaborative augmentation approach named MDA-MAA, which includes two core modules: the Masked Attention Augmentation (MAA) module and the Multimodal Diffusion Augmentation (MDA) module. The MAA module applies masking to the original video frame features and uses an attention mechanism to predict the masked features, effectively reducing overfitting to training data and enhancing model generalization. The MDA module generates subtitles from video frames and uses the LLaMA model to infer comprehensive video captions. These captions, combined with the original video frames, are integrated into a diffusion model for joint learning, ultimately generating semantically enriched augmented video frames. This process leverages the multimodal relationship between video and text to increase the diversity of the training data distribution. Experimental results demonstrate that this collaborative augmentation method significantly improves the performance of video-text cross-domain retrieval models, validating its effectiveness in enhancing model generalization.
{"title":"MDA-MAA: A Collaborative Augmentation Approach for Generalizing Cross-Domain Retrieval","authors":"Ming Jin;Richang Hong","doi":"10.1109/TIP.2026.3658223","DOIUrl":"10.1109/TIP.2026.3658223","url":null,"abstract":"In video-text cross-domain retrieval tasks, the generalization ability of the retrieval models is key to improving their performance and is crucial for enhancing their practical applicability. However, existing retrieval models exhibit significant deficiencies in cross-domain generalization. On one hand, models tend to overfit specific training domain data, resulting in poor cross-domain matching and significantly reduced retrieval accuracy when dealing with data from different, new, or mixed domains. On the other hand, although data augmentation is a vital strategy for enhancing model generalization, most existing methods focus on unimodal augmentation and fail to fully exploit the multimodal correlations between video and text. As a result, the augmented data lack semantic diversity, which further limits the model’s ability to understand and perform in complex cross-domain scenarios. To address these challenges, this paper proposes an innovative collaborative augmentation approach named MDA-MAA, which includes two core modules: the Masked Attention Augmentation (MAA) module and the Multimodal Diffusion Augmentation (MDA) module. The MAA module applies masking to the original video frame features and uses an attention mechanism to predict the masked features, effectively reducing overfitting to training data and enhancing model generalization. The MDA module generates subtitles from video frames and uses the LLaMA model to infer comprehensive video captions. These captions, combined with the original video frames, are integrated into a diffusion model for joint learning, ultimately generating semantically enriched augmented video frames. This process leverages the multimodal relationship between video and text to increase the diversity of the training data distribution. Experimental results demonstrate that this collaborative augmentation method significantly improves the performance of video-text cross-domain retrieval models, validating its effectiveness in enhancing model generalization.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"1595-1606"},"PeriodicalIF":13.7,"publicationDate":"2026-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146101327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Padé Neurons for Efficient Neural Models
Pub Date: 2026-01-30; DOI: 10.1109/TIP.2026.3653202
Onur Keleş;A. Murat Tekalp
Neural networks commonly employ the McCulloch-Pitts neuron model, which is a linear model followed by a point-wise non-linear activation. Various researchers have already advanced inherently non-linear neuron models, such as quadratic neurons, generalized operational neurons, generative neurons, and super neurons, which offer stronger non-linearity compared to point-wise activation functions. In this paper, we introduce a novel and better non-linear neuron model called Padé neurons (Paons), inspired by Padé approximants. Paons offer several advantages, such as diversity of non-linearity, since each Paon learns a different non-linear function of its inputs, and layer efficiency, since Paons provide stronger non-linearity in far fewer layers compared to piecewise linear approximation. Furthermore, Paons include all previously proposed neuron models as special cases, thus any neuron model in any network can be replaced by Paons. We note that there has been a proposal to employ the Padé approximation as a generalized point-wise activation function, which is fundamentally different from our model. To validate the efficacy of Paons, in our experiments, we replace classic neurons in some well-known neural image super-resolution, compression, and classification models based on the ResNet architecture with Paons. Our comprehensive experimental results and analyses demonstrate that neural models built with Paons provide performance better than or equal to that of their classic counterparts with a smaller number of layers. The PyTorch implementation code for Paon is open-sourced at https://github.com/onur-keles/Paon
{"title":"Padé Neurons for Efficient Neural Models","authors":"Onur Keleş;A. Murat Tekalp","doi":"10.1109/TIP.2026.3653202","DOIUrl":"10.1109/TIP.2026.3653202","url":null,"abstract":"Neural networks commonly employ the McCulloch-Pitts neuron model, which is a linear model followed by a point-wise non-linear activation. Various researchers have already advanced inherently non-linear neuron models, such as quadratic neurons, generalized operational neurons, generative neurons, and super neurons, which offer stronger non-linearity compared to point-wise activation functions. In this paper, we introduce a novel and better non-linear neuron model called Padé neurons (<inline-formula> <tex-math>$mathrm {textit {Paon}}$ </tex-math></inline-formula>s), inspired by Padé approximants. <inline-formula> <tex-math>$mathrm {textit {Paon}}$ </tex-math></inline-formula>s offer several advantages, such as diversity of non-linearity, since each <inline-formula> <tex-math>$mathrm {textit {Paon}}$ </tex-math></inline-formula> learns a different non-linear function of its inputs, and layer efficiency, since <inline-formula> <tex-math>$mathrm {textit {Paon}}$ </tex-math></inline-formula>s provide stronger non-linearity in much fewer layers compared to piecewise linear approximation. Furthermore, <inline-formula> <tex-math>$mathrm {textit {Paon}}$ </tex-math></inline-formula>s include all previously proposed neuron models as special cases, thus any neuron model in any network can be replaced by <inline-formula> <tex-math>$mathrm {textit {Paon}}$ </tex-math></inline-formula>s. We note that there has been a proposal to employ the Padé approximation as a generalized point-wise activation function, which is fundamentally different from our model. To validate the efficacy of <inline-formula> <tex-math>$mathrm {textit {Paon}}$ </tex-math></inline-formula>s, in our experiments, we replace classic neurons in some well-known neural image super-resolution, compression, and classification models based on the ResNet architecture with <inline-formula> <tex-math>$mathrm {textit {Paon}}$ </tex-math></inline-formula>s. Our comprehensive experimental results and analyses demonstrate that neural models built by <inline-formula> <tex-math>$mathrm {textit {Paon}}$ </tex-math></inline-formula>s provide better or equal performance than their classic counterparts with a smaller number of layers. The PyTorch implementation code for <inline-formula> <tex-math>$mathrm {textit {Paon}}$ </tex-math></inline-formula> is open-sourced at <uri>https://github.com/onur-keles/Paon</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"1508-1520"},"PeriodicalIF":13.7,"publicationDate":"2026-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146089897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Disentangle to Fuse: Toward Content Preservation and Cross-Modality Consistency for Multi-Modality Image Fusion
Pub Date: 2026-01-30; DOI: 10.1109/TIP.2026.3657183
Xinran Qin;Yuning Cui;Shangquan Sun;Ruoyu Chen;Wenqi Ren;Alois Knoll;Xiaochun Cao
Multi-modal image fusion (MMIF) aims to integrate complementary information from heterogeneous sensor modalities. However, substantial cross-modality discrepancies hinder joint scene representation and lead to semantic degradation in the fused output. To address this limitation, we propose C2MFuse, a novel framework designed to preserve content while ensuring cross-modality consistency. To the best of our knowledge, this is the first MMIF approach to explicitly disentangle style and content representations across modalities for image fusion. C2MFuse introduces a content-preserving style normalization mechanism that suppresses modality-specific variations while maintaining the underlying scene structure. The normalized features are then progressively aggregated to enhance fine-grained details and improve content completeness. In light of the lack of ground truth and the inherent ambiguity of the fused distribution, we further align the fused representation with a well-defined source modality, thereby enhancing semantic consistency and reducing distributional uncertainty. Additionally, we introduce an adaptive consistency loss with learnable transformation, which provides dynamic, modality-aware supervision by enforcing global consistency across heterogeneous inputs. Extensive experiments on five datasets across three representative MMIF tasks demonstrate that C2MFuse achieves efficient and high-quality fusion, surpasses existing methods, and generalizes effectively to downstream visual applications.
{"title":"Disentangle to Fuse: Toward Content Preservation and Cross-Modality Consistency for Multi-Modality Image Fusion","authors":"Xinran Qin;Yuning Cui;Shangquan Sun;Ruoyu Chen;Wenqi Ren;Alois Knoll;Xiaochun Cao","doi":"10.1109/TIP.2026.3657183","DOIUrl":"10.1109/TIP.2026.3657183","url":null,"abstract":"Multi-modal image fusion (MMIF) aims to integrate complementary information from heterogeneous sensor modalities. However, substantial cross-modality discrepancies hinder joint scene representation and lead to semantic degradation in the fused output. To address this limitation, we propose C2MFuse, a novel framework designed to preserve content while ensuring cross-modality consistency. To the best of our knowledge, this is the first MMIF approach to explicitly disentangle style and content representations across modalities for image fusion. C2MFuse introduces a content-preserving style normalization mechanism that suppresses modality-specific variations while maintaining the underlying scene structure. The normalized features are then progressively aggregated to enhance fine-grained details and improve content completeness. In light of the lack of ground truth and the inherent ambiguity of the fused distribution, we further align the fused representation with a well-defined source modality, thereby enhancing semantic consistency and reducing distributional uncertainty. Additionally, we introduce an adaptive consistency loss with learnable transformation, which provides dynamic, modality-aware supervision by enforcing global consistency across heterogeneous inputs. Extensive experiments on five datasets across three representative MMIF tasks demonstrate that C2MFuse achieves efficient and high-quality fusion, surpasses existing methods, and generalizes effectively to downstream visual applications.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"1756-1770"},"PeriodicalIF":13.7,"publicationDate":"2026-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146088997","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
IHDCP: Single Image Dehazing Using Inverted Haze Density Correction Prior
Pub Date: 2026-01-29; DOI: 10.1109/TIP.2026.3657636
Yun Liu;Tao Li;Chunping Tan;Wenqi Ren;Cosmin Ancuti;Weisi Lin
Image dehazing, a crucial task in low-level vision, supports numerous practical applications, such as autonomous driving, remote sensing, and surveillance. This paper proposes IHDCP, a novel Inverted Haze Density Correction Prior for efficient single image dehazing. It is observed that the medium transmission can be effectively modeled from the inverted haze density map using correction functions with various gamma coefficients. Based on this observation, a pixel-wise gamma correction coefficient is introduced to formulate the transmission as a function of the inverted haze density map. To estimate the transmission, IHDCP is first incorporated into the classic atmospheric scattering model (ASM), leading to a transcendental equation that is subsequently simplified to a quadratic form with a single unknown parameter using the Taylor expansion. Then, boundary constraints are designed to estimate this model parameter, and the gamma correction coefficient map is derived via the Vieta theorem. Finally, the haze-free result is recovered through ASM inversion. Experimental results on diverse synthetic and real-world datasets verify that our algorithm not only provides visually appealing dehazing performance with high computational efficiency, but also outperforms several state-of-the-art dehazing approaches in both subjective and objective evaluations. Moreover, our IHDCP generalizes well to various types of degraded scenes. Our code is available at https://github.com/TaoLi-TL/IHDCP.
{"title":"IHDCP: Single Image Dehazing Using Inverted Haze Density Correction Prior","authors":"Yun Liu;Tao Li;Chunping Tan;Wenqi Ren;Cosmin Ancuti;Weisi Lin","doi":"10.1109/TIP.2026.3657636","DOIUrl":"10.1109/TIP.2026.3657636","url":null,"abstract":"Image dehazing, a crucial task in low-level vision, supports numerous practical applications, such as autonomous driving, remote sensing, and surveillance. This paper proposes IHDCP, a novel Inverted Haze Density Correction Prior for efficient single image dehazing. It is observed that the medium transmission can be effectively modeled from the inverted haze density map using correction functions with various gamma coefficients. Based on this observation, a pixel-wise gamma correction coefficient is introduced to formulate the transmission as a function of the inverted haze density map. To estimate the transmission, IHDCP is first incorporated into the classic atmospheric scattering model (ASM), leading to a transcendental equation that is subsequently simplified to a quadratic form with a single unknown parameter using the Taylor expansion. Then, boundary constraints are designed to estimate this model parameter, and the gamma correction coefficient map is derived via the Vieta theorem. Finally, the haze-free result is recovered through ASM inversion. Experimental results on diverse synthetic and real-world datasets verify that our algorithm not only provides visually appealing dehazing performance with high computational efficiency, but also outperforms several state-of-the-art dehazing approaches in both subjective and objective evaluations. Moreover, our IHDCP generalizes well to various types of degraded scenes. Our code is available at <uri>https://github.com/TaoLi-TL/IHDCP</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"1448-1461"},"PeriodicalIF":13.7,"publicationDate":"2026-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146073159","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}