Dark-EvGS: Event Camera as an Eye for Radiance Field in the Dark
Jingqian Wu, Peiqi Duan, Zongqiang Wang, Changwei Wang, Boxin Shi, Edmund Y. Lam
Pub Date: 2026-03-20 | DOI: 10.1109/TIP.2026.3674360
In low-light environments, conventional cameras often struggle to capture clear multi-view images of objects due to dynamic range limitations and motion blur caused by long exposure. Event cameras, with their high dynamic range and high-speed properties, have the potential to mitigate these issues. Additionally, 3D Gaussian Splatting (GS) enables radiance field reconstruction, facilitating bright frame synthesis from multiple viewpoints in low-light conditions. However, naively applying an event-assisted 3D GS approach still faces challenges: in low light, events are noisy, frames are of poor quality, and the color tone may be inconsistent. To address these issues, we propose Dark-EvGS, the first event-assisted 3D GS framework that enables the reconstruction of bright frames from arbitrary viewpoints along the camera trajectory. Triplet-level supervision is proposed to capture holistic knowledge, granular details, and sharp scene rendering. A color tone matching block is proposed to ensure the color consistency of the rendered frames. Furthermore, we introduce the first real-captured dataset for the event-guided bright frame synthesis task via 3D GS-based radiance field reconstruction. Experiments demonstrate that our method achieves better results than existing methods, enabling robust radiance field reconstruction under challenging low-light conditions. The code and sample data are included in the supplementary material.
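The color tone matching block in Dark-EvGS is learned as part of the framework; purely as a point of reference, the minimal sketch below shows what tone matching has to accomplish, using non-learned channel-wise statistics alignment (Reinhard-style). The function name and toy usage are illustrative, not the paper's implementation.

```python
import numpy as np

def match_color_tone(rendered: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Align per-channel mean/std of `rendered` to `reference`.

    Both inputs are float arrays of shape (H, W, 3) in [0, 1]. This classic
    statistics-matching step only illustrates the goal of color tone matching;
    Dark-EvGS uses a learned block rather than this closed form.
    """
    out = rendered.copy()
    for c in range(3):
        r_mu, r_sigma = rendered[..., c].mean(), rendered[..., c].std() + 1e-6
        t_mu, t_sigma = reference[..., c].mean(), reference[..., c].std() + 1e-6
        out[..., c] = (rendered[..., c] - r_mu) / r_sigma * t_sigma + t_mu
    return np.clip(out, 0.0, 1.0)

# Toy usage: pull a dim rendering toward the tone of a brighter reference view.
rng = np.random.default_rng(0)
dim_render = rng.uniform(0.0, 0.3, size=(64, 64, 3))
bright_ref = rng.uniform(0.4, 0.9, size=(64, 64, 3))
corrected = match_color_tone(dim_render, bright_ref)
```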
{"title":"Dark-EvGS: Event Camera as an Eye for Radiance Field in the Dark.","authors":"Jingqian Wu, Peiqi Duan, Zongqiang Wang, Changwei Wang, Boxin Shi, Edmund Y Lam","doi":"10.1109/TIP.2026.3674360","DOIUrl":"https://doi.org/10.1109/TIP.2026.3674360","url":null,"abstract":"<p><p>In low-light environments, conventional cameras often struggle to capture clear multi-view images of objects due to dynamic range limitations and motion blur caused by long exposure. Event cameras, with their high-dynamic range and high-speed properties, have the potential to mitigate these issues. Additionally, 3D Gaussian Splatting (GS) enables radiance field reconstruction, facilitating bright frame synthesis from multiple viewpoints in low-light conditions. However, naively using an event-assisted 3D GS approach still faced challenges because, in low lights, events are noisy, frames lack quality, and the color tone may be inconsistent. To address these issues, we propose Dark-EvGS, the first event-assisted 3D GS framework that enables the reconstruction of bright frames from arbitrary viewpoints along the camera trajectory. Triplet-level supervision is proposed to gain holistic knowledge, granular details, and sharp scene rendering. The color tone matching block is proposed to guarantee the color consistency of the rendered frames. Furthermore, we introduce the first real-captured dataset for the event-guided bright frame synthesis task via 3D GS-based radiance field reconstruction. Experiments demonstrate that our method achieves better results than existing methods, conquering radiance field reconstruction under challenging low-light conditions. The code and sample data are included in the supplementary material.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":""},"PeriodicalIF":13.7,"publicationDate":"2026-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147492384","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Long-Tailed and Inter-Class Homogeneity Matters in Multi-Class Weakly Supervised Tissue Segmentation of Histopathology Images
Siyang Feng, Xipeng Pan, Huadeng Wang, Zhenbing Liu, Weidong Zhang, Rushi Lan
Pub Date: 2026-03-12 | DOI: 10.1109/TIP.2026.3671622 | IEEE Transactions on Image Processing, vol. 35, pp. 2513-2528
Using image-level weakly supervised semantic segmentation (WSSS) techniques to segment tissue regions in giga-pixel histopathological whole slide images (WSIs) has garnered widespread attention, as it can substantially reduce the annotation workload for pathologists. Most recent studies rely on class activation mapping (CAM) to generate pseudo masks, which are then used to train a segmentation model in a fully supervised manner. However, accurately segmenting non-predominant tissue categories remains challenging due to long-tailed distributions and inter-class homogeneity. To address these issues, we propose three designs: 1) Diffusion-based Data Generation, which synthesizes new images of tail classes to expand the data distribution; 2) Feature Recalibration, which reassigns the logits in CAM to narrow the feature-level prediction gap between predominant and non-predominant classes; 3) Grade-skip Learning, which corrects the under-fitting tendency of hard samples during the segmentation phase. Moreover, we design a powerful pipeline, LoHo, for histopathology tissue segmentation. Extensive experiments demonstrate that our method not only achieves new state-of-the-art performance but also significantly improves the segmentation of tail classes. In addition, our designs are plug-and-play, making them easy to integrate into many mainstream WSSS frameworks.
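The paper's specific components (Feature Recalibration, Grade-skip Learning) are not reproduced here; as background, the sketch below shows the standard CAM-to-pseudo-mask step that such WSSS pipelines start from, under assumed feature and classifier shapes.

```python
import numpy as np

def class_activation_map(features: np.ndarray, fc_weights: np.ndarray, cls: int) -> np.ndarray:
    """Standard CAM: classifier-weighted sum of final conv feature maps.

    features   : (C, H, W) activations from the last conv layer
    fc_weights : (num_classes, C) weights of the global-pooling classifier
    Returns an (H, W) activation map normalized to [0, 1].
    """
    cam = np.tensordot(fc_weights[cls], features, axes=([0], [0]))  # (H, W)
    cam = np.maximum(cam, 0)
    return cam / (cam.max() + 1e-8)

def pseudo_mask(features, fc_weights, num_classes, threshold=0.3):
    """Per-pixel argmax over class CAMs; low-confidence pixels get ignore label 255."""
    cams = np.stack([class_activation_map(features, fc_weights, c) for c in range(num_classes)])
    mask = cams.argmax(axis=0).astype(np.uint8)
    mask[cams.max(axis=0) < threshold] = 255
    return mask

# Toy usage with random activations (illustrative shapes only).
rng = np.random.default_rng(0)
feats = rng.random((256, 32, 32))
w = rng.random((4, 256))
mask = pseudo_mask(feats, w, num_classes=4)
```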
{"title":"Long-Tailed and Inter-Class Homogeneity Matters in Multi-Class Weakly Supervised Tissue Segmentation of Histopathology Images","authors":"Siyang Feng;Xipeng Pan;Huadeng Wang;Zhenbing Liu;Weidong Zhang;Rushi Lan","doi":"10.1109/TIP.2026.3671622","DOIUrl":"10.1109/TIP.2026.3671622","url":null,"abstract":"Using image-level weakly supervised semantic segmentation (WSSS) techniques to segment tissue regions in giga-pixel histopathological whole slide images (WSI) has garnered widespread attention, as it can reduce many annotation workloads for pathologists. Most recent studies are based on class activation mapping (CAM) to generate pseudo masks, which are then used to train segmentation model in a fully supervised manner. However, it is still a challenge to accurately segment non-predominant tissue categories due to the existence of long-tailed and inter-class homogeneity matters. For these matters, we propose three designs to solve them: 1) Diffusion-based Data Generation to synthesis new images of tail class to expand data distribution; 2) Feature Recalibration to reassign the logits in CAM to narrow the feature-level prediction gap between predominant and non-predominant classes; 3) Grade-skip Learning to correct the under-fitting tendency of hard samples during the segmentation phase. Moreover, we also design a powerful pipeline LoHo for histopathology tissue segmentation. Extensive experiments demonstrate that our method not only achieves new state-of-the-art performances but also significantly improves segmentation of tail classes. In addition, our methods are plug-and-play, making it easily integrable into many mainstream WSSS frameworks.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2513-2528"},"PeriodicalIF":13.7,"publicationDate":"2026-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147439808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DiffLLFace: Learning Alternate Illumination-Diffusion Adaptation for Low-Light Face Super-Resolution and Beyond
Runmin Cong, Kaisheng Pang, Feng Li, Hua Li, Huihui Bai, Sam Kwong, Wei Zhang
Pub Date: 2026-03-12 | DOI: 10.1109/TIP.2026.3671638 | IEEE Transactions on Image Processing, vol. 35, pp. 2499-2512
Facial image acquisition under constrained illumination and with limited-resolution imaging devices often results in coupled photometric and geometric degradations, manifesting as low-light and low-resolution (LLR) conditions. Prevailing research predominantly follows fragmented optimization paradigms that address low-light image enhancement (LLIE) and face super-resolution (FSR) as isolated tasks. This approach overlooks the compound nature of the degradations, thereby significantly limiting their applicability in practical scenarios. To bridge this gap, we present DiffLLFace, a unified framework that harnesses diffusive generative capabilities with illumination-aware trajectories to achieve robust FSR from LLR observations. The core of our method lies in its alternate illumination-diffusion adaptation, which operates throughout the generation process. This mechanism not only captures degradation patterns in both brightness and structure to harmonize latent representations but also dynamically calibrates the illumination prior with the generative knowledge inherent to diffusion models. As such, DiffLLFace attains precise control over conditional adaptation and illumination rectification. We further devise a simple yet effective non-parametric Fourier enhancement strategy, which provides structural appearance clues that work in concert with the alternate adaptation to ensure texture and color consistency. Extensive experiments demonstrate the superiority of DiffLLFace over existing methods and remarkable generalizability on complex natural scenes. Code is available at https://github.com/KaishengPang/DiffLLFace
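The abstract does not specify the form of the non-parametric Fourier enhancement; as a hedged illustration of why Fourier-domain cues help under low light (phase largely carries spatial structure, amplitude carries illumination and contrast statistics), the sketch below recombines the phase of a dark image with the amplitude of a reference. This is the classic amplitude/phase decomposition demo, not the strategy used in DiffLLFace.

```python
import numpy as np

def swap_fourier_amplitude(low_light: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Combine the Fourier amplitude of `reference` with the phase of `low_light`.

    Both inputs: (H, W, 3) float images in [0, 1] with identical shapes.
    The phase keeps the spatial structure of the dark image while the amplitude
    imports the brightness/contrast statistics of the reference. Shown only to
    illustrate the kind of structural/illumination cue a Fourier-domain module
    can expose; it is not DiffLLFace's enhancement strategy.
    """
    out = np.empty_like(low_light)
    for c in range(low_light.shape[-1]):
        amp_ref = np.abs(np.fft.fft2(reference[..., c]))
        phase_dark = np.angle(np.fft.fft2(low_light[..., c]))
        out[..., c] = np.real(np.fft.ifft2(amp_ref * np.exp(1j * phase_dark)))
    return np.clip(out, 0.0, 1.0)
```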
{"title":"DiffLLFace: Learning Alternate Illumination-Diffusion Adaptation for Low-Light Face Super-Resolution and Beyond","authors":"Runmin Cong;Kaisheng Pang;Feng Li;Hua Li;Huihui Bai;Sam Kwong;Wei Zhang","doi":"10.1109/TIP.2026.3671638","DOIUrl":"10.1109/TIP.2026.3671638","url":null,"abstract":"Facial image acquisition under constrained illumination and with limited-resolution imaging devices often results in coupled photometric and geometric degradations, manifesting as low-light and low-resolution (LLR) conditions. Prevailing research predominantly follows fragmented optimization paradigms that address low-light image enhancement (LLIE) and face super-resolution (FSR) as isolated tasks. This approach overlooks the compound nature of the degradations, thereby significantly limiting their applicability in practical scenarios. To bridge this gap, we present DiffLLFace, a unified framework that harnesses diffusive generative capabilities with illumination-aware trajectories to achieve robust FSR from LLR observations. The core of our method lies in its alternate illumination-diffusion adaptation, which operates throughout the generation process. This mechanism not only captures degradation patterns in both brightness and structure to harmonize latent representations but also dynamically calibrates the illumination prior with the generative knowledge inherent to diffusion models. As such, DiffLLFace attains precise control over conditional adaptation and illumination rectification. We further devise a simple yet effective non-parametric Fourier enhancement strategy, which provides structural appearance clues that work in concert with the alternate adaptation to ensure texture and color consistency. Extensive experiments demonstrate the superiority of DiffLLFace over existing methods and remarkable generalizability on complex natural scenes. Code is available at <uri>https://github.com/KaishengPang/DiffLLFace</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2499-2512"},"PeriodicalIF":13.7,"publicationDate":"2026-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147439809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
JDPNet: A Network Based on Joint Degradation Processing for Underwater Image Enhancement
Tao Ye, Hongbin Ren, Chongbing Zhang, Haoran Chen, Xiaosong Li
Pub Date: 2026-03-12 | DOI: 10.1109/TIP.2025.3641833 | IEEE Transactions on Image Processing, vol. 35, pp. 2423-2437
Given the complexity of underwater environments and the variability of water as a medium, underwater images are inevitably subject to various types of degradation. These degradations exhibit nonlinear coupling rather than simple superposition, which makes processing them effectively particularly challenging. Most existing methods focus on designing specific branches, modules, or strategies for specific degradations, with little attention paid to the potential information embedded in their coupling. Consequently, they struggle to effectively capture and process the nonlinear interactions of multiple degradations from a bottom-up perspective. To address this issue, we propose JDPNet, a joint degradation processing network that mines and unifies the potential information inherent in coupled degradations within a unified framework. Specifically, we introduce a joint feature-mining module, along with a probabilistic bootstrap distribution strategy, to facilitate effective mining and unified adjustment of coupled degradation features. Furthermore, to balance color, clarity, and contrast, we design a novel AquaBalanceLoss to guide the network in learning from multiple coupled degradation losses. Experiments on six publicly available underwater datasets, as well as two new datasets constructed in this study, show that JDPNet delivers state-of-the-art performance while offering a better tradeoff between performance, parameter size, and computational cost.
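The form of AquaBalanceLoss is not given in the abstract. As one generic way to balance several coupled objectives such as color, clarity, and contrast terms, the sketch below uses a simplified homoscedastic-uncertainty weighting (Kendall et al.); it is a stand-in illustration, not the paper's loss, and all names and values are hypothetical.

```python
import numpy as np

def uncertainty_weighted_total(losses: dict, log_vars: dict) -> float:
    """Combine per-degradation losses with homoscedastic weights.

    total = sum_i exp(-s_i) * L_i + s_i, where s_i = log sigma_i^2.
    `losses` might hold e.g. {"color": ..., "clarity": ..., "contrast": ...};
    in a real network the s_i would be trainable parameters. This generic
    multi-task weighting is shown only as one way to balance coupled
    objectives; it is not the AquaBalanceLoss defined in the paper.
    """
    return float(sum(np.exp(-log_vars[k]) * losses[k] + log_vars[k] for k in losses))

# Toy usage with made-up loss values and weights.
total = uncertainty_weighted_total(
    {"color": 0.8, "clarity": 1.2, "contrast": 0.5},
    {"color": 0.0, "clarity": 0.3, "contrast": -0.2},
)
```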
{"title":"JDPNet: A Network Based on Joint Degradation Processing for Underwater Image Enhancement","authors":"Tao Ye;Hongbin Ren;Chongbing Zhang;Haoran Chen;Xiaosong Li","doi":"10.1109/TIP.2025.3641833","DOIUrl":"10.1109/TIP.2025.3641833","url":null,"abstract":"Given the complexity of underwater environments and the variability of water as a medium, underwater images are inevitably subject to various types of degradation. The degradations present nonlinear coupling rather than simple superposition, which renders the effective processing of such coupled degradations particularly challenging. Most existing methods focus on designing specific branches, modules, or strategies for specific degradations, with little attention paid to the potential information embedded in their coupling. Consequently, they struggle to effectively capture and process the nonlinear interactions of multiple degradations from a bottom-up perspective. To address this issue, we propose JDPNet, a joint degradation processing network, that mines and unifies the potential information inherent in coupled degradations within a unified framework. Specifically, we introduce a joint feature-mining module, along with a probabilistic bootstrap distribution strategy, to facilitate effective mining and unified adjustment of coupled degradation features. Furthermore, to balance color, clarity, and contrast, we design a novel AquaBalanceLoss to guide the network in learning from multiple coupled degradation losses. Experiments on six publicly available underwater datasets, as well as two new datasets constructed in this study, show that JDPNet exhibits state-of-the-art performance while offering a better tradeoff between performance, parameter size, and computational cost.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2423-2437"},"PeriodicalIF":13.7,"publicationDate":"2026-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147439783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nonlinear Transformed Low-Rank Quaternion Tensor Total Variation for Multidimensional Color Image Completion
Liqiao Yang, Yexun Hu, Tai-Xiang Jiang, Yimin Wei, Guisong Liu, Michael K. Ng
Pub Date: 2026-03-05 | DOI: 10.1109/TIP.2026.3666728 | IEEE Transactions on Image Processing, vol. 35, pp. 2470-2483
Completing multidimensional color images is a fundamental challenge in image processing and computer vision. However, some tensor-based methods often treat RGB channels as independent modes, thereby neglecting their intrinsic correlations. To address this limitation, we represent RGB values as pure quaternions and organize them into a quaternion tensor for holistic modeling that preserves chromatic relationships. To better capture the nonlinear characteristics inherent in visual data and to improve the compactness of low-rank representations, we propose a nonlinear transformation within the quaternion domain. This design enables more expressive modeling compared to conventional linear approaches. In addition, we introduce two novel regularization terms that jointly encode global low-rankness and local smoothness, with the nonlinear transformation further enhancing the exploitation of structural priors. The overall model is optimized via a nonlinear alternating direction method of multipliers (ADMM), with theoretical guarantees of convergence. Extensive experiments on several datasets demonstrate that the proposed method significantly outperforms state-of-the-art low-rank tensor and quaternion tensor recovery techniques in multidimensional color image completion tasks.
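A minimal sketch of the standard pure-quaternion encoding that quaternion tensor methods build on: each RGB pixel becomes q = 0 + R i + G j + B k, so the three color channels are handled as one algebraic object rather than independent modes. The helpers below use plain NumPy (no quaternion library) and illustrate only the representation and its arithmetic, not the completion model itself.

```python
import numpy as np

def rgb_to_pure_quaternion(img: np.ndarray) -> np.ndarray:
    """Encode an (H, W, 3) RGB image as an (H, W, 4) pure-quaternion array.

    Each pixel becomes q = 0 + R*i + G*j + B*k (real part zero). Stacks of
    such quaternion matrices across frames/views form the quaternion tensors
    that low-rank completion models operate on.
    """
    h, w, _ = img.shape
    q = np.zeros((h, w, 4), dtype=img.dtype)
    q[..., 1:] = img  # component order: (real, i, j, k)
    return q

def quaternion_multiply(p: np.ndarray, q: np.ndarray) -> np.ndarray:
    """Hamilton product of quaternion arrays with components (w, x, y, z)."""
    pw, px, py, pz = (p[..., i] for i in range(4))
    qw, qx, qy, qz = (q[..., i] for i in range(4))
    return np.stack([
        pw * qw - px * qx - py * qy - pz * qz,
        pw * qx + px * qw + py * qz - pz * qy,
        pw * qy - px * qz + py * qw + pz * qx,
        pw * qz + px * qy - py * qx + pz * qw,
    ], axis=-1)
```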
{"title":"Nonlinear Transformed Low-Rank Quaternion Tensor Total Variation for Multidimensional Color Image Completion","authors":"Liqiao Yang;Yexun Hu;Tai-Xiang Jiang;Yimin Wei;Guisong Liu;Michael K. Ng","doi":"10.1109/TIP.2026.3666728","DOIUrl":"10.1109/TIP.2026.3666728","url":null,"abstract":"Completing multidimensional color images is a fundamental challenge in image processing and computer vision. However, some tensor-based methods often treat RGB channels as independent modes, thereby neglecting their intrinsic correlations. To address this limitation, we represent RGB values as pure quaternions and organize them into a quaternion tensor for holistic modeling that preserves chromatic relationships. To better capture the nonlinear characteristics inherent in visual data and to improve the compactness of low-rank representations, we propose a nonlinear transformation within the quaternion domain. This design enables more expressive modeling compared to conventional linear approaches. In addition, we introduce two novel regularization terms that jointly encode global low-rankness and local smoothness, with the nonlinear transformation further enhancing the exploitation of structural priors. The overall model is optimized via a nonlinear alternating direction method of multipliers (ADMM), with theoretical guarantees of convergence. Extensive experiments on several datasets demonstrate that the proposed method significantly outperforms state-of-the-art low-rank tensor and quaternion tensor recovery techniques in multidimensional color image completion tasks.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2470-2483"},"PeriodicalIF":13.7,"publicationDate":"2026-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147359258","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Collaborated With Hallucination: Enhancing Egocentric Grounded Question Answering via Error Demonstrations
Shenshen Li, Xing Xu, Fumin Shen, Zhe Sun, Andrzej Cichocki, Heng Tao Shen
Pub Date: 2026-03-04 | DOI: 10.1109/TIP.2026.3666732 | IEEE Transactions on Image Processing, vol. 35, pp. 2438-2453
Grounded question answering in egocentric videos (Ego-GQA) aims to identify the relevant temporal window and generate a natural-language response to a given textual question. Compared with third-person videos, egocentric video understanding requires more advanced human-centric thinking capability. However, existing Ego-GQA approaches often overlook the inherent limitations of dynamic egocentric context understanding, treating first-person and third-person perspectives equally. This oversight leads to hallucinations and a lack of proper egocentric reasoning in first-person video understanding. To address this issue, we propose a novel Collaborated with Hallucination (CoHa) framework for Ego-GQA, which quantifies the hallucinations generated by an Ego-GQA model and further leverages them as error demonstrations to constrain the model’s reasoning process, encouraging it to ground predictions in egocentric visual cues instead of relying on biased pretraining priors. Specifically, we first employ Subjective Logic to quantify the degree of uncertainty in unreliable answers. We then generate diffusion-based noisy visual inputs to amplify the hallucinations as error demonstrations, which are used to append appropriate constraints to the model according to the uncertainty. These constraints effectively steer predictions away from the unreliable semantics induced by inherent drawbacks in egocentric thinking. Additionally, we incorporate an interactive refinement module to facilitate the model to explore more fine-grained cues observed from the first-person view. Extensive experiments on two widely used benchmarks demonstrate that our CoHa method outperforms recent state-of-the-art methods. Our code is available at https://github.com/Mrshenshen/CoHa
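The abstract says Subjective Logic quantifies answer uncertainty. In the usual evidential formulation, non-negative class evidence e_k defines a Dirichlet with alpha_k = e_k + 1, belief b_k = e_k / S, and uncertainty u = K / S with S = sum_k alpha_k. The sketch below shows that mapping; treating model outputs as evidence in this way is an assumption here, not the paper's exact recipe.

```python
import numpy as np

def subjective_logic_opinion(evidence):
    """Convert non-negative class evidence into belief masses and uncertainty.

    Standard Subjective Logic / evidential formulation:
        alpha_k = e_k + 1,  S = sum_k alpha_k,
        belief_k = e_k / S,  uncertainty = K / S.
    High uncertainty flags unreliable answers, which CoHa treats as candidates
    for error demonstrations (the evidence source is assumed, not specified).
    """
    evidence = np.asarray(evidence, dtype=float)
    alpha = evidence + 1.0
    strength = alpha.sum()
    belief = evidence / strength
    uncertainty = evidence.size / strength
    return belief, uncertainty

belief, u = subjective_logic_opinion([12.0, 0.5, 0.3])   # strong evidence -> small u (~0.19)
belief2, u2 = subjective_logic_opinion([0.4, 0.5, 0.3])  # weak evidence -> large u (~0.71)
```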
{"title":"Collaborated With Hallucination: Enhancing Egocentric Grounded Question Answering via Error Demonstrations","authors":"Shenshen Li;Xing Xu;Fumin Shen;Zhe Sun;Andrzej Cichocki;Heng Tao Shen","doi":"10.1109/TIP.2026.3666732","DOIUrl":"10.1109/TIP.2026.3666732","url":null,"abstract":"The grounded question answering in egocentric videos (Ego-GQA) aims to identify the relevant temporal window and generate corresponding responses in natural language given a textual question. Compared with third-person videos, egocentric video understanding requires more advanced human-centric thinking capability. However, existing Ego-GQA approaches often fail to distinguish the inherent limitations of dynamic egocentric context understanding, treating both first-person and third-person perspectives equally. This oversight leads to hallucinations and a lack of proper egocentric reasoning in first-person video understanding. To address this issue, we propose a novel Collaborated with Hallucination (CoHa) framework for the Ego-GQA, which quantifies the hallucinations generated by an Ego-GQA model and further leverages them as error demonstrations to constrain the model’s reasoning process, encouraging it to ground predictions in egocentric visual cues instead of relying on biased pretraining priors. Specifically, we first employ Subjective Logic to quantify the degree of uncertainty in unreliable answers. We then generate diffusion-based noisy visual inputs to amplify the hallucinations as error demonstrations, which are used to append appropriate constraints to the model according to the uncertainty. These constraints effectively steer predictions away from the unreliable semantics induced by inherent drawbacks in egocentric thinking. Additionally, we incorporate an interactive refinement module to facilitate the model to explore more fine-grained cues observed from the first-person view. Extensive experiments on two widely used benchmarks demonstrate that our CoHa method outperforms recent state-of-the-art methods. Our code is available at <uri>https://github.com/Mrshenshen/CoHa</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2438-2453"},"PeriodicalIF":13.7,"publicationDate":"2026-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147350712","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Unified Framework for Backdoor Trigger Segmentation
Dizhan Xue, Shengsheng Qian, Changsheng Xu
Pub Date: 2026-03-02 | DOI: 10.1109/TIP.2026.3666796 | IEEE Transactions on Image Processing, vol. 35, pp. 2364-2379
Backdoor attacks on Deep Neural Networks (DNNs) have recently raised urgent security threats: they can manipulate the behavior of an attacked model by embedding a backdoor trigger into the input. Since triggers can be designed to be stealthy and hard to recognize by the naked eye, segmenting these triggers in backdoor samples becomes a significant challenge. At the same time, finding the triggers embedded by an attacker can be crucial for analyzing the attacks and formulating a defense strategy. Therefore, in this paper, we propose the Backdoor Trigger Segmentation (BTS) task with a comprehensive benchmark consisting of 8 attack methods, 8 unique triggers, and 179 attack settings for image or text data. Moreover, we construct a mathematical system for BTS, abstracting various backdoor triggers into a unified theoretical framework. Based on the theoretical guarantees, we propose a unified Trigger Locator (TriLoc) algorithm to segment various triggers in backdoor samples of both image and text modalities, without prior knowledge of triggers. Extensive experimental results on our benchmark demonstrate the superior performance of our algorithm compared to state-of-the-art methods. Our benchmark and code are available at https://github.com/LivXue/Backdoor-Trigger-Segmentation
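As a minimal illustration of what the BTS task asks a model to predict, the sketch below stamps the simplest kind of trigger (a BadNets-style patch) onto an image and records its ground-truth segmentation mask; the benchmark's eight attacks include far stealthier triggers, which is what makes recovering such masks hard. All names and sizes are illustrative.

```python
import numpy as np

def embed_patch_trigger(img: np.ndarray, patch: np.ndarray, top: int, left: int):
    """Stamp a patch trigger onto an image and return (poisoned_img, gt_mask).

    img   : (H, W, 3) float image
    patch : (h, w, 3) float trigger pattern
    The binary mask marks exactly the trigger pixels: this is the ground truth
    a backdoor-trigger-segmentation model is asked to recover. Blended, warped,
    or frequency-domain triggers make the mask much harder to predict.
    """
    h, w = patch.shape[:2]
    poisoned = img.copy()
    poisoned[top:top + h, left:left + w] = patch
    mask = np.zeros(img.shape[:2], dtype=np.uint8)
    mask[top:top + h, left:left + w] = 1
    return poisoned, mask

rng = np.random.default_rng(0)
clean = rng.random((32, 32, 3))
trigger = np.ones((4, 4, 3))  # plain white square, the classic BadNets patch
poisoned, gt = embed_patch_trigger(clean, trigger, top=26, left=26)
```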
{"title":"A Unified Framework for Backdoor Trigger Segmentation","authors":"Dizhan Xue;Shengsheng Qian;Changsheng Xu","doi":"10.1109/TIP.2026.3666796","DOIUrl":"10.1109/TIP.2026.3666796","url":null,"abstract":"Recently, backdoor attacks on Deep Neural Networks (DNNs) have raised urgent security threats, which can manipulate the behavior of an attacked model by embedding the backdoor trigger into the input. Since triggers can be designed to be stealthy and hard to recognize by the naked eye, segmenting these triggers in backdoor samples becomes a significant challenge. However, finding triggers embedded by the attacker can be crucial for analyzing the attacks and formulating a defense strategy. Therefore, in this paper, we propose the Backdoor Trigger Segmentation (BTS) task with a comprehensive benchmark consisting of 8 attack methods, 8 unique triggers, and 179 attack settings for image or text data. Moreover, we construct a mathematical system for BTS, abstracting various backdoor triggers into a unified theoretical framework. Based on the theoretical guarantees, we propose a unified Trigger Locator (TriLoc) algorithm to segment various triggers in backdoor samples of both image and text modalities, without prior knowledge of triggers. Extensive experimental results on our benchmark demonstrate the superior performance of our algorithm compared to state-of-the-art methods. Our benchmark and code are available at <uri>https://github.com/LivXue/Backdoor-Trigger-Segmentation</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2364-2379"},"PeriodicalIF":13.7,"publicationDate":"2026-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147329218","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zero-Pose-Prior NeRF: Recursive Radiance Field Reconstruction From Unposed and Unordered Images
Xinxin Liu, Qi Zhang, Xue Wang, Guoqing Zhou, Qing Wang
Pub Date: 2026-02-27 | DOI: 10.1109/TIP.2026.3666734 | IEEE Transactions on Image Processing, vol. 35, pp. 2320-2334
The dependence of neural radiance fields (NeRF) on accurate camera poses has emerged as a critical obstacle to their widespread real-world applications. While recent advances have demonstrated the potential for simultaneously addressing camera registration and scene reconstruction, these methods inherently rely on reasonable initialization derived from pose or scene priors and struggle with complex scenes involving large camera motions, particularly in unordered 360-degree scenes. In this work, we propose Zero-Pose-Prior NeRF to recover radiance fields from unposed and unordered image collections without any prior knowledge. Our key insight is to decompose this complex problem into smaller sub-problems, wherein the sub-problems’ camera poses are initially estimated to provide self-bootstrapping priors for the global pose estimation, followed by a recursive registration and reconstruction. To achieve this, we first perform scene partitioning to establish a hierarchical structure that describes registration order from local to global. Thereafter, we devise a conditionally-decoupled positional encoding for NeRFs, which serves as the basic model for camera pose estimation and scene representation. Following this, we develop a recursive registration to recursively estimate the poses of local scenes and register them into a unified global pose space, ultimately enabling the reconstruction of the entire scene. Experiments on real-world scenes show that our approach outperforms the state-of-the-art pose-free methods in terms of accurate camera poses and robust radiance field reconstruction, resulting in high-fidelity view synthesis.
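The conditionally-decoupled positional encoding itself is not specified in the abstract; for reference, the sketch below is the standard NeRF frequency encoding that such variants modify. Zero-Pose-Prior NeRF replaces it with a conditionally-decoupled variant serving as the basic model for camera pose estimation and scene representation; that variant is not reproduced here.

```python
import numpy as np

def positional_encoding(x: np.ndarray, num_freqs: int = 10) -> np.ndarray:
    """Standard NeRF frequency encoding gamma(x).

    x : (..., D) coordinates (e.g. 3D points or view directions).
    Returns (..., 2 * num_freqs * D) features
        [sin(2^0 pi x), cos(2^0 pi x), ..., sin(2^{L-1} pi x), cos(2^{L-1} pi x)].
    """
    feats = []
    for k in range(num_freqs):
        for fn in (np.sin, np.cos):
            feats.append(fn((2.0 ** k) * np.pi * x))
    return np.concatenate(feats, axis=-1)

pts = np.random.default_rng(0).uniform(-1, 1, size=(1024, 3))
encoded = positional_encoding(pts)  # shape (1024, 60)
```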
{"title":"Zero-Pose-Prior NeRF: Recursive Radiance Field Reconstruction From Unposed and Unordered Images","authors":"Xinxin Liu;Qi Zhang;Xue Wang;Guoqing Zhou;Qing Wang","doi":"10.1109/TIP.2026.3666734","DOIUrl":"10.1109/TIP.2026.3666734","url":null,"abstract":"The dependence of neural radiance fields (NeRF) on accurate camera poses has emerged as a critical obstacle to their widespread real-world applications. While recent advances have demonstrated the potential for simultaneously addressing camera registration and scene reconstruction, these methods inherently rely on reasonable initialization derived from pose or scene priors and struggle with complex scenes involving large camera motions, particularly in unordered 360-degree scenes. In this work, we propose Zero-Pose-Prior NeRF to recover radiance fields from unposed and unordered image collections without any prior knowledge. Our key insight is to decompose this complex problem into smaller sub-problems, wherein the sub-problems’ camera poses are initially estimated to provide self-bootstrapping priors for the global pose estimation, followed by a recursive registration and reconstruction. To achieve this, we first perform scene partitioning to establish a hierarchical structure that describes registration order from local to global. Thereafter, we devise a conditionally-decoupled positional encoding for NeRFs, which serves as the basic model for camera pose estimation and scene representation. Following this, we develop a recursive registration to recursively estimate the poses of local scenes and register them into a unified global pose space, ultimately enabling the reconstruction of the entire scene. Experiments on real-world scenes show that our approach outperforms the state-of-the-art pose-free methods in terms of accurate camera poses and robust radiance field reconstruction, resulting in high-fidelity view synthesis.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2320-2334"},"PeriodicalIF":13.7,"publicationDate":"2026-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147319102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ReconX: Reconstruct Any Scene From Sparse Views With Video Diffusion Model
Fangfu Liu, Wenqiang Sun, Hanyang Wang, Yikai Wang, Haowen Sun, Junliang Ye, Jun Zhang, Yueqi Duan
Pub Date: 2026-02-27 | DOI: 10.1109/TIP.2026.3666733 | IEEE Transactions on Image Processing, vol. 35, pp. 2305-2319
Advancements in 3D scene reconstruction have transformed 2D images from the real world into 3D models, producing realistic 3D results from hundreds of input photos. Despite great success in dense-view reconstruction scenarios, rendering a detailed scene from sparse views is still an ill-posed optimization problem, often resulting in artifacts and distortions in unseen areas. In this paper, we propose ReconX, a novel 3D scene reconstruction paradigm that reframes the ambiguous reconstruction problem as a temporal generation task. The key insight is to unleash the strong generative prior of large pre-trained video diffusion models for sparse-view reconstruction. Nevertheless, it is challenging to preserve 3D view consistency when directly generating video frames from pre-trained models. To address this issue, given limited input views, the proposed ReconX first constructs a global point cloud and encodes it into a contextual space as the 3D structure condition. Guided by the condition, the video diffusion model then synthesizes video frames that are detail-preserved and exhibit a high degree of 3D consistency, ensuring the coherence of the scene from various perspectives. Finally, we recover the 3D scene from the generated video through a confidence-aware 3D Gaussian Splatting optimization scheme. Extensive experiments on various real-world datasets show the superiority of ReconX over state-of-the-art methods in terms of quality and generalizability.
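The confidence-aware 3D Gaussian Splatting optimization is described only at a high level; the sketch below illustrates the general weighting pattern, assuming a per-pixel confidence map that down-weights unreliable pixels of generated frames in the photometric loss. How ReconX actually derives its confidence is not reproduced here, and the names are assumptions.

```python
import numpy as np

def confidence_weighted_l1(rendered: np.ndarray, generated: np.ndarray,
                           confidence: np.ndarray) -> float:
    """Per-pixel confidence-weighted L1 photometric loss.

    rendered, generated : (H, W, 3) float images (3DGS rendering vs. diffusion frame)
    confidence          : (H, W) weights in [0, 1]; unreliable generated pixels
                          contribute less to the 3DGS update. This only sketches
                          the weighting pattern such an optimization can use.
    """
    per_pixel = np.abs(rendered - generated).mean(axis=-1)  # (H, W)
    return float((confidence * per_pixel).sum() / (confidence.sum() + 1e-8))
```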
{"title":"ReconX: Reconstruct Any Scene From Sparse Views With Video Diffusion Model","authors":"Fangfu Liu;Wenqiang Sun;Hanyang Wang;Yikai Wang;Haowen Sun;Junliang Ye;Jun Zhang;Yueqi Duan","doi":"10.1109/TIP.2026.3666733","DOIUrl":"10.1109/TIP.2026.3666733","url":null,"abstract":"Advancements in 3D scene reconstruction have transformed 2D images from the real world into 3D models, producing realistic 3D results from hundreds of input photos. Despite great success in dense-view reconstruction scenarios, rendering a detailed scene from sparse views is still an ill-posed optimization problem, often resulting in artifacts and distortions in unseen areas. In this paper, we propose ReconX, a novel 3D scene reconstruction paradigm that reframes the ambiguous reconstruction problem as a temporal generation task. The key insight is to unleash the strong generative prior of large pre-trained video diffusion models for sparse-view reconstruction. Nevertheless, it is challenging to preserve 3D view consistency when directly generating video frames from pre-trained models. To address this issue, given limited input views, the proposed ReconX first constructs a global point cloud and encodes it into a contextual space as the 3D structure condition. Guided by the condition, the video diffusion model then synthesizes video frames that are detail-preserved and exhibit a high degree of 3D consistency, ensuring the coherence of the scene from various perspectives. Finally, we recover the 3D scene from the generated video through a confidence-aware 3D Gaussian Splatting optimization scheme. Extensive experiments on various real-world datasets show the superiority of ReconX over state-of-the-art methods in terms of quality and generalizability.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2305-2319"},"PeriodicalIF":13.7,"publicationDate":"2026-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147319129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Multi-Level Self-Distillation-Based Unified Tracker for Efficient RGB-T Tracking
Mohamed Awad, Ahmed Elliethy, M. Omair Ahmad, M. N. S. Swamy
Pub Date: 2026-02-27 | DOI: 10.1109/TIP.2026.3666737 | IEEE Transactions on Image Processing, vol. 35, pp. 2407-2422
RGB-Thermal (RGB-T) tracking enhances visual tracking robustness by combining RGB and thermal infrared (TIR) modalities, addressing limitations of RGB-only trackers under challenging conditions such as low light and appearance variations. However, most existing RGB-T trackers rely on complex fusion modules or modality-specific architectures, sacrificing efficiency for performance. In this paper, we propose a novel Multi-level Self-Distillation (MSD) framework that adapts a one-stream RGB tracker to the RGB-T setting without modifying the network architecture or adding any extra parameters. RGB and TIR inputs are jointly processed through a shared backbone, and training is guided by a combination of self-supervised and supervised objectives to enhance cross-modal feature representation. The self-supervised component includes a contrastive loss that aligns semantically consistent regions across template-search pairs, as well as a modality-gap alignment loss that reduces discrepancies between RGB and TIR features. These internal signals complement task-driven supervision, including an intermediate focal loss that strengthens early localization by enhancing shallow and mid-level features, modality-specific losses that preserve distinctive cues under partial modality degradation, and a fused tracking loss that drives final bounding box prediction. Comprehensive evaluations on LasHeR, RGBT234, and GTOT benchmarks demonstrate that MSD achieves state-of-the-art tracking accuracy while maintaining the computational efficiency of the original RGB tracker. Our work establishes a new paradigm in multi-modal tracking by demonstrating that optimized training strategies can outperform complex architectural modifications, offering significant practical advantages for real-world deployment.
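The exact self-supervised losses are not reproduced here; under assumed pooled-feature shapes, the sketch below pairs a standard InfoNCE contrastive term (aligning matching region features) with a simple modality-gap penalty between RGB and TIR feature means, to illustrate the kind of objectives the abstract describes rather than the paper's specific formulation.

```python
import numpy as np

def info_nce(anchors: np.ndarray, positives: np.ndarray, temperature: float = 0.1) -> float:
    """InfoNCE over L2-normalized features; anchors[i] should match positives[i].

    anchors, positives : (N, D) feature matrices (e.g. pooled template/search or
    RGB/TIR region tokens). Mismatched pairs act as negatives within the batch.
    """
    a = anchors / (np.linalg.norm(anchors, axis=1, keepdims=True) + 1e-8)
    p = positives / (np.linalg.norm(positives, axis=1, keepdims=True) + 1e-8)
    logits = a @ p.T / temperature                   # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diag(log_prob).mean())

def modality_gap(rgb_feats: np.ndarray, tir_feats: np.ndarray) -> float:
    """Simple gap penalty: squared distance between the two modality means."""
    return float(np.sum((rgb_feats.mean(axis=0) - tir_feats.mean(axis=0)) ** 2))
```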
{"title":"A Multi-Level Self-Distillation-Based Unified Tracker for Efficient RGB-T Tracking","authors":"Mohamed Awad;Ahmed Elliethy;M. Omair Ahmad;M. N. S. Swamy","doi":"10.1109/TIP.2026.3666737","DOIUrl":"10.1109/TIP.2026.3666737","url":null,"abstract":"RGB-Thermal (RGB-T) tracking enhances visual tracking robustness by combining RGB and thermal infrared (TIR) modalities, addressing limitations of RGB-only trackers under challenging conditions such as low light and appearance variations. However, most existing RGB-T trackers rely on complex fusion modules or modality-specific architectures, sacrificing efficiency for performance. In this paper, we propose a novel Multi-level Self-Distillation (MSD) framework that adapts a one-stream RGB tracker to the RGB-T setting without modifying the network architecture or adding any extra parameters. RGB and TIR inputs are jointly processed through a shared backbone, and training is guided by a combination of self-supervised and supervised objectives to enhance cross-modal feature representation. The self-supervised component includes a contrastive loss that aligns semantically consistent regions across template-search pairs, as well as a modality-gap alignment loss that reduces discrepancies between RGB and TIR features. These internal signals complement task-driven supervision, including an intermediate focal loss that strengthens early localization by enhancing shallow and mid-level features, modality-specific losses that preserve distinctive cues under partial modality degradation, and a fused tracking loss that drives final bounding box prediction. Comprehensive evaluations on LasHeR, RGBT234, and GTOT benchmarks demonstrate that MSD achieves state-of-the-art tracking accuracy while maintaining the computational efficiency of the original RGB tracker. Our work establishes a new paradigm in multi-modal tracking by demonstrating that optimized training strategies can outperform complex architectural modifications, offering significant practical advantages for real-world deployment.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2407-2422"},"PeriodicalIF":13.7,"publicationDate":"2026-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147319116","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}