Deep G-PCC Geometry Preprocessing via Joint Optimization with a Differentiable Codec Surrogate for Enhanced Compression Efficiency
Pub Date: 2026-01-28 | DOI: 10.1109/tip.2026.3655187
Wanhao Ma, Wei Zhang, Shuai Wan, Fuzheng Yang
Geometry-based point cloud compression (G-PCC), an international standard developed by MPEG, provides a generic framework for compressing diverse types of point clouds while ensuring interoperability across applications and devices. However, G-PCC trails recent deep learning-based PCC methods in rate-distortion performance, albeit at much lower computational cost. To enhance the efficiency of G-PCC without sacrificing its interoperability or computational flexibility, we propose the first compression-oriented point cloud voxelization network jointly optimized with a differentiable G-PCC surrogate model. The surrogate model mimics the rate-distortion behavior of the non-differentiable G-PCC codec, enabling end-to-end gradient propagation. The versatile voxelization network adaptively transforms input point clouds via learning-based voxelization and manipulates them through global scaling, fine-grained pruning, and point-level editing to navigate the rate-distortion trade-off. During inference, only the lightweight voxelization network is prepended to the G-PCC encoder, requiring no modifications to the decoder and thus introducing no computational overhead for end users. Extensive experiments demonstrate a 38.84% average BD-rate reduction over G-PCC. By bridging classical codecs with deep learning, this work offers a practical pathway to enhancing legacy compression standards while preserving backward compatibility, making it well suited to real-world deployment.
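For readers unfamiliar with the BD-rate figure quoted above, the sketch below shows the standard Bjøntegaard delta-rate computation used to report such savings. The rate-distortion points are hypothetical and this is not code from the paper.

```python
# Illustrative sketch: Bjontegaard delta-rate (BD-rate), the metric behind figures
# such as the reported 38.84% average BD-rate reduction. Generic textbook version.
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average bitrate difference (%) of the test codec vs. the anchor,
    integrated over the overlapping quality range (negative = bitrate saving)."""
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)
    # Fit third-order polynomials of log-rate as a function of quality (PSNR).
    p_a = np.polyfit(psnr_anchor, lr_a, 3)
    p_t = np.polyfit(psnr_test, lr_t, 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    # Integrate both fits over the common PSNR interval and average the gap.
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_diff) - 1.0) * 100.0

# Hypothetical rate-distortion points (kbps, dB) for an anchor and a test codec.
print(bd_rate([100, 200, 400, 800], [30.0, 33.0, 36.0, 39.0],
              [80, 160, 320, 640], [30.2, 33.1, 36.2, 39.1]))
```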
{"title":"Deep G-PCC Geometry Preprocessing via Joint Optimization with a Differentiable Codec Surrogate for Enhanced Compression Efficiency.","authors":"Wanhao Ma,Wei Zhang,Shuai Wan,Fuzheng Yang","doi":"10.1109/tip.2026.3655187","DOIUrl":"https://doi.org/10.1109/tip.2026.3655187","url":null,"abstract":"Geometry-based point cloud compression (G-PCC), an international standard designed by MPEG, provides a generic framework for compressing diverse types of point clouds while ensuring interoperability across applications and devices. However, G-PCC underperforms compared to recent deep learning-based PCC methods despite its lower computational power consumption. To enhance the efficiency of G-PCC without sacrificing its interoperability or computational flexibility, we propose the first compression-oriented point cloud voxelization network jointly optimized with a differentiable G-PCC surrogate model. The surrogate model mimics the rate-distortion behavior of the non-differentiable G-PCC codec, enabling end-to-end gradient propagation. The versatile voxelization network adaptively transforms input point clouds using learning-based voxelization and effectively manipulates point clouds via global scaling, fine-grained pruning, and point-level editing for rate-distortion trade-off. During inference, only the lightweight voxelization network is prepended to the G-PCC encoder, requiring no modifications to the decoder, thus introducing no computational overhead for end users. Extensive experiments demonstrate a 38.84% average BD-rate reduction over G-PCC. By bridging classical codecs with deep learning, this work offers a practical pathway to enhance legacy compression standards while preserving their backward compatibility, making it ideal for real-world deployment.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"42 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2026-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146069860","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Domain-Complementary Prior with Fine-Grained Feedback for Scene Text Image Super-Resolution
Pub Date: 2026-01-28 | DOI: 10.1109/tip.2026.3657246
Shen Zhang, Yang Li, Pengwen Dai, Xiaozhou Zhou, Guotao Xie
Enhancing the resolution of scene text images is a critical preprocessing step that can substantially improve the accuracy of downstream text recognition on low-quality images. Existing methods primarily rely on auxiliary text features to guide the super-resolution process. However, these features often lack rich low-level information, making them insufficient for faithfully reconstructing both the global structure and fine-grained details of text. Moreover, previous methods often learn suboptimal feature representations from the original low-quality landmark images, which cannot provide precise guidance for super-resolution. In this study, we propose a Fine-Grained Feedback Domain-Complementary Network (FDNet) for scene text image super-resolution. Specifically, we first employ a fine-grained feedback mechanism to selectively refine landmark images, thereby enhancing feature representations. Then, we introduce a novel domain-trace prior interaction generator, which integrates domain-specific traces with a text prior, comprehensively complementing the clear edges and structural coverage of the text. Finally, motivated by the limitations of existing datasets, which often cover limited scene scales and few challenging scenarios, we introduce a new dataset, MDRText. MDRText features multi-scale, diverse imagery and is designed to support challenging text image recognition and super-resolution tasks. Extensive experiments on the MDRText and TextZoom datasets demonstrate that our method achieves superior performance in scene text image super-resolution and further improves the accuracy of subsequent recognition tasks.
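As a rough illustration of the general idea of prior-guided super-resolution described in this abstract, the PyTorch sketch below fuses a low-resolution text crop with a spatial prior map before upsampling. All module names, channel sizes, and shapes are illustrative assumptions and do not reflect the actual FDNet architecture.

```python
# Minimal sketch of text-prior-guided super-resolution (not FDNet itself).
import torch
import torch.nn as nn

class PriorGuidedSR(nn.Module):
    def __init__(self, prior_dim=64, scale=2):
        super().__init__()
        self.img_enc = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
        self.prior_proj = nn.Conv2d(prior_dim, 64, 1)     # project the text prior into image-feature space
        self.fuse = nn.Conv2d(128, 64, 3, padding=1)      # complement image features with the prior
        self.up = nn.Sequential(nn.Conv2d(64, 3 * scale * scale, 3, padding=1),
                                nn.PixelShuffle(scale))   # upsample to the SR output

    def forward(self, lr_img, text_prior):
        f_img = self.img_enc(lr_img)
        f_pri = self.prior_proj(text_prior)
        fused = self.fuse(torch.cat([f_img, f_pri], dim=1))
        return self.up(fused)

# Hypothetical shapes: a 32x128 low-res text crop and a spatial text-prior map.
sr = PriorGuidedSR()(torch.randn(1, 3, 32, 128), torch.randn(1, 64, 32, 128))
print(sr.shape)  # torch.Size([1, 3, 64, 256])
```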
{"title":"Domain-Complementary Prior with Fine-Grained Feedback for Scene Text Image Super-Resolution.","authors":"Shen Zhang,Yang Li,Pengwen Dai,Xiaozhou Zhou,Guotao Xie","doi":"10.1109/tip.2026.3657246","DOIUrl":"https://doi.org/10.1109/tip.2026.3657246","url":null,"abstract":"Enhancing the resolution of scene text images is a critical preprocessing step that can substantially improve the accuracy of downstream text recognition in low-quality images. Existing methods primarily rely on auxiliary text features to guide the super-resolution process. However, these features often lack rich low-level information, making them insufficient for faithfully reconstructing both the global structure and fine-grained details of text. Moreover, previous methods often learn suboptimal feature representations from the original low-quality landmark images, which cannot provide precise guidance for super-resolution. In this study, we propose a Fine-Grained Feedback Domain-Complementary Network (FDNet) for scene text image super-resolution. Specifically, we first employ a fine-grained feedback mechanism to selectively refine landmark images, thereby enhancing feature representations. Then, we introduce a novel domain-trace prior interaction generator, which integrates domain-specific traces with a text prior to comprehensively complement the clear edges and structural coverage of the text. Finally, motivated by the limitations of existing datasets, which often exhibit limited scene scales and insufficient challenging scenarios, we introduce a new dataset, MDRText. The proposed dataset MDRText features multi-scale and diverse characteristics and is designed to support challenging text image recognition and super-resolution tasks. Extensive experiments on the MDRText and TextZoom datasets demonstrate that our method achieves superior performance in scene text image super-resolution and further improves the accuracy of subsequent recognition tasks.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"7 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2026-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146069914","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dissecting RGB-D Learning for Improved Multi-modal Fusion
Pub Date: 2026-01-28 | DOI: 10.1109/tip.2026.3657171
Hao Chen, Haoran Zhou, Yunshu Zhang, Zheng Lin, Yongjian Deng
In the RGB-D vision community, extensive research has focused on designing multi-modal learning strategies and fusion structures. However, the complementary and fusion mechanisms in RGB-D models remain a black box. In this paper, we present an analytical framework and a novel score to dissect RGB-D multi-modal learning. Our approach measures the proposed semantic variance and feature similarity across modalities and levels, and conducts visual and quantitative analyses of multi-modal learning through comprehensive experiments. Specifically, we investigate the consistency and specialty of features across modalities, the evolution rules within each modality, and the collaboration logic used when optimizing an RGB-D model. Our studies reveal and verify several important findings, such as the discrepancy in cross-modal features and the hybrid multi-modal cooperation rule, which emphasizes consistency and specialty simultaneously for complementary inference. We also showcase the versatility of the proposed RGB-D dissection method and introduce a straightforward fusion strategy based on our findings, which delivers significant gains across various tasks and even other multi-modal data.
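The paper's proposed score is not specified in the abstract; as a hedged stand-in, the sketch below computes linear CKA, a standard feature-similarity measure often used for this kind of cross-modal analysis, between hypothetical pooled RGB and depth features.

```python
# Illustrative cross-modal feature-similarity measurement (linear CKA),
# not the paper's proposed score.
import torch

def linear_cka(x, y):
    """Linear CKA between two feature matrices of shape (n_samples, dim)."""
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    num = ((x.T @ y) ** 2).sum()
    den = torch.sqrt(((x.T @ x) ** 2).sum() * ((y.T @ y) ** 2).sum())
    return num / den

# Hypothetical pooled features from the RGB branch and the depth branch at one level.
rgb_feat, depth_feat = torch.randn(256, 512), torch.randn(256, 512)
print(float(linear_cka(rgb_feat, depth_feat)))  # close to 1 = highly similar representations
```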
{"title":"Dissecting RGB-D Learning for Improved Multi-modal Fusion.","authors":"Hao Chen,Haoran Zhou,Yunshu Zhang,Zheng Lin,Yongjian Deng","doi":"10.1109/tip.2026.3657171","DOIUrl":"https://doi.org/10.1109/tip.2026.3657171","url":null,"abstract":"In the RGB-D vision community, extensive research has been focused on designing multi-modal learning strategies and fusion structures. However, the complementary and fusion mechanisms in RGB-D models remain a black box. In this paper, we present an analytical framework and a novel score to dissect the RGB-D vision community. Our approach involves measuring proposed semantic variance and feature similarity across modalities and levels, conducting visual and quantitative analyzes on multi-modal learning through comprehensive experiments. Specifically, we investigate the consistency and specialty of features across modalities, evolution rules within each modality, and the collaboration logic used when optimizing a RGB-D model. Our studies reveal/verify several important findings, such as the discrepancy in cross-modal features and the hybrid multi-modal cooperation rule, which highlights consistency and specialty simultaneously for complementary inference. We also showcase the versatility of the proposed RGB-D dissection method and introduce a straightforward fusion strategy based on our findings, which delivers significant enhancements across various tasks and even other multi-modal data.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"2 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2026-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146069919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
AttriPrompt: Class Attribute-aware Prompt Tuning for Vision-Language Model
Pub Date: 2026-01-28 | DOI: 10.1109/tip.2026.3657216
Yuling Su, Xueliang Liu, Zhen Huang, Yunwei Zhao, Richang Hong, Meng Wang
Prompt tuning has proven to be an effective alternative to fine-tuning for adapting pre-trained vision-language models (VLMs) to downstream tasks. Among existing approaches, class-shared prompts learn a unified prompt shared across all classes, while sample-specific prompts generate distinct prompts tailored to each individual sample. However, both approaches often struggle to adequately capture the unique characteristics of underrepresented classes, particularly in imbalanced scenarios where data for tail classes is scarce. To alleviate this issue, we propose an attribute-aware prompt tuning framework that promotes a more balanced understanding of imbalanced tasks by explicitly modeling critical class-level attributes. The key intuition is that, from a class-level perspective, essential attributes tend to remain relatively consistent regardless of sample size. Specifically, we build an attribute pool to learn potential semantic attributes of classes based on VLMs. For each input sample, we generate a unique attribute-aware prompt by selecting relevant class attributes from this pool through a matching mechanism. This design enables the model to capture essential class semantics and generate informative prompts, even for classes with limited data. Additionally, we introduce a ProAdapter module to facilitate the transfer of foundational knowledge from VLMs while enhancing generalization to underrepresented classes in imbalanced settings. Extensive experiments on standard and imbalanced few-shot tasks demonstrate that our model achieves superior performance, especially on tail classes.
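A minimal sketch of the kind of pool-based matching the abstract describes: select the attributes most similar to a sample's embedding and return them as extra prompt tokens. The pool size, embedding dimension, and the cosine-similarity matching rule are assumptions, not the paper's exact design.

```python
# Illustrative attribute-pool matching for attribute-aware prompts (assumed design).
import torch
import torch.nn.functional as F

def select_attributes(image_emb, attribute_pool, k=4):
    """Pick the k pool entries most similar to each image embedding and
    return them as extra prompt tokens (cosine-similarity matching)."""
    sim = F.normalize(image_emb, dim=-1) @ F.normalize(attribute_pool, dim=-1).T  # (B, P)
    topk = sim.topk(k, dim=-1).indices                                            # (B, k)
    return attribute_pool[topk]                                                   # (B, k, D)

# Hypothetical sizes: CLIP-like 512-d embeddings and a pool of 64 learnable attributes.
pool = torch.nn.Parameter(torch.randn(64, 512))
tokens = select_attributes(torch.randn(8, 512), pool, k=4)
print(tokens.shape)  # torch.Size([8, 4, 512])
```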
{"title":"AttriPrompt: Class Attribute-aware Prompt Tuning for Vision-Language Model.","authors":"Yuling Su,Xueliang Liu,Zhen Huang,Yunwei Zhao,Richang Hong,Meng Wang","doi":"10.1109/tip.2026.3657216","DOIUrl":"https://doi.org/10.1109/tip.2026.3657216","url":null,"abstract":"Prompt tuning has proven to be an effective alternative for fine-tuning the pre-trained vision-language models (VLMs) to downstream tasks. Among existing approaches, class-shared prompts learn a unified prompt shared across all classes, while sample-specific prompts generate distinct prompts tailored to each individual sample. However, both approaches often struggle to adequately capture the unique characteristics of underrepresented classes, particularly in imbalanced scenarios where data for tail classes is scarce. To alleviate this issue, we propose an attribute-aware prompt tuning framework that prompts a more balanced understanding for imbalance tasks by explicitly modeling critical class-level attributes. The key intuition is that from the perspective of class, essential attributes tend to be relatively consistent across classes, regardless of sample sizes. Specifically, we build an attribute pool to learn potential semantic attributes of classes based on VLMs. For each input sample, we generate a unique attribute-aware prompt by selecting relevant class attributes from this pool through a matching mechanism. This design enables the model to capture essential class semantics and generate informative prompts, even for classes with limited data. Additionally, we introduce a ProAdapter module to facilitate the transfer of foundational knowledge from VLMs while enhancing generalization to underrepresented classes in imbalanced settings. Extensive experiments on standard and imbalance few-shot tasks demonstrate that our model achieves superior performance especially in tail classes.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"3 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2026-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146069916","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ThinkMatter: Panoramic-Aware Instructional Semantics for Monocular Vision-and-Language Navigation
Pub Date: 2026-01-28 | DOI: 10.1109/tip.2026.3652003
Guangzhao Dai, Shuo Wang, Hao Zhao, Bin Zhu, Qianru Sun, Xiangbo Shu
Vision-and-Language Navigation in continuous environments (VLN-CE) requires an embodied robot to navigate to a target destination following a natural language instruction. Most existing methods use panoramic RGB-D cameras for 360° observation of the environment. However, these methods struggle in real-world applications because of the higher cost of panoramic RGB-D cameras. This paper studies a low-cost and practical VLN-CE setting, i.e., using a monocular camera with a limited field of view, which means the robot must "Look Less" in terms of visual observations and environment semantics. In this paper, we propose a ThinkMatter framework for monocular VLN-CE, in which we motivate monocular robots to "Think More" by 1) generating novel views and 2) integrating instruction semantics. Specifically, we achieve the former with the proposed 3DGS-based panoramic generation, which renders novel views at each step based on past observation collections. We achieve the latter with the proposed enhancement of occupancy-instruction semantics, which integrates the spatial semantics of occupancy maps with the textual semantics of language instructions. These operations equip monocular robots with wider environment perception as well as transparent semantic connections to the instruction. Extensive experiments in both simulators and real-world environments demonstrate the effectiveness of ThinkMatter, providing a promising practice for real-world navigation.
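One plausible way to integrate occupancy-map semantics with instruction semantics, as the abstract outlines, is cross-attention from occupancy cells to instruction tokens. The sketch below is an assumption-laden illustration (module name, dimensions, and attention design are ours), not the ThinkMatter implementation.

```python
# Illustrative occupancy-instruction fusion via cross-attention (assumed design).
import torch
import torch.nn as nn

class OccupancyInstructionFusion(nn.Module):
    """Each flattened occupancy-map cell attends to the instruction tokens,
    attaching language semantics to spatial locations."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, occ_feat, instr_tokens):
        # occ_feat: (B, H*W, D) flattened occupancy features; instr_tokens: (B, L, D)
        fused, _ = self.attn(query=occ_feat, key=instr_tokens, value=instr_tokens)
        return self.norm(occ_feat + fused)

# Hypothetical shapes: a 32x32 occupancy grid and a 20-token instruction.
out = OccupancyInstructionFusion()(torch.randn(2, 32 * 32, 256), torch.randn(2, 20, 256))
print(out.shape)  # torch.Size([2, 1024, 256])
```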
{"title":"ThinkMatter: Panoramic-Aware Instructional Semantics for Monocular Vision-and-Language Navigation.","authors":"Guangzhao Dai,Shuo Wang,Hao Zhao,Bin Zhu,Qianru Sun,Xiangbo Shu","doi":"10.1109/tip.2026.3652003","DOIUrl":"https://doi.org/10.1109/tip.2026.3652003","url":null,"abstract":"Vision-and-Language Navigation in continuous environments (VLN-CE) requires an embodied robot to navigate the target destination following the natural language instruction. Most existing methods use panoramic RGB-D cameras for 360° observation of environments. However, these methods struggle in real-world applications because of the higher cost of panoramic RGB-D cameras. This paper studies a low-cost and practical VLN-CE setting, e.g., using monocular cameras of limited field of view, which means \"Look Less\" for visual observations and environment semantics. In this paper, we propose a ThinkMatter framework for monocular VLN-CE, where we motivate monocular robots to \"Think More\" by 1) generating novel views and 2) integrating instruction semantics. Specifically, we achieve the former by the proposed 3DGS-based panoramic generation to render novel views at each step, based on past observation collections. We achieve the latter by the proposed enhancement of the occupancy-instruction semantics, which integrates the spatial semantics of occupancy maps with the textual semantics of language instructions. These operations promote monocular robots with wider environment perceptions as well as transparent semantic connections with the instruction. Both extensive experiments in the simulators and real-world environments demonstrate the effectiveness of ThinkMatter, providing a promising practice for real-world navigation.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"5 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2026-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146069920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Domain-aware Adversarial Domain Augmentation Network for Hyperspectral Image Classification
Pub Date: 2026-01-28 | DOI: 10.1109/tip.2026.3657203
Yi Huang, Jiangtao Peng, Weiwei Sun, Na Chen, Zhijing Ye, Qian Du
Classifying hyperspectral remote sensing images across different scenes has recently emerged as a significant challenge. When only historical labeled images (source domain, SD) are available, it is crucial to leverage them effectively to train a model with strong generalization ability that can be applied directly to classify unseen samples (target domain, TD). To address this challenge, this paper proposes a novel single-domain generalization (SDG) network, termed the domain-aware adversarial domain augmentation network (DADAnet), for cross-scene hyperspectral image classification (HSIC). DADAnet involves two stages: adversarial domain augmentation (ADA) and task-specific training. ADA employs a progressive adversarial generation strategy to construct an augmented domain (AD). To enhance variability in both the spatial and spectral dimensions, a domain-aware spatial-spectral mask (DSSM) encoder is constructed to increase the diversity of the generated adversarial samples. Furthermore, a two-level contrastive loss (TCC) is designed and incorporated into ADA to ensure both the diversity and effectiveness of AD samples. Finally, DADAnet performs supervised learning jointly on the SD and AD during the task-specific training stage. Experimental results on two public hyperspectral image datasets and a new Hangzhouwan (HZW) dataset demonstrate that the proposed DADAnet outperforms existing domain adaptation (DA) and domain generalization (DG) methods, achieving overall accuracies of 80.69%, 63.75%, and 87.61% on the three datasets, respectively.
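Adversarial domain augmentation in the generic sense means perturbing source samples along the gradient that increases the task loss, so the augmented domain is "hard" for the current model. The sketch below shows that generic scheme on hypothetical hyperspectral pixel vectors; it omits DADAnet's DSSM encoder and two-level contrastive loss.

```python
# Generic adversarial domain augmentation (not the DADAnet-specific variant).
import torch
import torch.nn.functional as F

def adversarial_augment(model, x, y, steps=5, lr=1.0):
    """Generate augmented-domain samples by perturbing inputs in the direction
    that increases the classification loss."""
    x_aug = x.clone().detach().requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x_aug), y)
        grad, = torch.autograd.grad(loss, x_aug)
        x_aug = (x_aug + lr * grad).detach().requires_grad_(True)
    return x_aug.detach()

# Hypothetical toy classifier on hyperspectral pixel vectors (200 bands, 9 classes).
model = torch.nn.Linear(200, 9)
x, y = torch.randn(16, 200), torch.randint(0, 9, (16,))
print(adversarial_augment(model, x, y).shape)  # torch.Size([16, 200])
```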
{"title":"Domain-aware Adversarial Domain Augmentation Network for Hyperspectral Image Classification.","authors":"Yi Huang,Jiangtao Peng,Weiwei Sun,Na Chen,Zhijing Ye,Qian Du","doi":"10.1109/tip.2026.3657203","DOIUrl":"https://doi.org/10.1109/tip.2026.3657203","url":null,"abstract":"Classifying hyperspectral remote sensing images across different scenes has recently emerged as a significant challenge. When only historical labeled images (source domain, SD) are available, it is crucial to leverage these images effectively to train a model with strong generalization ability that can be directly applied to classify unseen samples (target domain, TD). To address these challenges, this paper proposes a novel single-domain generalization (SDG) network, termed the domain-aware adversarial domain augmentation network (DADAnet) for cross-scene hyperspectral image classification (HSIC). DADAnet involves two stages: adversarial domain augmentation (ADA) and task-specific training. ADA employs a progressive adversarial generation strategy to construct an augmented domain (AD). To enhance variability in both spatial and spectral dimensions, a domain-aware spatial-spectral mask (DSSM) encoder is constructed to increase the diversity of the generated adversarial samples. Furthermore, a two-level contrastive loss (TCC) is designed and incorporated into the ADA to ensure both the diversity and effectiveness of AD samples. Finally, DADAnet performs supervised learning jointly on the SD and AD during the task-specific training stage. Experimental results on two public hyperspectral image datasets and a new Hangzhouwan (HZW) dataset demonstrate that the proposed DADAnet outperforms existing domain adaptation (DA) and domain generalization (DG) methods, achieving overall accuracies of 80.69%, 63.75%, and 87.61% on three datasets, respectively.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"296 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2026-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146069923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Few-Shot Class Incremental Learning Method Using Graph Neural Networks
Pub Date: 2026-01-28 | DOI: 10.1109/tip.2026.3657170
Yuqian Ma, Youfa Liu, Bo Du
Few-shot class incremental learning (FSCIL) aims to continuously learn new classes from limited training samples while retaining previously acquired knowledge. Existing approaches are not fully capable of balancing stability and plasticity in dynamic scenarios. To overcome this limitation, we introduce a novel FSCIL framework that leverages graph neural networks (GNNs) to model interdependencies between different categories and enhance cross-modal alignment. Our framework incorporates three key components: (1) a Graph Isomorphism Network (GIN) to propagate contextual relationships among prompts; (2) a Hamiltonian Graph Network with Energy Conservation (HGN-EC) to stabilize training dynamics via energy conservation constraints; and (3) an Adversarially Constrained Graph Autoencoder (ACGA) to enforce latent space consistency. By integrating these components with a parameter-efficient CLIP backbone, our method dynamically adapts graph structures to model semantic correlations between the textual and visual modalities. Additionally, contrastive learning with energy-based regularization is employed to mitigate catastrophic forgetting and improve generalization. Comprehensive experiments on benchmark datasets demonstrate the framework's superior incremental accuracy and stability compared with state-of-the-art baselines. This work advances FSCIL by unifying graph-based relational reasoning with physics-inspired optimization, offering a scalable and interpretable framework.
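For context, a single Graph Isomorphism Network update, the GIN component named in (1), has the form h' = MLP((1 + eps) * h + sum of neighbor features). The sketch below implements that textbook update on a hypothetical prompt graph; it is not the paper's full framework.

```python
# Textbook GIN layer, illustrating the message passing used over prompt nodes.
import torch
import torch.nn as nn

class GINLayer(nn.Module):
    """One GIN update: h' = MLP((1 + eps) * h + aggregated neighbor features)."""
    def __init__(self, dim):
        super().__init__()
        self.eps = nn.Parameter(torch.zeros(1))
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, h, adj):
        # h: (N, D) node (e.g. class-prompt) features; adj: (N, N) adjacency without self-loops.
        return self.mlp((1 + self.eps) * h + adj @ h)

# Hypothetical graph over 10 class prompts with random connectivity.
h = torch.randn(10, 128)
adj = (torch.rand(10, 10) > 0.7).float()
adj.fill_diagonal_(0)
print(GINLayer(128)(h, adj).shape)  # torch.Size([10, 128])
```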
BP-NeRF: End-to-End Neural Radiance Fields for Sparse Images without Camera Pose in Complex Scenes
Pub Date: 2026-01-28 | DOI: 10.1109/tip.2026.3657188
Yaru Qiu, Guoxia Wu, Yuanyuan Sun
Synthesizing high-quality novel views of complex scenes from sparse image sequences, especially sequences without camera poses, is a challenging task. The key to improving accuracy in such scenarios lies in sufficient prior knowledge and accurate camera motion constraints. We therefore propose an end-to-end novel view synthesis network named BP-NeRF. It uses sequences of sparse images captured in complex indoor and outdoor scenes to estimate camera motion trajectories and generate novel view images. Firstly, to address inaccurate depth-map prediction caused by insufficient overlapping features in sparse images, we design the RDP-Net module to generate depth maps for sparse image sequences and compute the depth accuracy of these maps, providing the network with a reliable depth prior. Secondly, to improve the accuracy of camera pose estimation, we construct a loss function based on the geometric consistency of 2D and 3D feature variations between frames, improving the accuracy and robustness of the network's estimates. We conducted experimental evaluations on the LLFF and Tanks datasets, and the results show that, compared with current mainstream methods, BP-NeRF generates more accurate novel views without camera poses.
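A common concrete instance of a 2D-3D geometric-consistency constraint between frames is a reprojection loss: back-project matched keypoints with their depths, apply the estimated relative pose, reproject into the other frame, and penalize the pixel offset. The sketch below illustrates that generic formulation; the exact loss used in BP-NeRF may differ.

```python
# Generic frame-to-frame reprojection-consistency loss (illustrative, not BP-NeRF's loss).
import torch

def reprojection_loss(pts_i, depth_i, K, T_ij, pts_j):
    """Back-project keypoints from frame i, transform by the estimated relative
    pose T_ij, reproject into frame j, and penalize the distance to the matches."""
    ones = torch.ones(pts_i.shape[0], 1)
    pix = torch.cat([pts_i, ones], dim=1)                        # homogeneous pixels (N, 3)
    cam_i = (torch.linalg.inv(K) @ pix.T).T * depth_i[:, None]   # back-project to 3D (N, 3)
    cam_i_h = torch.cat([cam_i, ones], dim=1)                    # (N, 4)
    cam_j = (T_ij @ cam_i_h.T).T[:, :3]                          # transform into frame j
    proj = (K @ cam_j.T).T
    proj = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)            # perspective divide
    return torch.mean(torch.norm(proj - pts_j, dim=1))

# Hypothetical intrinsics, identity relative pose, and matched keypoints (loss ~ 0).
K = torch.tensor([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
pts = torch.rand(50, 2) * torch.tensor([640.0, 480.0])
print(float(reprojection_loss(pts, torch.rand(50) * 5 + 1, K, torch.eye(4), pts)))
```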
{"title":"BP-NeRF: End-to-End Neural Radiance Fields for Sparse Images without Camera Pose in Complex Scenes.","authors":"Yaru Qiu,Guoxia Wu,Yuanyuan Sun","doi":"10.1109/tip.2026.3657188","DOIUrl":"https://doi.org/10.1109/tip.2026.3657188","url":null,"abstract":"Synthesizing novel perspectives of complex scenes in high quality using sparse image sequences, especially for those without camera poses, is a challenging task. The key to enhancing accuracy in such scenarios lies in sufficient prior knowledge and accurate camera motion constraints. Therefore, we propose an end-to-end novel view synthesis network named BP-NeRF. It is capable of using sequences of sparse images captured in indoor and outdoor complex scenes to estimate camera motion trajectories and generate novel view images. Firstly, to address the issue of inaccurate prediction of depth map caused by insufficient overlapping features in sparse images, we designed the RDP-Net module to generate depth maps for sparse image sequences and calculate the depth accuracy of these maps, providing the network with a reliable depth prior. Secondly, to enhance the accuracy of camera pose estimation, we construct a loss function based on the geometric consistency of 2D and 3D feature variations between frames, improving the accuracy and robustness of the network's estimations. We conducted experimental evaluations on the LLFF and Tanks datasets, and the results show that, compared to the current mainstream methods, BP-NeRF can generate more accurate novel views without camera poses.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"31 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2026-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146069922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}