
Latest publications in IEEE Transactions on Image Processing

Deep G-PCC Geometry Preprocessing via Joint Optimization with a Differentiable Codec Surrogate for Enhanced Compression Efficiency.
IF 10.6, Zone 1 (Computer Science), Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2026-01-28. DOI: 10.1109/tip.2026.3655187
Wanhao Ma, Wei Zhang, Shuai Wan, Fuzheng Yang
Geometry-based point cloud compression (G-PCC), an international standard designed by MPEG, provides a generic framework for compressing diverse types of point clouds while ensuring interoperability across applications and devices. However, G-PCC underperforms compared to recent deep learning-based PCC methods despite its lower computational power consumption. To enhance the efficiency of G-PCC without sacrificing its interoperability or computational flexibility, we propose the first compression-oriented point cloud voxelization network jointly optimized with a differentiable G-PCC surrogate model. The surrogate model mimics the rate-distortion behavior of the non-differentiable G-PCC codec, enabling end-to-end gradient propagation. The versatile voxelization network adaptively transforms input point clouds using learning-based voxelization and effectively manipulates point clouds via global scaling, fine-grained pruning, and point-level editing for rate-distortion trade-off. During inference, only the lightweight voxelization network is prepended to the G-PCC encoder, requiring no modifications to the decoder, thus introducing no computational overhead for end users. Extensive experiments demonstrate a 38.84% average BD-rate reduction over G-PCC. By bridging classical codecs with deep learning, this work offers a practical pathway to enhance legacy compression standards while preserving their backward compatibility, making it ideal for real-world deployment.
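The joint-optimization idea above — training a preprocessing network through a frozen, differentiable surrogate that predicts the codec's rate and distortion so that gradients can bypass the non-differentiable G-PCC encoder — can be illustrated with a minimal PyTorch sketch. The module names, architectures, and the lambda_rd weight below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of joint optimization through a frozen differentiable codec surrogate.
# VoxelizationNet, CodecSurrogate, and lambda_rd are illustrative placeholders,
# not the authors' actual architectures or settings.
import torch
import torch.nn as nn

class VoxelizationNet(nn.Module):
    """Toy stand-in for the lightweight preprocessing placed before the codec."""
    def __init__(self, dim=3, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, points):            # points: (B, N, 3)
        return points + self.mlp(points)  # point-level editing as a residual offset

class CodecSurrogate(nn.Module):
    """Toy differentiable proxy predicting the (rate, distortion) of the real codec."""
    def __init__(self, hidden=64):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, points):
        stats = self.head(points).mean(dim=1)   # (B, 2): pooled rate/distortion estimates
        return stats[:, 0].abs(), stats[:, 1].abs()

preproc = VoxelizationNet()
surrogate = CodecSurrogate()
surrogate.requires_grad_(False)           # surrogate assumed pre-trained and frozen
optimizer = torch.optim.Adam(preproc.parameters(), lr=1e-4)
lambda_rd = 0.01                          # rate-distortion trade-off weight (illustrative)

points = torch.rand(4, 1024, 3)           # dummy batch of point clouds
for step in range(10):
    edited = preproc(points)
    rate, distortion = surrogate(edited)             # gradients flow through the frozen surrogate
    loss = (rate + lambda_rd * distortion).mean()    # end-to-end rate-distortion objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

At inference, only the preprocessing network would be kept in front of the real encoder, matching the deployment setup described in the abstract.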
Citations: 0
Domain-Complementary Prior with Fine-Grained Feedback for Scene Text Image Super-Resolution.
IF 10.6, Zone 1 (Computer Science), Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2026-01-28. DOI: 10.1109/tip.2026.3657246
Shen Zhang, Yang Li, Pengwen Dai, Xiaozhou Zhou, Guotao Xie
Enhancing the resolution of scene text images is a critical preprocessing step that can substantially improve the accuracy of downstream text recognition in low-quality images. Existing methods primarily rely on auxiliary text features to guide the super-resolution process. However, these features often lack rich low-level information, making them insufficient for faithfully reconstructing both the global structure and fine-grained details of text. Moreover, previous methods often learn suboptimal feature representations from the original low-quality landmark images, which cannot provide precise guidance for super-resolution. In this study, we propose a Fine-Grained Feedback Domain-Complementary Network (FDNet) for scene text image super-resolution. Specifically, we first employ a fine-grained feedback mechanism to selectively refine landmark images, thereby enhancing feature representations. Then, we introduce a novel domain-trace prior interaction generator, which integrates domain-specific traces with a text prior, comprehensively complementing the clear edges and structural coverage of the text. Finally, motivated by the limitations of existing datasets, which often cover limited scene scales and lack sufficiently challenging scenarios, we introduce a new dataset, MDRText. The proposed MDRText dataset has multi-scale and diverse characteristics and is designed to support challenging text image recognition and super-resolution tasks. Extensive experiments on the MDRText and TextZoom datasets demonstrate that our method achieves superior performance in scene text image super-resolution and further improves the accuracy of subsequent recognition tasks.
Citations: 0
CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications
IF 10.6, Zone 1 (Computer Science), Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2026-01-28. DOI: 10.1109/tip.2026.3655121
Tianfang Zhang, Lei Li, Yang Zhou, Wentao Liu, Chen Qian, Jenq-Neng Hwang, Xiangyang Ji
Citations: 0
Dissecting RGB-D Learning for Improved Multi-modal Fusion.
IF 10.6, Zone 1 (Computer Science), Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2026-01-28. DOI: 10.1109/tip.2026.3657171
Hao Chen, Haoran Zhou, Yunshu Zhang, Zheng Lin, Yongjian Deng
In the RGB-D vision community, extensive research has focused on designing multi-modal learning strategies and fusion structures. However, the complementary and fusion mechanisms in RGB-D models remain a black box. In this paper, we present an analytical framework and a novel score to dissect RGB-D learning. Our approach measures the proposed semantic variance and feature similarity across modalities and levels, and conducts visual and quantitative analyses of multi-modal learning through comprehensive experiments. Specifically, we investigate the consistency and specialty of features across modalities, the evolution rules within each modality, and the collaboration logic used when optimizing an RGB-D model. Our studies reveal and verify several important findings, such as the discrepancy in cross-modal features and the hybrid multi-modal cooperation rule, which highlights consistency and specialty simultaneously for complementary inference. We also showcase the versatility of the proposed RGB-D dissection method and introduce a straightforward fusion strategy based on our findings, which delivers significant improvements across various tasks and even other multi-modal data.
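The kind of cross-modal analysis described here — quantifying how similar RGB and depth features are at a given network level — can be approximated with a standard representation-similarity measure such as linear CKA. The sketch below is a generic illustration under that assumption; it is not the paper's proposed score.

```python
# Generic cross-modal similarity sketch using linear CKA; the paper proposes its
# own score, so this is only an illustrative baseline measure.
import torch

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """x, y: (num_samples, feature_dim) features from two modalities at one level."""
    x = x - x.mean(dim=0, keepdim=True)   # center each feature dimension
    y = y - y.mean(dim=0, keepdim=True)
    cross = torch.norm(y.t() @ x, p="fro") ** 2
    return cross / (torch.norm(x.t() @ x, p="fro") * torch.norm(y.t() @ y, p="fro"))

# Example: pooled features from an RGB branch and a depth branch.
rgb_feat = torch.randn(256, 512)
depth_feat = torch.randn(256, 512)
print(float(linear_cka(rgb_feat, depth_feat)))   # near 0 for unrelated random features
```

Applying such a measure per layer and per modality pair is one simple way to probe the consistency/specialty trade-off the abstract refers to.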
Citations: 0
StealthMark: Harmless and Stealthy Ownership Verification for Medical Segmentation via Uncertainty-Guided Backdoors
IF 10.6, Zone 1 (Computer Science), Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2026-01-28. DOI: 10.1109/tip.2026.3655563
Qinkai Yu, Chong Zhang, Gaojie Jin, Tianjin Huang, Wei Zhou, Wenhui Li, Xiaobo Jin, Bo Huang, Yitian Zhao, Guang Yang, Gregory Y.H. Lip, Yalin Zheng, Aline Villavicencio, Yanda Meng
Citations: 0
AttriPrompt: Class Attribute-aware Prompt Tuning for Vision-Language Model.
IF 10.6, Zone 1 (Computer Science), Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2026-01-28. DOI: 10.1109/tip.2026.3657216
Yuling Su, Xueliang Liu, Zhen Huang, Yunwei Zhao, Richang Hong, Meng Wang
Prompt tuning has proven to be an effective alternative for fine-tuning pre-trained vision-language models (VLMs) on downstream tasks. Among existing approaches, class-shared prompts learn a unified prompt shared across all classes, while sample-specific prompts generate distinct prompts tailored to each individual sample. However, both approaches often struggle to adequately capture the unique characteristics of underrepresented classes, particularly in imbalanced scenarios where data for tail classes is scarce. To alleviate this issue, we propose an attribute-aware prompt tuning framework that promotes a more balanced understanding of imbalanced tasks by explicitly modeling critical class-level attributes. The key intuition is that, from the class perspective, essential attributes tend to be relatively consistent across classes, regardless of sample size. Specifically, we build an attribute pool to learn potential semantic attributes of classes based on VLMs. For each input sample, we generate a unique attribute-aware prompt by selecting relevant class attributes from this pool through a matching mechanism. This design enables the model to capture essential class semantics and generate informative prompts, even for classes with limited data. Additionally, we introduce a ProAdapter module to facilitate the transfer of foundational knowledge from VLMs while enhancing generalization to underrepresented classes in imbalanced settings. Extensive experiments on standard and imbalanced few-shot tasks demonstrate that our model achieves superior performance, especially on tail classes.
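The matching mechanism described above — selecting relevant attribute embeddings from a learnable pool for each sample and fusing them into a prompt — can be sketched as follows. The pool size, top-k selection, and weighted-mean fusion are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of attribute-aware prompt construction via pool matching.
# Pool size, top-k selection, and mean fusion are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributePromptBuilder(nn.Module):
    def __init__(self, pool_size=64, dim=512, top_k=4):
        super().__init__()
        self.pool = nn.Parameter(torch.randn(pool_size, dim) * 0.02)  # learnable attribute pool
        self.top_k = top_k

    def forward(self, image_feat):                    # image_feat: (B, dim)
        sim = F.normalize(image_feat, dim=-1) @ F.normalize(self.pool, dim=-1).t()  # (B, pool)
        topk = sim.topk(self.top_k, dim=-1)
        weights = topk.values.softmax(dim=-1)         # (B, k) soft weights over selected attributes
        selected = self.pool[topk.indices]            # (B, k, dim)
        attr_prompt = (weights.unsqueeze(-1) * selected).sum(dim=1)   # (B, dim)
        return attr_prompt                            # to be combined with the text prompt tokens

builder = AttributePromptBuilder()
feats = torch.randn(8, 512)                           # dummy CLIP-style image features
print(builder(feats).shape)                           # torch.Size([8, 512])
```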
Citations: 0
ThinkMatter: Panoramic-Aware Instructional Semantics for Monocular Vision-and-Language Navigation.
IF 10.6, Zone 1 (Computer Science), Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2026-01-28. DOI: 10.1109/tip.2026.3652003
Guangzhao Dai, Shuo Wang, Hao Zhao, Bin Zhu, Qianru Sun, Xiangbo Shu
Vision-and-Language Navigation in continuous environments (VLN-CE) requires an embodied robot to navigate to a target destination by following a natural language instruction. Most existing methods use panoramic RGB-D cameras for 360° observation of the environment. However, these methods struggle in real-world applications because of the higher cost of panoramic RGB-D cameras. This paper studies a low-cost and practical VLN-CE setting that uses monocular cameras with a limited field of view, which means "Look Less" for visual observations and environment semantics. In this paper, we propose a ThinkMatter framework for monocular VLN-CE, in which we motivate monocular robots to "Think More" by 1) generating novel views and 2) integrating instruction semantics. Specifically, we achieve the former with the proposed 3DGS-based panoramic generation, which renders novel views at each step from past observations. We achieve the latter with the proposed enhancement of occupancy-instruction semantics, which integrates the spatial semantics of occupancy maps with the textual semantics of language instructions. These operations equip monocular robots with wider environment perception as well as transparent semantic connections to the instruction. Extensive experiments in both simulators and real-world environments demonstrate the effectiveness of ThinkMatter, providing a promising practice for real-world navigation.
Citations: 0
Domain-aware Adversarial Domain Augmentation Network for Hyperspectral Image Classification.
IF 10.6, Zone 1 (Computer Science), Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2026-01-28. DOI: 10.1109/tip.2026.3657203
Yi Huang, Jiangtao Peng, Weiwei Sun, Na Chen, Zhijing Ye, Qian Du
Classifying hyperspectral remote sensing images across different scenes has recently emerged as a significant challenge. When only historical labeled images (source domain, SD) are available, it is crucial to leverage these images effectively to train a model with strong generalization ability that can be directly applied to classify unseen samples (target domain, TD). To address these challenges, this paper proposes a novel single-domain generalization (SDG) network, termed the domain-aware adversarial domain augmentation network (DADAnet) for cross-scene hyperspectral image classification (HSIC). DADAnet involves two stages: adversarial domain augmentation (ADA) and task-specific training. ADA employs a progressive adversarial generation strategy to construct an augmented domain (AD). To enhance variability in both spatial and spectral dimensions, a domain-aware spatial-spectral mask (DSSM) encoder is constructed to increase the diversity of the generated adversarial samples. Furthermore, a two-level contrastive loss (TCC) is designed and incorporated into the ADA to ensure both the diversity and effectiveness of AD samples. Finally, DADAnet performs supervised learning jointly on the SD and AD during the task-specific training stage. Experimental results on two public hyperspectral image datasets and a new Hangzhouwan (HZW) dataset demonstrate that the proposed DADAnet outperforms existing domain adaptation (DA) and domain generalization (DG) methods, achieving overall accuracies of 80.69%, 63.75%, and 87.61% on three datasets, respectively.
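The core of adversarial domain augmentation — perturbing source samples in the direction that increases the task loss, so the augmented domain differs from the source while labels are preserved — can be shown with a single-step sketch. DADAnet's ADA is progressive and guided by a domain-aware spatial-spectral mask and contrastive losses, which this toy example does not reproduce.

```python
# Minimal single-step sketch of adversarial domain augmentation.
# The paper's ADA is progressive and mask-guided; this only shows the core idea.
import torch
import torch.nn as nn
import torch.nn.functional as F

def augment_domain(model: nn.Module, x: torch.Tensor, y: torch.Tensor, alpha: float = 0.01):
    """Create augmented-domain samples by moving inputs along the task-loss gradient."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    with torch.no_grad():
        x_aug = x_adv + alpha * x_adv.grad.sign()   # push samples away from the source domain
    return x_aug.detach()

# Dummy hyperspectral patches: (batch, bands, height, width); 103 bands is illustrative.
model = nn.Sequential(nn.Flatten(), nn.Linear(103 * 7 * 7, 9))
x = torch.rand(16, 103, 7, 7)
y = torch.randint(0, 9, (16,))
x_aug = augment_domain(model, x, y)
# Task-specific training would then learn jointly from (x, y) and (x_aug, y).
```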
Citations: 0
A Few-Shot Class Incremental Learning Method Using Graph Neural Networks.
IF 10.6, Zone 1 (Computer Science), Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2026-01-28. DOI: 10.1109/tip.2026.3657170
Yuqian Ma, Youfa Liu, Bo Du
Few-shot class incremental learning (FSCIL) aims to continuously learn new classes from limited training samples while retaining previously acquired knowledge. Existing approaches are not fully capable of balancing stability and plasticity in dynamic scenarios. To overcome this limitation, we introduce a novel FSCIL framework that leverages graph neural networks (GNNs) to model interdependencies between different categories and enhance cross-modal alignment. Our framework incorporates three key components: (1) a Graph Isomorphism Network (GIN) to propagate contextual relationships among prompts; (2) a Hamiltonian Graph Network with Energy Conservation (HGN-EC) to stabilize training dynamics via energy conservation constraints; and (3) an Adversarially Constrained Graph Autoencoder (ACGA) to enforce latent space consistency. By integrating these components with a parameter-efficient CLIP backbone, our method dynamically adapts graph structures to model semantic correlations between textual and visual modalities. Additionally, contrastive learning with energy-based regularization is employed to mitigate catastrophic forgetting and improve generalization. Comprehensive experiments on benchmark datasets validate the framework's incremental accuracy and stability compared to state-of-the-art baselines. This work advances FSCIL by unifying graph-based relational reasoning with physics-inspired optimization, offering a scalable and interpretable framework.
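The GIN component mentioned above propagates context among prompt embeddings; a single GIN layer updates each node from itself plus the sum of its neighbors. The sketch below shows that update rule only — the graph construction, dimensions, and depth are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of one Graph Isomorphism Network (GIN) layer over prompt nodes.
# Adjacency, dimensions, and depth are illustrative; not the paper's configuration.
import torch
import torch.nn as nn

class GINLayer(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.eps = nn.Parameter(torch.zeros(1))
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, h, adj):
        # h: (num_prompts, dim) prompt embeddings; adj: (num_prompts, num_prompts) 0/1 adjacency.
        aggregated = adj @ h                          # sum of neighbor embeddings
        return self.mlp((1.0 + self.eps) * h + aggregated)

prompts = torch.randn(10, 512)                        # one prompt embedding per class
adj = (torch.rand(10, 10) > 0.5).float()
adj = ((adj + adj.t()) > 0).float()                   # make the relation graph symmetric
adj.fill_diagonal_(0)                                 # no self-loops; handled by the (1+eps) term
layer = GINLayer()
print(layer(prompts, adj).shape)                      # torch.Size([10, 512])
```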
Citations: 0
BP-NeRF: End-to-End Neural Radiance Fields for Sparse Images without Camera Pose in Complex Scenes.
IF 10.6, Zone 1 (Computer Science), Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2026-01-28. DOI: 10.1109/tip.2026.3657188
Yaru Qiu, Guoxia Wu, Yuanyuan Sun
Synthesizing high-quality novel views of complex scenes from sparse image sequences, especially when camera poses are unavailable, is a challenging task. The key to enhancing accuracy in such scenarios lies in sufficient prior knowledge and accurate camera motion constraints. Therefore, we propose an end-to-end novel view synthesis network named BP-NeRF. It uses sparse image sequences captured in complex indoor and outdoor scenes to estimate camera motion trajectories and generate novel view images. First, to address inaccurate depth-map prediction caused by insufficient overlapping features in sparse images, we design the RDP-Net module to generate depth maps for sparse image sequences and compute the depth accuracy of these maps, providing the network with a reliable depth prior. Second, to enhance the accuracy of camera pose estimation, we construct a loss function based on the geometric consistency of 2D and 3D feature variations between frames, improving the accuracy and robustness of the network's estimates. Experimental evaluations on the LLFF and Tanks datasets show that, compared with current mainstream methods, BP-NeRF generates more accurate novel views without camera poses.
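A 3D geometric consistency term of the general kind mentioned here can be written as an alignment loss between matched 3D points under the estimated relative pose. The correspondences and pose parameterization below are illustrative assumptions, not BP-NeRF's exact formulation.

```python
# Minimal sketch of a 3D geometric consistency term between two frames.
# Correspondences and pose parameterization are illustrative assumptions.
import math
import torch

def geometric_consistency_loss(points_i: torch.Tensor,
                               points_j: torch.Tensor,
                               R: torch.Tensor,
                               t: torch.Tensor) -> torch.Tensor:
    """points_i, points_j: (N, 3) matched 3D points back-projected from frames i and j.
    R: (3, 3) rotation and t: (3,) translation of the estimated relative pose i -> j."""
    transformed = points_i @ R.t() + t            # map frame-i points into frame j
    return (transformed - points_j).norm(dim=-1).mean()

# Dummy example: a known rigid motion should give a (near-)zero loss.
theta = 0.1
R = torch.tensor([[math.cos(theta), -math.sin(theta), 0.0],
                  [math.sin(theta),  math.cos(theta), 0.0],
                  [0.0,              0.0,             1.0]])
t = torch.tensor([0.2, -0.1, 0.05])
pts_i = torch.rand(100, 3)
pts_j = pts_i @ R.t() + t
print(float(geometric_consistency_loss(pts_i, pts_j, R, t)))  # ~0.0
```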
Citations: 0