
Latest publications in IEEE Transactions on Image Processing

JOANet: An Integrated Joint Optimization Architecture Making Medical Image Segmentation Really Helped by Super-resolution Pre-processing
IF 10.6 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-10-17 | DOI: 10.1109/tip.2025.3620627
Cheng-Hao Qiu, Xian-Shi Zhang, Yong-Jie Li
{"title":"JOANet: An Integrated Joint Optimization Architecture Making Medical Image Segmentation Really Helped by Super-resolution Pre-processing","authors":"Cheng-Hao Qiu, Xian-Shi Zhang, Yong-Jie Li","doi":"10.1109/tip.2025.3620627","DOIUrl":"https://doi.org/10.1109/tip.2025.3620627","url":null,"abstract":"","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"100 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2025-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145310824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Hierarchical Multimodal Knowledge Matching for Training-Free Open-Vocabulary Object Detection.
IF 10.6 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-10-14 | DOI: 10.1109/tip.2025.3618408
Qisen Ma, Yan Huang, Zikun Liu, Hyunhee Park, Liang Wang
Open-Vocabulary Object Detection (OVOD) aims to leverage the generalization capabilities of pre-trained vision-language models for detecting objects beyond the trained categories. Existing methods mostly focus on supervised learning strategies based on available training data, which might be suboptimal for data-limited novel categories. To tackle this challenge, this paper presents a Hierarchical Multimodal Knowledge Matching method (HMKM) to better represent novel categories and match them with region features. Specifically, HMKM includes a set of object prototype knowledge that is obtained using limited category-specific images, acting as off-the-shelf category representations. In addition, HMKM also includes a set of attribute prototype knowledge to represent key attributes of categories at a fine-grained level, with the goal of distinguishing a category from visually similar ones. During inference, the two sets of object and attribute prototype knowledge are adaptively combined to match categories with region features. The proposed HMKM is training-free and can be easily integrated as a plug-and-play module into existing OVOD models. Extensive experiments demonstrate that our HMKM significantly improves performance when detecting novel categories across various backbones and datasets.
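The matching step described above can be pictured as scoring a region feature against two prototype banks and fusing the scores. A minimal sketch follows; the cosine-similarity scoring, the fixed fusion weight, and all names are illustrative assumptions rather than the authors' implementation (the paper combines the two knowledge sets adaptively).

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def match_region(region_feat, obj_protos, attr_protos, alpha=0.7):
    """Score one region feature against per-class prototype knowledge.

    region_feat: (D,) region embedding from the detector.
    obj_protos:  (C, D) one object prototype per category.
    attr_protos: (C, A, D) A attribute prototypes per category.
    alpha:       illustrative weight balancing object- vs attribute-level scores
                 (a fixed value here; the paper's fusion is adaptive).
    """
    r = l2_normalize(region_feat)
    obj_scores = l2_normalize(obj_protos) @ r                    # (C,) cosine similarity to object prototypes
    attr_scores = (l2_normalize(attr_protos) @ r).max(axis=1)    # (C,) best-matching attribute per class
    combined = alpha * obj_scores + (1.0 - alpha) * attr_scores  # fused category scores
    return int(np.argmax(combined)), combined

# toy usage with random features
rng = np.random.default_rng(0)
pred, scores = match_region(rng.normal(size=256),
                            rng.normal(size=(5, 256)),
                            rng.normal(size=(5, 3, 256)))
print(pred, scores.round(3))
```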
Citations: 0
Cross-domain Few-shot Medical Image Segmentation via Dynamic Semantic Matching
IF 10.6 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-10-13 | DOI: 10.1109/tip.2025.3618396
Yazhou Zhu, Shidong Wang, Tao Zhou, Zechao Li, Haofeng Zhang, Ling Shao
{"title":"Cross-domain Few-shot Medical Image Segmentation via Dynamic Semantic Matching","authors":"Yazhou Zhu, Shidong Wang, Tao Zhou, Zechao Li, Haofeng Zhang, Ling Shao","doi":"10.1109/tip.2025.3618396","DOIUrl":"https://doi.org/10.1109/tip.2025.3618396","url":null,"abstract":"","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"3 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2025-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145282984","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Exploring Vision-Based Active 3D Object Detection by Informativeness Characterization
IF 10.6 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-10-01 | DOI: 10.1109/tip.2025.3613927
Ruixiang Li, Yiming Wu, Yehao Lu, Xuewei Li, Xian Wang, Xiubo Liang, Xi Li
{"title":"Exploring Vision-Based Active 3D Object Detection by Informativeness Characterization","authors":"Ruixiang Li, Yiming Wu, Yehao Lu, Xuewei Li, Xian Wang, Xiubo Liang, Xi Li","doi":"10.1109/tip.2025.3613927","DOIUrl":"https://doi.org/10.1109/tip.2025.3613927","url":null,"abstract":"","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"104 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2025-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145203236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Mgs-Stereo: Multi-scale Geometric-Structure-Enhanced Stereo Matching for Complex Real-World Scenes.
IF 10.6 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-09-26 | DOI: 10.1109/tip.2025.3612754
Zhien Dai, Zhaohui Tang, Hu Zhang, Yongfang Xie
Complex imaging environments and conditions in real-world scenes pose significant challenges for stereo matching tasks. Models are susceptible to underperformance on non-Lambertian surfaces, in weakly textured regions, and in occluded regions, due to the difficulty of establishing accurate matching relationships between pixels. To alleviate these problems, we propose a multi-scale geometrically enhanced stereo matching model that exploits the geometric structural relationships of the objects in the scene. First, a geometric structure perception module is designed to extract geometric information from the reference view. Second, a geometric structure-adaptive embedding module is proposed to integrate geometric information with matching similarity information; this module fuses multi-source features dynamically to predict disparity residuals in different regions. Third, a geometry-based normalized disparity correction module is proposed to improve matching robustness in pathological regions of realistic complex scenes. Extensive evaluations on popular benchmarks demonstrate that our method achieves competitive performance against leading approaches. Notably, our model provides robust and accurate predictions in challenging regions containing edges, occlusions, reflective, and non-Lambertian surfaces. Our source code will be publicly available.
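One way to picture the structure-adaptive embedding idea, fusing matching-similarity features with geometric-structure features into a per-pixel disparity residual, is sketched below. The layer layout, channel sizes, and gating scheme are placeholders invented for illustration, not the paper's module.

```python
import torch
import torch.nn as nn

class StructureAdaptiveResidual(nn.Module):
    """Illustrative fusion of matching features and geometric features
    into a per-pixel disparity residual (a sketch, not the paper's module)."""
    def __init__(self, match_ch=32, geo_ch=16, hidden=32):
        super().__init__()
        self.gate = nn.Sequential(                      # per-pixel weights for the two feature sources
            nn.Conv2d(match_ch + geo_ch, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 2, 1), nn.Softmax(dim=1))
        self.proj_m = nn.Conv2d(match_ch, hidden, 1)
        self.proj_g = nn.Conv2d(geo_ch, hidden, 1)
        self.head = nn.Conv2d(hidden, 1, 3, padding=1)  # predicted disparity residual

    def forward(self, match_feat, geo_feat, disp_init):
        w = self.gate(torch.cat([match_feat, geo_feat], dim=1))               # (B, 2, H, W)
        fused = w[:, :1] * self.proj_m(match_feat) + w[:, 1:] * self.proj_g(geo_feat)
        return disp_init + self.head(fused)                                   # refined disparity

# toy usage
m = StructureAdaptiveResidual()
d = m(torch.randn(1, 32, 64, 64), torch.randn(1, 16, 64, 64), torch.zeros(1, 1, 64, 64))
print(d.shape)  # torch.Size([1, 1, 64, 64])
```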
Citations: 0
Reduced Biquaternion Dual-Branch Deraining U-Network via Multi-Attention Mechanism.
IF 10.6 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-09-26 | DOI: 10.1109/tip.2025.3612841
Shan Gai, Yihao Ni
As a prerequisite for many vision-oriented tasks, image deraining is an effective way to alleviate the performance degradation these tasks suffer on rainy days. In recent years, the introduction of deep learning has brought significant advances in deraining techniques. However, due to the inherent constraints of synthetic datasets and the insufficient robustness of network architecture designs, most existing methods struggle to fit varied rain patterns and to adapt from synthetic rainy images to real ones, ultimately yielding unsatisfactory restoration results. To address these issues, we propose a reduced biquaternion dual-branch deraining U-Network (RQ-D2UNet) for better deraining performance, which is the first attempt to apply a reduced biquaternion-valued neural network to the deraining task. The algebraic properties of reduced biquaternions (RQ) make it possible to model rainy artifacts more accurately while preserving the underlying spatial structure of the background image. The overall design, a U-shaped architecture with a dual-branch structure, extracts multi-scale contextual information and fully explores the mixed correlation between rainy and rain-free features. Moreover, we also extend the self-attention and convolutional attention mechanisms to the RQ domain, which allows the proposed model to balance global dependency capture and local feature extraction. Extensive experimental results on various rainy datasets (i.e., rain streak/rain-haze/raindrop/real rain), downstream vision applications (i.e., object detection and segmentation), and similar image restoration tasks (i.e., image desnowing and low-light image enhancement) demonstrate the superiority and versatility of our proposed method.
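For readers unfamiliar with the underlying algebra: a reduced biquaternion q = a + bi + cj + dk is four-dimensional like an ordinary quaternion, but its multiplication is commutative. The sketch below implements that product under the common convention i² = k² = −1, j² = 1, ij = ji = k (stated here as an assumption; the paper may adopt a different formulation of RQ arithmetic).

```python
import numpy as np

def rq_mul(p, q):
    """Product of two reduced biquaternions stored as (a, b, c, d), i.e. a + bi + cj + dk,
    under the assumed convention i^2 = k^2 = -1, j^2 = 1, ij = ji = k."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return np.array([
        a1*a2 - b1*b2 + c1*c2 - d1*d2,   # real part
        a1*b2 + b1*a2 + c1*d2 + d1*c2,   # i component
        a1*c2 + c1*a2 - b1*d2 - d1*b2,   # j component
        a1*d2 + d1*a2 + b1*c2 + c1*b2,   # k component
    ])

p = np.array([1.0, 2.0, 0.5, -1.0])
q = np.array([0.3, -0.7, 1.2, 0.4])
print(rq_mul(p, q))
print(np.allclose(rq_mul(p, q), rq_mul(q, p)))  # True: the product is commutative
```

The commutativity is what distinguishes this algebra from ordinary quaternions, where left and right multiplication differ and hypercomplex layers must pick one.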
Citations: 0
Hyperbolic Self-Paced Multi-Expert Network for Cross-Domain Few-Shot Facial Expression Recognition.
IF 10.6 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-09-25 | DOI: 10.1109/tip.2025.3612281
Xueting Chen, Yan Yan, Jing-Hao Xue, Chang Shu, Hanzi Wang
Recently, cross-domain few-shot facial expression recognition (CF-FER), which identifies novel compound expressions from a few images in the target domain using a model trained only on basic expressions in the source domain, has attracted increasing attention. Generally, existing CF-FER methods leverage multiple datasets to increase the diversity of the source domain and alleviate the discrepancy between the source and target domains. However, these methods learn feature embeddings in Euclidean space without considering the imbalanced expression categories and imbalanced sample difficulty in the multi-dataset setting. This makes it difficult for the model to capture hierarchical relationships among facial expressions, resulting in inferior transferable representations. To address these issues, we propose a hyperbolic self-paced multi-expert network (HSM-Net), which contains multiple mixture-of-experts (MoE) layers located in hyperbolic space, for CF-FER. Specifically, HSM-Net collaboratively trains multiple experts in a self-distillation manner, where each expert focuses on learning a subset of expression categories from the multi-dataset. Based on this, we introduce a hyperbolic self-paced learning (HSL) strategy that exploits sample difficulty to adaptively train the model from easy to hard samples, greatly reducing the influence of imbalanced expression categories and imbalanced sample difficulty. Our HSM-Net can effectively model rich hierarchical relationships of facial expressions and obtain a highly transferable feature space. Extensive experiments on both in-the-lab and in-the-wild compound expression datasets demonstrate the superiority of our proposed method over several state-of-the-art methods. Code will be released at https://github.com/cxtjl/HSM-Net.
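The easy-to-hard schedule mentioned above follows the general pattern of self-paced learning. Below is a minimal sketch of the classic hard-threshold variant, in which a growing pace parameter admits progressively harder samples; this generic scheme and its numbers are assumptions for illustration, and the paper's hyperbolic-space difficulty measure will differ.

```python
import numpy as np

def self_paced_weights(losses, lam):
    """Classic hard self-paced weighting: keep samples whose loss is below
    the pace threshold lam, ignore the rest (1/0 weights)."""
    return (losses < lam).astype(np.float32)

def train_schedule(losses_per_epoch, lam0=0.5, growth=1.3):
    """Grow the threshold each epoch so harder samples are admitted over time."""
    lam = lam0
    for epoch, losses in enumerate(losses_per_epoch):
        w = self_paced_weights(losses, lam)
        weighted_loss = float((w * losses).sum() / max(w.sum(), 1.0))
        print(f"epoch {epoch}: lam={lam:.2f}, kept {int(w.sum())}/{len(losses)}, loss={weighted_loss:.3f}")
        lam *= growth   # easy-to-hard pace

# toy usage with random per-sample losses
rng = np.random.default_rng(1)
train_schedule([rng.exponential(1.0, size=100) for _ in range(3)])
```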
Citations: 0
VisionHub: Learning Task-Plugins for Efficient Universal Vision Model.
IF 10.6 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-09-25 | DOI: 10.1109/tip.2025.3611645
Haolin Wang, Yixuan Zhu, Wenliang Zhao, Jie Zhou, Jiwen Lu
Building on the success of universal language models in natural language processing (NLP), researchers have recently sought to develop methods capable of tackling a broad spectrum of visual tasks within a unified foundation framework. However, existing universal vision models face significant challenges when adapting to the rapidly expanding scope of downstream tasks. These challenges stem not only from the prohibitive computational and storage expenses associated with training such models but also from the complexity of their workflows, which makes efficient adaptations difficult. Moreover, these models often fail to deliver the required performance and versatility for a broad spectrum of applications, largely due to their incomplete visual generation and perception capabilities, limiting their generalizability and effectiveness in diverse settings. In this paper, we present VisionHub, a novel universal vision model designed to concurrently manage multiple visual restoration and perception tasks, while offering streamlined transferability to downstream tasks. Our model leverages the frozen denoising U-Net architecture from Stable Diffusion as the backbone, fully exploiting its inherent potential for both visual restoration and perception. To further enhance the model's flexibility, we propose the incorporation of lightweight task-plugins and the task router, which are seamlessly integrated onto the U-Net backbone. This architecture enables VisionHub to efficiently handle various vision tasks according to user-provided natural language instructions, all while maintaining minimal storage costs and operational overhead. Extensive experiments across 11 different vision tasks showcase both the efficiency and effectiveness of our approach. Remarkably, VisionHub achieves competitive performance across a variety of benchmarks, including 53.3% mIoU on ADE20K semantic segmentation, 0.253 RMSE on NYUv2 depth estimation, and 74.2 AP on MS-COCO pose estimation.
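The instruction-driven dispatch can be thought of as a router that selects one lightweight task-plugin per request while the diffusion backbone stays frozen. The toy sketch below shows that control flow only; the plugin registry, task names, and keyword routing are invented for illustration and are not VisionHub's actual interface.

```python
from typing import Callable, Dict

# Hypothetical plugin registry: each plugin is a small callable that would adapt
# the frozen backbone's features to one task (stubbed here as string formatting).
PLUGINS: Dict[str, Callable[[str], str]] = {
    "segmentation": lambda x: f"segmentation map for {x}",
    "depth":        lambda x: f"depth map for {x}",
    "pose":         lambda x: f"keypoints for {x}",
}

def route(instruction: str) -> str:
    """Toy task router: map a natural-language instruction to a registered task-plugin."""
    text = instruction.lower()
    if "segment" in text:
        return "segmentation"
    if "depth" in text or "how far" in text:
        return "depth"
    if "pose" in text or "keypoint" in text:
        return "pose"
    raise ValueError(f"no plugin registered for: {instruction!r}")

task = route("Estimate the depth of this indoor scene")
print(task, "->", PLUGINS[task]("image_001.png"))
```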
Citations: 0
Revisiting RGBT Tracking Benchmarks from the Perspective of Modality Validity: A New Benchmark, Problem, and Solution
IF 10.6 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-09-25 | DOI: 10.1109/tip.2025.3611687
Zhangyong Tang, Tianyang Xu, Xiao-Jun Wu, Xuefeng Zhu, Chunyang Cheng, Zhenhua Feng, Josef Kittler
{"title":"Revisiting RGBT Tracking Benchmarks from the Perspective of Modality Validity: A New Benchmark, Problem, and Solution","authors":"Zhangyong Tang, Tianyang Xu, Xiao-Jun Wu, Xuefeng Zhu, Chunyang Cheng, Zhenhua Feng, Josef Kittler","doi":"10.1109/tip.2025.3611687","DOIUrl":"https://doi.org/10.1109/tip.2025.3611687","url":null,"abstract":"","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"31 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2025-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Deep Sparse-to-Dense Inbetweening for Multi-View Light Fields.
IF 10.6 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-09-25 | DOI: 10.1109/tip.2025.3612257
Yifan Mao, Zeyu Xiao, Ping An, Deyang Liu, Caifeng Shan
Light field (LF) imaging, which captures both intensity and directional information of light rays, extends the capabilities of traditional imaging techniques. In this paper, we introduce a new task in LF imaging, sparse-to-dense inbetweening, which focuses on generating dense novel views from sparse multi-view LFs. By synthesizing intermediate views from sparse inputs, the task enhances LF view synthesis: it fills in inter-perspective gaps within an expanded field of view and improves robustness by leveraging complementary information between light rays from different perspectives, overcoming the limitations of non-robust single-view synthesis and the inability to handle sparse inputs effectively. To address these challenges, we construct a high-quality multi-view LF dataset consisting of 60 indoor scenes and 59 outdoor scenes. Building upon this dataset, we propose a baseline method. Specifically, we introduce an adaptive alignment module to dynamically align information by capturing relative displacements. Next, we explore angular consistency and hierarchical information using a multi-level feature decoupling module. Finally, a multi-level feature refinement module is applied to enhance features and facilitate reconstruction. Additionally, we introduce a universally applicable artifact-aware loss function to effectively suppress visual artifacts. Experimental results demonstrate that our method outperforms existing approaches, establishing a benchmark for sparse-to-dense inbetweening. The code is available at https://github.com/Starmao1/MutiLF.
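The adaptive alignment module, which aligns neighboring-view features by predicted relative displacements, can be sketched as offset-based warping. The example below predicts a per-pixel offset field and resamples the neighbor features with torch.nn.functional.grid_sample; the single-convolution offset predictor and all sizes are placeholders, not the paper's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OffsetAlign(nn.Module):
    """Illustrative alignment: predict a per-pixel displacement from the concatenated
    reference/neighbor features, then warp the neighbor features toward the reference."""
    def __init__(self, ch=32):
        super().__init__()
        self.offset = nn.Conv2d(2 * ch, 2, 3, padding=1)   # predicts (dx, dy) per pixel

    def forward(self, ref_feat, nbr_feat):
        b, _, h, w = ref_feat.shape
        flow = self.offset(torch.cat([ref_feat, nbr_feat], dim=1))       # (B, 2, H, W), in pixels
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        base = torch.stack([xs, ys], dim=0).float().to(ref_feat.device)  # (2, H, W) pixel grid
        coords = base.unsqueeze(0) + flow                                # displaced sampling positions
        # normalize to [-1, 1] for grid_sample, which expects a (B, H, W, 2) grid ordered (x, y)
        grid_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
        grid_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
        grid = torch.stack([grid_x, grid_y], dim=-1)
        return F.grid_sample(nbr_feat, grid, mode="bilinear", align_corners=True)

# toy usage
m = OffsetAlign()
out = m(torch.randn(1, 32, 16, 16), torch.randn(1, 32, 16, 16))
print(out.shape)  # torch.Size([1, 32, 16, 16])
```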
Citations: 0