
Latest articles in Image and Vision Computing

Multi-level global context fusion for camouflaged object detection
IF 4.2 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-28 DOI: 10.1016/j.imavis.2026.105915
Baichuan Shen, Yan Dou, Yaolei Li, Wenjun Zhang, Xiaoyan Wang
Due to the low contrast between camouflaged objects and backgrounds, the diversity of object edge shapes, and occlusions in complex scenes, existing deep learning-based Camouflaged Object Detection (COD) methods still face significant challenges in achieving high-precision detection. These challenges include extracting multi-scale detail features for small objects, modeling global context in occluded scenarios, and accurately distinguishing the boundaries between objects and backgrounds in complex edge regions. To address these issues, this paper proposes MGCF-Net (Multi-level Global Context Fusion Network), a novel approach that integrates multi-scale context learning and feature fusion. The method employs an improved Pyramid Vision Transformer (PVTv2) as the backbone, coupled with a Cross-Scale Self-Attention (CSSA) module and a Multi-scale Fusion Attention (MFA) module. A Guided Alignment Feature Module (GAFM) aligns multi-scale features, while a large-kernel convolution structure (SHRF) enhances global context capture. Experimental results on several COD benchmark datasets show that, compared with FEDER, the second-best method overall, the proposed approach improves the structure measure, mean E-measure and weighted F-measure by 2.2%, 2.1% and 4.9% respectively, while reducing the mean absolute error (MAE) by 21.4%. It shows significant advantages in detection accuracy and generalization over several state-of-the-art (SOTA) methods. Additionally, the method generalizes well to related tasks such as polyp segmentation, COVID-19 lung infection detection, and defect detection.
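The abstract names a Cross-Scale Self-Attention (CSSA) module without spelling out its internals. The sketch below shows one plausible way such cross-scale attention can be wired, with fine-scale queries attending to coarse-scale keys and values; the class name, shapes and residual fusion are illustrative assumptions, not the published MGCF-Net design.

```python
# A minimal sketch of cross-scale attention, assuming query tokens from a
# high-resolution map and key/value tokens from a low-resolution map.
import torch
import torch.nn as nn

class CrossScaleAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, fine: torch.Tensor, coarse: torch.Tensor) -> torch.Tensor:
        # fine:   (B, C, H, W) high-resolution detail features
        # coarse: (B, C, h, w) low-resolution features carrying global context
        B, C, H, W = fine.shape
        q = self.norm_q(fine.flatten(2).transpose(1, 2))       # (B, H*W, C)
        kv = self.norm_kv(coarse.flatten(2).transpose(1, 2))   # (B, h*w, C)
        out, _ = self.attn(q, kv, kv)                          # cross-scale attention
        return fine + out.transpose(1, 2).reshape(B, C, H, W)  # residual fusion

if __name__ == "__main__":
    fine, coarse = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 16, 16)
    print(CrossScaleAttention(64)(fine, coarse).shape)  # torch.Size([2, 64, 32, 32])
```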
Citations: 0
MoMIL: Multi-order enhanced multiple instance learning for computational pathology
IF 4.2 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-28 DOI: 10.1016/j.imavis.2026.105918
Yuqi Zhang, Xiaoqian Zhang, Jiakai Wang, Baoyu Liang, Yuancheng Yang, Chao Tong
Computational pathology (CPath) has significantly advanced the clinical practice of pathology. Despite this progress, Multiple Instance Learning (MIL), a promising paradigm within CPath, continues to face challenges, especially structural fixation and incomplete information utilization. To address these limitations, we propose a novel MIL framework named Multi-order MIL (MoMIL). Our framework utilizes the SSD model to perform long-sequence modeling on multi-order WSI patches and combines lightweight feature fusion to exploit feature information more comprehensively. The framework supports the fusion of a broader range of features and is highly flexible, allowing expansion based on specific usage requirements. Additionally, we introduce a sequence transformation method specifically designed for WSIs. This method not only adapts to different WSI sizes but also captures additional feature expression, resulting in more effective exploitation of sequential cues. Extensive experiments on three downstream tasks across five datasets show that MoMIL surpasses state-of-the-art MIL methods, with improvements on all performance metrics and AUC gains of up to 0.027 for cancer subtyping. The code is available at https://github.com/YuqiZhang-Buaa/MoMIL.
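For readers unfamiliar with the MIL setting this abstract builds on, the snippet below sketches plain attention-based pooling over a bag of WSI patch features. MoMIL's SSD-based long-sequence modeling and multi-order fusion are not reproduced here, and all names and dimensions are illustrative assumptions.

```python
# A minimal sketch of attention-based MIL bag pooling for whole-slide images.
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    def __init__(self, feat_dim: int = 512, hidden: int = 128, num_classes: int = 2):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(feat_dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, 1))
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, bag: torch.Tensor) -> torch.Tensor:
        # bag: (N, feat_dim) -- N patch embeddings from one whole-slide image
        weights = torch.softmax(self.score(bag), dim=0)   # (N, 1) instance attention
        slide_embedding = (weights * bag).sum(dim=0)      # weighted bag representation
        return self.classifier(slide_embedding)           # slide-level logits

if __name__ == "__main__":
    patches = torch.randn(1000, 512)                 # e.g. 1000 patch features per slide
    print(AttentionMILPooling()(patches).shape)      # torch.Size([2])
```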
Citations: 0
SRformer: A hybrid semantic-regional transformer for indoor 3D object detection
IF 4.2 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-27 DOI: 10.1016/j.imavis.2026.105919
Kunpeng Bi, Shuang Wang, Xiangyang Jiang, Miaohui Zhang
Detection transformers have been widely applied to 3D object detection, achieving impressive results in various scenarios. However, effectively fusing regional and semantic features in query selection and cross-attention remains a challenge. This paper systematically analyzes detection transformers and proposes SRformer, a novel two-stage 3D object detector with several key designs. First, SRformer introduces a Hybrid Query Selector (HQS), which splits the first stage into a prediction branch and a sampling branch. The sampling branch is supervised by a novel hybrid query loss based on regional and semantic features, thereby selecting high-quality initial query boxes. Next, a Regional Reinforcement Attention (RRA) is introduced to enhance instance-level attention. The RRA learns a set of key points and maps their regional differences to a relative coordinate table to construct explicit instance-level regional context feature constraints, thereby modulating the cross-attention map. Additionally, a Top-K Bipartite Graph Matching (KBM) scheme is introduced to increase the number of positive samples and enhance training stability, along with a Residual-based Bounding Box Decoder (RBBD) that parameterizes the bounding box into residual components relative to predefined base sizes for more robust and precise regression. Extensive experiments on the challenging ScanNetV2 and SUN RGB-D datasets demonstrate the effectiveness and robustness of SRformer, which achieves a new state-of-the-art result on ScanNetV2, with 76.8 mAP@0.25 and 64.8 mAP@0.50.
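The Residual-based Bounding Box Decoder is described only at a high level, so the helper below illustrates the general idea of predicting box sizes as residuals relative to predefined base sizes. The log-scale parameterization, function name and example anchor are assumptions for illustration, not SRformer's exact formulation.

```python
# A hedged sketch of residual-to-absolute 3D box decoding against base sizes.
import torch

def decode_residual_boxes(center: torch.Tensor,
                          size_residual: torch.Tensor,
                          base_size: torch.Tensor) -> torch.Tensor:
    """center: (N, 3) predicted xyz centers
    size_residual: (N, 3) predicted log-scale size residuals
    base_size: (3,) or (N, 3) predefined base extents
    returns (N, 6) boxes as (cx, cy, cz, dx, dy, dz)."""
    size = base_size * torch.exp(size_residual)   # residuals scale the base size
    return torch.cat([center, size], dim=-1)

if __name__ == "__main__":
    center = torch.randn(5, 3)
    residual = torch.zeros(5, 3)              # zero residual -> exactly the base size
    base = torch.tensor([0.8, 0.8, 1.2])      # a hypothetical chair-sized base extent
    print(decode_residual_boxes(center, residual, base)[0])
```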
Citations: 0
CNN-CECA: Underwater image enhancement via CNN-driven nonlinear curve estimation and channel-wise attention in multi-color spaces
IF 4.2 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-26 DOI: 10.1016/j.imavis.2026.105916
Imran Afzal, Guo Jichang, Fazeela Siddiqui, Muhammad Fahad
High-quality underwater images are essential for marine exploration, environmental monitoring, and scientific analysis. However, they are degraded by light attenuation, scattering, and wavelength-dependent absorption, which cause color shifts, low contrast, and detail loss. Furthermore, many existing deep learning techniques function as black boxes, offering limited interpretability and often generalizing poorly across diverse underwater conditions. To address this, we propose CNN-CECA, a novel deep learning framework whose core innovation is the hybrid integration of a convolutional backbone with physically-inspired, non-linear curve estimation across multiple color spaces. A lightweight CNN adjusts brightness, contrast, and color balance, and ResNet-50 guides the analysis of polynomial, sigmoid, and exponential curves in RGB, HSV, and CIELab, enabling both global and local adaptation. A key component is our novel Triple Channel-wise Attention (TCA) module, which fuses results across the three color spaces, dynamically allocating weights to recover natural colors and delicate structures. Post-processing with contrast stretching and edge sharpening adds final refinement while preserving efficiency for real-time use. Extensive experiments on synthetic and real-world datasets (e.g., UIEB, UCCS, EUVP, and NYU-v2) demonstrate superior quantitative scores and visually faithful restorations compared with traditional and state-of-the-art methods. Ablation studies verify the contributions of curve estimation and attention. This interpretable and adaptive approach offers a robust, scalable, and efficient solution for underwater image enhancement and is broadly applicable to vision tasks supporting autonomous platforms and human operators. The approach generalizes well across scenes and varying water conditions globally.
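The abstract lists polynomial, sigmoid and exponential curves as the adjustment families analyzed across the color spaces. The minimal sketch below shows what such per-pixel intensity curves look like; the parameter values are placeholders, and the learned, multi-color-space fusion and attention of CNN-CECA are not reproduced.

```python
# Three nonlinear intensity curves applied per channel to a normalized image;
# parameter values here are illustrative assumptions, not learned values.
import torch

def polynomial_curve(x: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    # quadratic brightening/darkening curve, as used in zero-reference enhancement
    return x + alpha * x * (1.0 - x)

def sigmoid_curve(x: torch.Tensor, gain: float = 8.0, mid: float = 0.5) -> torch.Tensor:
    # S-shaped contrast stretch around a mid-tone pivot
    return torch.sigmoid(gain * (x - mid))

def exponential_curve(x: torch.Tensor, gamma: float = 0.7) -> torch.Tensor:
    # gamma-style correction for haze-lightened underwater images
    return x.clamp(min=1e-6) ** gamma

if __name__ == "__main__":
    img = torch.rand(3, 64, 64)          # normalized RGB image in [0, 1]
    for f in (polynomial_curve, sigmoid_curve, exponential_curve):
        out = f(img)
        print(f.__name__, float(out.min()), float(out.max()))
```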
Citations: 0
Enhanced medical image segmentation via synergistic feature guidance and multi-scale refinement
IF 4.2 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-20 DOI: 10.1016/j.imavis.2026.105914
Shaoqiang Wang, Guiling Shi, Xiaofeng Xu, Tiyao Liu, Yawu Zhao, Xiaochun Cheng, Yuchen Wang
Medical image segmentation is pivotal for clinical diagnosis but remains challenged by the inherent trade-off between global context modeling and local detail preservation, as well as the susceptibility of deep networks to acquisition noise and scale variations. While hybrid CNN-Transformer architectures have emerged to address receptive-field limitations, they often incur prohibitive computational costs and lack the inductive bias required for small-sample medical datasets. To resolve these systemic bottlenecks efficiently, we propose SFRNet V2. By integrating parallel local-regional perception, active noise filtration in skip connections, and elastic multi-scale aggregation at the bottleneck, our approach systematically overcomes the limitations of fixed receptive fields and feature ambiguity. Extensive experiments on four diverse public datasets (CVC-ClinicDB, ISIC 2017, TN3K, and MICCAI Tooth) demonstrate that SFRNet V2 consistently outperforms recent competitors. Notably, our model achieves the highest accuracy with only 19.85M parameters and a rapid inference time of 2.7 ms, offering a superior balance between precision and clinical deployability.
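"Active noise filtration in skip connections" suggests a learned gate that suppresses noisy encoder features before they reach the decoder. The sketch below is one simple way such a gated skip could be realized; it is an illustrative assumption, not the SFRNet V2 module.

```python
# A hedged sketch of a gated skip connection: the encoder feature is modulated
# by a learned per-pixel, per-channel gate before concatenation with the decoder.
import torch
import torch.nn as nn

class GatedSkip(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, kernel_size=1),
                                  nn.Sigmoid())

    def forward(self, skip: torch.Tensor, decoder: torch.Tensor) -> torch.Tensor:
        # skip, decoder: (B, C, H, W) at the same resolution
        g = self.gate(torch.cat([skip, decoder], dim=1))  # learned suppression gate
        return torch.cat([g * skip, decoder], dim=1)      # filtered skip + decoder

if __name__ == "__main__":
    s, d = torch.randn(1, 32, 56, 56), torch.randn(1, 32, 56, 56)
    print(GatedSkip(32)(s, d).shape)   # torch.Size([1, 64, 56, 56])
```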
Citations: 0
Compositional Gamba for 3D human pose estimation
IF 4.2 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-19 DOI: 10.1016/j.imavis.2026.105913
Lu Zhou, Yingying Chen, Jinqiao Wang
GCN (graph convolutional network) based 2D-to-3D human pose estimation has sparked a wave of research and garnered widespread attention, profiting from its strong competence in joint relation modeling. Yet performance still lags behind owing to the scarcity of universal and sophisticated human prior knowledge. Advances in state space models, notably Mamba, with its extraordinary sequential modeling ability, have proved effective for long-sequence modeling and macro knowledge acquisition. To alleviate the modeling bias of existing techniques, we advance an innovative hybrid architecture in which GCNs are married with Mamba to learn multi-level human knowledge collaboratively, an effective way to overcome the dilemma caused by the ill-posed nature of the task. Concretely, we design a compositional Gamba (GCNs-Mamba) block in which GCNs and Mamba alternately enforce local and global modeling on different feature segments. Additionally, a compositional pattern is formulated in which multi-level human topological relations are learned and explicit human priors are embedded. The proposed approach outperforms previously published works on both the Human3.6M and MPI-INF-3DHP benchmarks, attesting to the efficacy of the hybrid architecture.
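The GCNs-Mamba block is described as alternating local graph convolution with global sequence modeling over the joints. The sketch below shows that alternation pattern; a GRU stands in for the Mamba state-space layer purely to keep the example self-contained, and the adjacency matrix and dimensions are placeholders, so this is an assumption rather than the paper's design.

```python
# Alternating local (graph) and global (sequence) modeling over 17 body joints.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, dim: int, adj: torch.Tensor):
        super().__init__()
        self.register_buffer("adj", adj / adj.sum(dim=-1, keepdim=True))  # row-normalized
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, J, C) per-joint features; message passing over the skeleton graph
        return torch.relu(self.proj(self.adj @ x))

class HybridBlock(nn.Module):
    def __init__(self, dim: int, adj: torch.Tensor):
        super().__init__()
        self.local = GCNLayer(dim, adj)                    # local joint relations
        self.globl = nn.GRU(dim, dim, batch_first=True)    # global scan (Mamba stand-in)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.local(x)
        out, _ = self.globl(x)          # treats the joint sequence as a 1-D scan
        return x + out

if __name__ == "__main__":
    J = 17
    adj = torch.eye(J) + torch.rand(J, J).round()   # a dummy skeleton adjacency
    x = torch.randn(2, J, 64)
    print(HybridBlock(64, adj)(x).shape)            # torch.Size([2, 17, 64])
```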
Citations: 0
MSTPFormer: Mamba-driven spatiotemporal bidirectional dual-stream parallel transformer for 3D human pose estimation
IF 4.2 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-14 DOI: 10.1016/j.imavis.2026.105912
Tiandi Peng, Yanmin Luo, Jiancong Liang, Gonggeng Lin
Monocular 3D human pose estimation from video sequences requires effectively capturing both spatial and temporal information. However, ensuring long-term temporal consistency while maintaining accurate local motion remains a major challenge. In this paper, we present MSTPFormer, a dual-branch framework that separately models global temporal dynamics and local spatial representations for robust spatio-temporal learning. To model global motion, we design two modules based on the state space mechanism (SSM) of Mamba. The Spatial Scan Block (S-Scan) applies a bidirectional spatial scanning strategy to form closed-loop joint interactions, enhancing the representation of local motion chains. The Temporal Scan Block (T-Scan) constructs joint-specific temporal channels along the sequence, enabling individualized motion trajectory modeling for each of the 17 joints. For local modeling, we design a Transformer branch to refine spatial features within each frame, thereby enhancing the expressiveness of joint-level details. This dual-branch design enables effective decoupling and fusion of global-local and spatial-temporal cues. Experiments on Human3.6M and MPI-INF-3DHP demonstrate that MSTPFormer achieves state-of-the-art performance, with P1 errors of 37.6 mm on Human3.6M and 13.6 mm on MPI-INF-3DHP.
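The S-Scan block's bidirectional spatial scanning can be pictured as running the same sequence model forward and backward over the ordered joints and fusing the two passes. The sketch below illustrates that idea; a GRU again stands in for the Mamba SSM, an assumption made only to keep the example runnable.

```python
# A hedged sketch of a bidirectional scan over the joint sequence.
import torch
import torch.nn as nn

class BidirectionalJointScan(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.fwd = nn.GRU(dim, dim, batch_first=True)
        self.bwd = nn.GRU(dim, dim, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, joints: torch.Tensor) -> torch.Tensor:
        # joints: (B, J, C) ordered along a kinematic chain
        f, _ = self.fwd(joints)
        b, _ = self.bwd(torch.flip(joints, dims=[1]))
        b = torch.flip(b, dims=[1])                        # realign the reverse pass
        return joints + self.fuse(torch.cat([f, b], dim=-1))

if __name__ == "__main__":
    x = torch.randn(4, 17, 128)                    # batch of 4 skeletons, 17 joints
    print(BidirectionalJointScan(128)(x).shape)    # torch.Size([4, 17, 128])
```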
Citations: 0
MMDehazeNet: Cross-modality attention with feature correction and multi-scale encoding for visible-infrared dehazing
IF 4.2 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-13 DOI: 10.1016/j.imavis.2026.105896
Liangliang Duan
Haze significantly degrades image quality and impairs the performance of outdoor computer vision systems. Traditional single-image dehazing methods suffer from inherent limitations in dense haze scenarios due to the ill-posed nature of the problem. Leveraging complementary information from visible (RGB) and near-infrared (NIR) modalities offers a robust solution, as NIR signals exhibit superior penetration through atmospheric particles. This paper presents MMDehazeNet, a novel end-to-end multimodal fusion network for visible-infrared image dehazing. Adopting a U-Net-based dual-encoder architecture, it jointly processes hazy RGB and NIR images, with three key innovations: (1) a Gated Cross-Modality Attention (GCMA) module for efficient multi-level fusion; (2) a Multimodal Feature Correction (MMFC) module with a learned gating mechanism for adaptive inter-modal alignment; and (3) Multi-Scale Convolutional Layers (MSCL) for multi-receptive-field feature extraction. Three variants (MMDehazeNet-S, -B, and -L) are proposed. Extensive evaluations on the AirSim-VID, EPFL, and FANVID datasets demonstrate that MMDehazeNet achieves state-of-the-art performance. Quantitative and qualitative comparisons validate its significant superiority over existing single- and multi-modal methods, particularly under challenging medium and dense haze conditions.
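The Gated Cross-Modality Attention module is the core fusion step: RGB features query NIR features, and a learned gate controls how much NIR-derived context is injected. The sketch below shows one plausible realization; the exact GCMA design is not given in the abstract, so the shapes and gating form are assumptions.

```python
# A hedged sketch of gated cross-modality attention between RGB and NIR features.
import torch
import torch.nn as nn

class GatedCrossModalityAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, rgb: torch.Tensor, nir: torch.Tensor) -> torch.Tensor:
        # rgb: (B, C, H, W) hazy visible features; nir: (B, C, H, W) NIR features
        B, C, H, W = rgb.shape
        q = rgb.flatten(2).transpose(1, 2)                 # (B, HW, C)
        kv = nir.flatten(2).transpose(1, 2)
        fused, _ = self.attn(q, kv, kv)                    # NIR-informed context
        g = self.gate(torch.cat([q, fused], dim=-1))       # per-token injection gate
        out = q + g * fused
        return out.transpose(1, 2).reshape(B, C, H, W)

if __name__ == "__main__":
    rgb, nir = torch.randn(1, 64, 40, 40), torch.randn(1, 64, 40, 40)
    print(GatedCrossModalityAttention(64)(rgb, nir).shape)  # torch.Size([1, 64, 40, 40])
```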
Citations: 0
Disentangling co-occurrence with class-specific banks for Weakly Supervised Semantic Segmentation
IF 4.2 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-12 DOI: 10.1016/j.imavis.2025.105893
Hang Yao, Yuanchen Wu, Kequan Yang, Jide Li, Chao Yin, Zihang Li, Xiaoqiang Li
In Weakly Supervised Semantic Segmentation (WSSS), co-occurring objects often degrade the quality of Class Activation Maps (CAMs), ultimately compromising segmentation accuracy. Many recent WSSS methods leverage Contrastive Language-Image Pre-training (CLIP) by contrasting target-class images with text descriptions of background classes, thus providing additional supervision. However, these methods only rely on a shared background class set across all target classes, ignoring that each class has its own unique co-occurring objects. To resolve this limitation, this paper proposes a novel method that constructs semantically related class banks for each target class to disentangle co-occurring objects (dubbed DiCo). Specifically, DiCo first uses Large Language Models (LLMs) to generate semantically related class banks for each target class, which are further divided into negative and positive class banks to form contrastive pairs. The negative class banks include co-occurring objects related to the target class, while the positive class banks consist of the target class itself, along with its super-classes and sub-classes. By contrasting these negative and positive class banks with images through CLIP, DiCo disentangles target classes from co-occurring classes, simultaneously enhancing the semantic representations of the target class. Moreover, different classes have differential contributions to the disentanglement of co-occurring classes. DiCo introduces an adaptive weighting mechanism to adjust the contributions of co-occurring classes. Experimental results demonstrate that DiCo achieves superior performance compared to previous work on PASCAL VOC 2012 and MS COCO 2014.
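The contrastive step can be summarized as scoring a CLIP image embedding against a positive bank (the target class plus its super- and sub-classes) and a negative bank (that class's own co-occurring objects), with adaptive weights on the negatives. The helper below sketches only that scoring logic on precomputed embeddings; the bank contents, the embedding step and the weighting scheme are illustrative assumptions, not DiCo's exact formulation.

```python
# A hedged sketch of bank-based contrast with CLIP-style cosine similarities.
from typing import Optional
import torch
import torch.nn.functional as F

def bank_contrast_score(image_emb: torch.Tensor,
                        positive_bank: torch.Tensor,
                        negative_bank: torch.Tensor,
                        neg_weights: Optional[torch.Tensor] = None) -> torch.Tensor:
    """image_emb: (D,) image embedding
    positive_bank: (P, D) embeddings of the target class and its super-/sub-classes
    negative_bank: (N, D) embeddings of the class's own co-occurring objects
    neg_weights:  (N,) optional adaptive weights for the co-occurring classes."""
    img = F.normalize(image_emb, dim=-1)
    pos = F.normalize(positive_bank, dim=-1) @ img     # (P,) cosine similarities
    neg = F.normalize(negative_bank, dim=-1) @ img     # (N,)
    if neg_weights is not None:
        neg = neg * neg_weights                        # weight each co-occurring class
    # high when the target class, not its co-occurrers, explains the image
    return pos.mean() - neg.mean()

if __name__ == "__main__":
    d = 512
    score = bank_contrast_score(torch.randn(d), torch.randn(3, d), torch.randn(5, d))
    print(float(score))
```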
Citations: 0
Enhancing UAV small target detection: A balanced accuracy-efficiency algorithm with tiered feature focus
IF 4.2 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-10 DOI: 10.1016/j.imavis.2026.105897
Hanwei Guo, Shugang Liu
Small target detection in unmanned aerial vehicle (UAV) imagery is crucial for both military and civilian applications. However, achieving a balance between detection performance, efficiency, and lightweight architecture remains challenging. This paper introduces TF-DEIM-DFINE, a tiered-focus small target detection model designed specifically for UAV tasks. We propose the Convolutional Gated-Visual Mamba (CG-VIM) module to enhance global dependency capture and local detail extraction through long-sequence modeling, along with the Half-Channel Single-Head Attention (HCSA) module for global modeling, which improves fine-grained representation while reducing computational redundancy. Additionally, our Tiered Focus-Feature Pyramid Network (TF-FPN) improves the representational capability of high-frequency information in multi-scale features without significantly increasing computational overhead. Experimental results on the VisDrone dataset demonstrate a 4.7% improvement in AP_M and a 5.8% improvement in AP, with a 37% reduction in parameter count and only a 6% increase in GFLOPs, while FPS remains unchanged. These results highlight TF-DEIM-DFINE's ability to improve detection accuracy while preserving a lightweight and efficient structure.
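The "half-channel" single-head attention idea, as named, suggests applying global attention to only part of the channels to cut redundancy. The sketch below implements one such split-attend-merge layout; TF-DEIM-DFINE's actual HCSA wiring is not detailed in the abstract, so this layout is an assumption.

```python
# A hedged sketch: attend over half the channels, bypass the other half, then merge.
import torch
import torch.nn as nn

class HalfChannelSingleHeadAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        assert channels % 2 == 0
        self.attn = nn.MultiheadAttention(channels // 2, num_heads=1, batch_first=True)
        self.merge = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map from the detector neck
        B, C, H, W = x.shape
        attn_half, skip_half = x.chunk(2, dim=1)           # split channels in two
        tokens = attn_half.flatten(2).transpose(1, 2)      # (B, HW, C/2)
        tokens, _ = self.attn(tokens, tokens, tokens)      # global single-head attention
        attn_half = tokens.transpose(1, 2).reshape(B, C // 2, H, W)
        return self.merge(torch.cat([attn_half, skip_half], dim=1))

if __name__ == "__main__":
    x = torch.randn(1, 128, 20, 20)
    print(HalfChannelSingleHeadAttention(128)(x).shape)  # torch.Size([1, 128, 20, 20])
```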
Citations: 0